Identify viruses

Identify paired-end reads of viral origin in metagenomic samples.

Purpose

  • Identify specific strain of pathogenic virus in infected sample
  • Assess virome composition of a sample

Required inputs

  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
  • Reference genome in fasta format
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.fa

Generated outputs

  • Hierarchical interactive pie plot (Krona) for visual assessment of viral composition in the sample

Example

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_viral_identification.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: example.*               # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)

report_dir: report/public/01-viral         # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - deduplicated

assembly:                           # Join reads into longer sequences (contigs) based on their overlaps
    assembler:                      # Method for joining reads
        method: spades              # Supported values: spades
        mode: standard              # Supported values: standard, meta, plasmid, rna, iontorrent
        careful: True               # Can not be combined with the meta mode. Tries to reduce number of mismatches and short indels, longer runtime

    report:                         # Summary reports for assembly process and results
        quality_report:             # Quality of assembled contigs
            method: quast           # Supported values: quast

        assembly_graph:             # Visualisation of overlaps between assembled contigs
            method: bandage         # Supported values: bandage

classification:                               # Identify genomic source of sequenced reads
    contig_based:                             # Find homologue sequences based on assembled contigs
        method: blast                         # Supported values: blast
        reference:                            # List of reference genomes to search for homology
            mhv:                              # Name of reference genome (reference/mhv/mhv.fa)
                query_type: nucleotide        # Nucleotide or protein, according to sequence type in input .fa files
                target_type: nucleotide       # Nucleotide or protein, according to sequence type in blast database
                max_target_seqs: 10           # Number of best hit reference sequences from blast database for each input sequence

    viral:                                    # Customized methods for identification of viruses
        identification:                       # Identification of contigs with similarity to viral genomes
            method: virfinder                 # Supported values: virfinder

    report:                                   # Summary reports of classification results
        summary:                              # Aggregated HTML table with summarized attributes of contigs and homology
            method: fasta_summary             # Supported values: fasta_summary
            max_query_seqs: 20000             # Maximal number of contigs to report (ordered by their length)
            max_target_seqs: 5                # Maximal number of homologues from reference genomes to report
            min_query_coverage: 0.01          # Show only hits that have at least this proportion of contig mapped to reference
            include:                          # Optional attributes of contigs to report
                - virfinder                   # Probability that contig is from virus
#                - coverage                   # Number of aligned reads per contig
                - blast                       # Homologues identified by Blast against specified reference databases

            html:                             # Attributes applicable only for the HTML report, would NOT be used in the TSV table
                seqs_per_page: 100            # Number of table rows (sequences) per page
                sort_by: 'Sequence'           # Rows would be sorted according to values in this column
                sort_how: 'asc'               # Values would be sorted in desc(ending) or asc(ending) order
                columns:                      # Show only these attributes, in this order
                    - Sequence                # Names of contigs with link to fasta files with its sequence
                    - Length                  # Number of bases
                    - Compress ratio          # Complexity of contig sequence, may be used to filter repetitive sequences
                    - Coverage                # Average number of reads covering each base of contig
                    - VirFinder pvalue        # Probability that contig is from viral genome
                    - Homologue link          # Reference sequences with homology
                    - Mapped reads            # Number of mapped reads to contigs

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Assembly based method