Identify viruses¶

Identify paired-end reads of viral origin in metagenomic samples.

Purpose¶

Identify specific strain of pathogenic virus in infected sample
Assess virome composition of a sample

Required inputs¶

Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
Reference genome in fasta format

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.fa

Generated outputs¶

Hierarchical interactive pie plot (Krona) for visual assessment of viral composition in the sample

Example¶

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_viral_identification.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: example.*               # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)

report_dir: report/public/01-viral         # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - deduplicated

assembly:                           # Join reads into longer sequences (contigs) based on their overlaps
    assembler:                      # Method for joining reads
        method: spades              # Supported values: spades
        mode: standard              # Supported values: standard, meta, plasmid, rna, iontorrent
        careful: True               # Can not be combined with the meta mode. Tries to reduce number of mismatches and short indels, longer runtime

    report:                         # Summary reports for assembly process and results
        quality_report:             # Quality of assembled contigs
            method: quast           # Supported values: quast

        assembly_graph:             # Visualisation of overlaps between assembled contigs
            method: bandage         # Supported values: bandage

classification:                               # Identify genomic source of sequenced reads
    contig_based:                             # Find homologue sequences based on assembled contigs
        method: blast                         # Supported values: blast
        reference:                            # List of reference genomes to search for homology
            mhv:                              # Name of reference genome (reference/mhv/mhv.fa)
                query_type: nucleotide        # Nucleotide or protein, according to sequence type in input .fa files
                target_type: nucleotide       # Nucleotide or protein, according to sequence type in blast database
                max_target_seqs: 10           # Number of best hit reference sequences from blast database for each input sequence

    viral:                                    # Customized methods for identification of viruses
        identification:                       # Identification of contigs with similarity to viral genomes
            method: virfinder                 # Supported values: virfinder

    report:                                   # Summary reports of classification results
        summary:                              # Aggregated HTML table with summarized attributes of contigs and homology
            method: fasta_summary             # Supported values: fasta_summary
            max_query_seqs: 20000             # Maximal number of contigs to report (ordered by their length)
            max_target_seqs: 5                # Maximal number of homologues from reference genomes to report
            min_query_coverage: 0.01          # Show only hits that have at least this proportion of contig mapped to reference
            include:                          # Optional attributes of contigs to report
                - virfinder                   # Probability that contig is from virus
#                - coverage                   # Number of aligned reads per contig
                - blast                       # Homologues identified by Blast against specified reference databases

            html:                             # Attributes applicable only for the HTML report, would NOT be used in the TSV table
                seqs_per_page: 100            # Number of table rows (sequences) per page
                sort_by: 'Sequence'           # Rows would be sorted according to values in this column
                sort_how: 'asc'               # Values would be sorted in desc(ending) or asc(ending) order
                columns:                      # Show only these attributes, in this order
                    - Sequence                # Names of contigs with link to fasta files with its sequence
                    - Length                  # Number of bases
                    - Compress ratio          # Complexity of contig sequence, may be used to filter repetitive sequences
                    - Coverage                # Average number of reads covering each base of contig
                    - VirFinder pvalue        # Probability that contig is from viral genome
                    - Homologue link          # Reference sequences with homology
                    - Mapped reads            # Number of mapped reads to contigs

Planned improvements¶

Aggregate quality statistics of preprocess and mapping with the MultiQC
Assembly based method

Identify viruses¶

Purpose¶

Required inputs¶

Generated outputs¶

Example¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page