Identify viruses¶
Identify paired-end reads of viral origin in metagenomic samples.
Purpose¶
- Identify specific strain of pathogenic virus in infected sample
- Assess virome composition of a sample
Required inputs¶
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
- Reference genome in fasta format
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
|-- <reference>.fa
Generated outputs¶
- Hierarchical interactive pie plot (Krona) for visual assessment of viral composition in the sample
Example¶
How to run example:
cd /usr/local/snakelines/example/genomic
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_viral_identification.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: example.* # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
report_dir: report/public/01-viral # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
deduplicated: # Remove fragments with the same sequence (PCR duplicated)
method: fastuniq # Supported values: fastuniq
temporary: False # If True, generated files would be removed after successful analysis
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
- deduplicated
assembly: # Join reads into longer sequences (contigs) based on their overlaps
assembler: # Method for joining reads
method: spades # Supported values: spades
mode: standard # Supported values: standard, meta, plasmid, rna, iontorrent
careful: True # Can not be combined with the meta mode. Tries to reduce number of mismatches and short indels, longer runtime
report: # Summary reports for assembly process and results
quality_report: # Quality of assembled contigs
method: quast # Supported values: quast
assembly_graph: # Visualisation of overlaps between assembled contigs
method: bandage # Supported values: bandage
classification: # Identify genomic source of sequenced reads
contig_based: # Find homologue sequences based on assembled contigs
method: blast # Supported values: blast
reference: # List of reference genomes to search for homology
mhv: # Name of reference genome (reference/mhv/mhv.fa)
query_type: nucleotide # Nucleotide or protein, according to sequence type in input .fa files
target_type: nucleotide # Nucleotide or protein, according to sequence type in blast database
max_target_seqs: 10 # Number of best hit reference sequences from blast database for each input sequence
viral: # Customized methods for identification of viruses
identification: # Identification of contigs with similarity to viral genomes
method: virfinder # Supported values: virfinder
report: # Summary reports of classification results
summary: # Aggregated HTML table with summarized attributes of contigs and homology
method: fasta_summary # Supported values: fasta_summary
max_query_seqs: 20000 # Maximal number of contigs to report (ordered by their length)
max_target_seqs: 5 # Maximal number of homologues from reference genomes to report
min_query_coverage: 0.01 # Show only hits that have at least this proportion of contig mapped to reference
include: # Optional attributes of contigs to report
- virfinder # Probability that contig is from virus
# - coverage # Number of aligned reads per contig
- blast # Homologues identified by Blast against specified reference databases
html: # Attributes applicable only for the HTML report, would NOT be used in the TSV table
seqs_per_page: 100 # Number of table rows (sequences) per page
sort_by: 'Sequence' # Rows would be sorted according to values in this column
sort_how: 'asc' # Values would be sorted in desc(ending) or asc(ending) order
columns: # Show only these attributes, in this order
- Sequence # Names of contigs with link to fasta files with its sequence
- Length # Number of bases
- Compress ratio # Complexity of contig sequence, may be used to filter repetitive sequences
- Coverage # Average number of reads covering each base of contig
- VirFinder pvalue # Probability that contig is from viral genome
- Homologue link # Reference sequences with homology
- Mapped reads # Number of mapped reads to contigs