Variant detection in SARS-CoV-2¶
Process paired-end reads from the Illumina COVIDSeq protocol to characterize sequenced SARS-CoV-2 virus.
Purpose¶
- detect genomic variants
- assign the sequenced SARS-CoV-2 into COVID clade
- make consensus sequence of the sequenced virus
Required inputs¶
- Sequences of primers used in PCR amplicon based enrichment (fasta format, artic.fa)
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
- Reference genome in fasta format
|-- description/adapters
|-- artic.fa
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
|-- <reference>.fa
Generated outputs¶
- Consensus sequence of the sequenced SARS-CoV-2 virus
- Detected genomic variants
- Clade assignment
Example¶
How to run example:
Copy Fastq files from the Illumina COVIDSeq sequencing into /usr/local/snakelines/example/covidseq/reads/original. Real-world examples can be also freely downloaded from the ENA portal.
cd /usr/local/snakelines/example/covidseq
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_covidseq.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: .* # Regex expression of sample names to be analysed (reads/original/.*_R1.fastq.gz)
reference: sars_cov_2 # Reference genome for reads in the category (reference/sars_cov_2/sars_cov_2.fa)
report_dir: report/public/01-example # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: cutadapt
temporary: True
fasta: reference/sars_cov_2/adapters/artic_v4.fa
minimum_length: 45 # shorter reads are discarded
quality_cutoff: 20,20 # cutoff on 3' end ("20") or cutoff on both 5' and 3' ends ("20,20")
length: 500 # truncate reads to this length
head_cut: 15 # number of bases that are removed from the beginning of each read
tail_cut: 0 # number of bases are removed from the end of each read
decontaminated: # Eliminate fragments from known artificial source, e.g. contamination by human
method: bowtie2 # Supported values: bowtie2
temporary: True # If True, generated files would be removed after successful analysis
references: # List of reference genomes
- grch38_decoy
keep: False # Keep reads mapped to references (True) or remove them as contamination (False)
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
mapping: # Find the most similar genomic region to reads in reference (mapping process)
mapper: # Method for mapping
method: bwa # Supported values: bowtie2, bwa
temporary: True
index: # Generate .bai index for mapped reads in .bam files
method: samtools # Supported values: samtools
postprocess: # Successive steps to refine mapped reads
sorted:
method: samtools
temporary: True
deduplicated: # Mark duplicated reads (PCR duplicated)
method: picard # Supported values: picard
report: # Summary reports of mapping process and results
quality_report: # HTML summary with quality of mappings
method: qualimap # Supported values: qualimap
map_types: # List of post-process steps for quality reports
- deduplicated
variant: # Identify variation in reads given reference genome
caller: # Method for variant identification
method: bcftools
min_base_quality: 0 # Minimum base quality for a base to be considered. [0]
min_mapping_quality: 13 # Minimum mapping quality for an alignment to be used. [13]
keep_alts: True # Report all alleles, not only those which are in called genotypes. [False]
haploid: False # Set predefined ploidy. [False]
redo_baq: True # Recalculate base alignment quality on the fly [False]
adjust_mq: 50 # Coefficient for downgrading mapping quality [0]
gap_frac: 0.05 # Minimum fraction of gapped reads [0.002]
postprocess:
normalized:
method: bcftools
multiallelic: True
filtered:
method: bcftools
snp_gap: 3
indel_gap: 10
set_gts: '.'
include: 'INFO/DP>=5 & GT="alt" & (DP4[2]+DP4[3])/INFO/DP>=0.50'
trim_alt_alleles: True
consensus: # consensus sequence
fasta: # create consesnus sequence based on vcf of a sample and reference fasta
method: bcftools # Allowed: 'bcftools', 'ivar' (consensus per ref regions), 'ivar_nonregional'
mask_lte_coverage: 4 # Valid for 'bcftools'. Mask bases with N where coverage is lower than or equal to parameter.
count_orphans: False # Valid for ivar. True or False (def: False), whether to count anomalous read pairs in variant calling
no_base_alignment_quality: True # Valid for ivar. True or False (def: False), on True disable base alignment quality.
max_depth: 8000 # Valid for ivar. Int (def: 8000), limits the number of reads to load in memory per input file. Setting to 0 removes the limit.
min_mapping_quality: 0 # Valid for ivar. Int (def: 0), minimum mapping quality for an alignment to be used.
min_base_quality: 13 # Valid for ivar. Int (def: 13), minimum base quality for a base to be considered. Setting to 0 make the overlapping bases reappear, albeit with 0 quality.
ivar_quality_threshold: 20 # Valid for ivar. Int, (def: 20), minimum quality score threshold to count base.
ivar_frequency_threshold: 0 # Valid for ivar. Float, range [0-1] (default: 0), minimum frequency threshold to call consensus.
ivar_min_depth: 10 # Valid for ivar. Int (def: 10), minimum depth to call consensus.
summary:
method: cat # concatenate consensus sequences from the samples into a single file
lineage:
caller:
method: pangolin
summary:
method: cat
nextclade:
method: nextclade
reference: sars-cov-2
misc:
mixed_positions:
count:
method: custom
summary:
method: custom
min_coverage: 9
report:
multiqc: True