Variant detection in SARS-CoV-2

Process paired-end reads from the Illumina COVIDSeq protocol to characterize sequenced SARS-CoV-2 virus.

Purpose

  • detect genomic variants
  • assign the sequenced SARS-CoV-2 into COVID clade
  • make consensus sequence of the sequenced virus

Required inputs

  • Sequences of primers used in PCR amplicon based enrichment (fasta format, artic.fa)
  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
  • Reference genome in fasta format
|-- description/adapters
        |-- artic.fa
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.fa

Generated outputs

  • Consensus sequence of the sequenced SARS-CoV-2 virus
  • Detected genomic variants
  • Clade assignment

Example

How to run example:

Copy Fastq files from the Illumina COVIDSeq sequencing into /usr/local/snakelines/example/covidseq/reads/original. Real-world examples can be also freely downloaded from the ENA portal.

cd /usr/local/snakelines/example/covidseq

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_covidseq.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: .*                       # Regex expression of sample names to be analysed (reads/original/.*_R1.fastq.gz)
      reference: sars_cov_2          # Reference genome for reads in the category (reference/sars_cov_2/sars_cov_2.fa)

report_dir: report/public/01-example # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: cutadapt
            temporary: True
            fasta: reference/sars_cov_2/adapters/artic_v4.fa
            minimum_length: 45      # shorter reads are discarded
            quality_cutoff: 20,20   # cutoff on 3' end ("20") or cutoff on both 5' and 3' ends ("20,20")
            length: 500             # truncate reads to this length
            head_cut: 15            # number of bases that are removed from the beginning of each read
            tail_cut: 0             # number of bases are removed from the end of each read

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: True         # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - grch38_decoy
            keep: False              # Keep reads mapped to references (True) or remove them as contamination (False)

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bwa                 # Supported values: bowtie2, bwa
        temporary: True

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        sorted:
            method: samtools
            temporary: True

        deduplicated:               # Mark duplicated reads (PCR duplicated)
            method: picard          # Supported values: picard

    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - deduplicated

variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: bcftools
        min_base_quality: 0           # Minimum base quality for a base to be considered. [0]
        min_mapping_quality: 13       # Minimum mapping quality for an alignment to be used. [13]
        keep_alts: True               # Report all alleles, not only those which are in called genotypes. [False]
        haploid: False                 # Set predefined ploidy. [False]
        redo_baq: True                # Recalculate base alignment quality on the fly [False]
        adjust_mq: 50                 # Coefficient for downgrading mapping quality [0]
        gap_frac: 0.05                # Minimum fraction of gapped reads [0.002]

    postprocess:
        normalized:
            method: bcftools
            multiallelic: True

        filtered:
            method: bcftools
            snp_gap: 3
            indel_gap: 10
            set_gts: '.'
            include: 'INFO/DP>=5 & GT="alt" & (DP4[2]+DP4[3])/INFO/DP>=0.50'
            trim_alt_alleles: True

consensus:                              # consensus sequence
    fasta:                              # create consesnus sequence based on vcf of a sample and reference fasta
        method: bcftools                # Allowed: 'bcftools', 'ivar' (consensus per ref regions), 'ivar_nonregional'
        mask_lte_coverage: 4            # Valid for 'bcftools'. Mask bases with N where coverage is lower than or equal to parameter.
        count_orphans: False            # Valid for ivar. True or False (def: False), whether to count anomalous read pairs in variant calling
        no_base_alignment_quality: True # Valid for ivar. True or False (def: False), on True disable base alignment quality.
        max_depth: 8000                 # Valid for ivar. Int (def: 8000), limits the number of reads to load in memory per input file. Setting to 0 removes the limit.
        min_mapping_quality: 0          # Valid for ivar. Int (def: 0), minimum mapping quality for an alignment to be used.
        min_base_quality: 13            # Valid for ivar. Int (def: 13), minimum base quality for a base to be considered. Setting to 0 make the overlapping bases reappear, albeit with 0 quality.
        ivar_quality_threshold: 20      # Valid for ivar. Int, (def: 20), minimum quality score threshold to count base.
        ivar_frequency_threshold: 0     # Valid for ivar. Float, range [0-1] (default: 0), minimum frequency threshold to call consensus.
        ivar_min_depth: 10              # Valid for ivar. Int (def: 10), minimum depth to call consensus.

    summary:
        method: cat                 # concatenate consensus sequences from the samples into a single file

lineage:
    caller:
        method: pangolin
    summary:
        method: cat
    nextclade:
        method: nextclade
        reference: sars-cov-2

misc:
    mixed_positions:
        count:
            method: custom

        summary:
            method: custom
            min_coverage: 9

report:
    multiqc: True

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Assembly based method