Call variants

Identify alternations of a genomic material of a sequenced individual with respect to a model, reference genome. That includes small point mutations (SNPs), small insertions and deletions.

Purpose

  • Identify genomic mutation that causes phenotypic trait of interest

  • Identify genomic differences between close species

  • Important for various genetic tests

    • inherited diseases
    • de novo mutations
    • oncological diseases - screening and assessing its type

Required inputs

  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.

    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
  • reference genome in fasta format

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.fa

Optional inputs

  • GATK ready dbSNP database in bgzipped VCF format

    • if the bgzipped database file has gz suffix it must be changed to bgz
    • required for base quality recalibration step
    • required for variant calling report
    • optional for GATK-HC variant caller
  • tabix index of the dbSNP database file

|-- reference/<reference>/variants
        |-- dbsnp.vcf.bgz
        |-- dbsnp.vcf.bgz.tbi

Generated outputs

  • List of identified variants in VCF file, filtered by user-defined criteria
  • Summary PDF report to assess quality of reads, mapping and variant calling

Example

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_variant_calling.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: example.*                # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv                 # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-variant # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bowtie2             # Supported values: bowtie2
        params: --very-sensitive    # Additional parameters for the method
        only_concordant: False      # Keep only reads with consistently mapped reads from both paired-ends
        temporary: True

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        sorted:                     # Order reads according to their genomic position
            method: samtools        # Supported values: samtools
            temporary: True         # If True, generated files would be removed after successful analysis

        read_group:                 # Include sample name, flow cell, barcode and lanes to BAM header
            method: custom          # Supported values: custom
            temporary: True         # If True, generated files would be removed after successful analysis

        deduplicated:               # Mark duplicated reads (PCR duplicated)
            method: picard          # Supported values: picard
            temporary: True         # If True, generated files would be removed after successful analysis

        filtered:                     # Eliminate reads that do not meet conditions
            method: bamtools          # Supported values: bamtools
            min_map_quality: 20       # Minimal quality of mapping
            drop_improper_pairs: True # Eliminate reads that do not pass paired-end resolution


    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - filtered

variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: vardict                     # Supported values: vardict
        hard_filter:                        # Variants that do not pass any of these filters would NOT be present in the VCF file
            min_nonref_allele_freq: 0.05    # Minimal proportion of reads with alternative allele against all observations
            min_alternate_count: 2          # Minimal number of reads with alternative allele
            min_map_quality: 15             # Minimal average mapping quality of reads with alternative allele
        soft_filter:                        # Failing these filters would be indicated in the FILTER field of the VCF file
            min_map_quality: 20             # Minimal average mapping quality of reads with alternative allele
            read_depth: 10                  # Minimal number of reads with alternative allele
            min_nonref_allele_freq: 0.20    # Minimal proportion of reads with alternative allele against all observations
            min_mean_base_quality: 20       # Minimal average base quality of bases that support alternative allele

    report:
        calling:
            method: gatk

        summary:
            method: custom
variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: vardict                     # Supported values: vardict
        hard_filter:                        # Variants that do not pass any of these filters would NOT be present in the VCF file
            min_nonref_allele_freq: 0.05    # Minimal proportion of reads with alternative allele against all observations
            min_alternate_count: 2          # Minimal number of reads with alternative allele
            min_map_quality: 15             # Minimal average mapping quality of reads with alternative allele
        soft_filter:                        # Failing these filters would be indicated in the FILTER field of the VCF file
            min_map_quality: 20             # Minimal average mapping quality of reads with alternative allele
            read_depth: 10                  # Minimal number of reads with alternative allele
            min_nonref_allele_freq: 0.20    # Minimal proportion of reads with alternative allele against all observations
            min_mean_base_quality: 20       # Minimal average base quality of bases that support alternative allele


report:
    multiqc: True

Planned improvements

  • Call variants only in pre-defined regions (BED file)
  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Annotation of variants
  • Filtering of variants based on external annotations
  • Interpretation of variants