Call variants¶

Identify alternations of a genomic material of a sequenced individual with respect to a model, reference genome. That includes small point mutations (SNPs), small insertions and deletions.

Purpose¶

Identify genomic mutation that causes phenotypic trait of interest
Identify genomic differences between close species
Important for various genetic tests
- inherited diseases
- de novo mutations
- oncological diseases - screening and assessing its type

Required inputs¶

Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
reference genome in fasta format

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.fa

Optional inputs¶

GATK ready dbSNP database in bgzipped VCF format
- if the bgzipped database file has gz suffix it must be changed to bgz
- required for base quality recalibration step
- required for variant calling report
- optional for GATK-HC variant caller
tabix index of the dbSNP database file

|-- reference/<reference>/variants
        |-- dbsnp.vcf.bgz
        |-- dbsnp.vcf.bgz.tbi

Generated outputs¶

List of identified variants in VCF file, filtered by user-defined criteria
Summary PDF report to assess quality of reads, mapping and variant calling

Example¶

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_variant_calling.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: example.*                # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv                 # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-variant # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bowtie2             # Supported values: bowtie2
        params: --very-sensitive    # Additional parameters for the method
        only_concordant: False      # Keep only reads with consistently mapped reads from both paired-ends
        temporary: True

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        sorted:                     # Order reads according to their genomic position
            method: samtools        # Supported values: samtools
            temporary: True         # If True, generated files would be removed after successful analysis

        read_group:                 # Include sample name, flow cell, barcode and lanes to BAM header
            method: custom          # Supported values: custom
            temporary: True         # If True, generated files would be removed after successful analysis

        deduplicated:               # Mark duplicated reads (PCR duplicated)
            method: picard          # Supported values: picard
            temporary: True         # If True, generated files would be removed after successful analysis

        filtered:                     # Eliminate reads that do not meet conditions
            method: bamtools          # Supported values: bamtools
            min_map_quality: 20       # Minimal quality of mapping
            drop_improper_pairs: True # Eliminate reads that do not pass paired-end resolution


    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - filtered

variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: vardict                     # Supported values: vardict
        hard_filter:                        # Variants that do not pass any of these filters would NOT be present in the VCF file
            min_nonref_allele_freq: 0.05    # Minimal proportion of reads with alternative allele against all observations
            min_alternate_count: 2          # Minimal number of reads with alternative allele
            min_map_quality: 15             # Minimal average mapping quality of reads with alternative allele
        soft_filter:                        # Failing these filters would be indicated in the FILTER field of the VCF file
            min_map_quality: 20             # Minimal average mapping quality of reads with alternative allele
            read_depth: 10                  # Minimal number of reads with alternative allele
            min_nonref_allele_freq: 0.20    # Minimal proportion of reads with alternative allele against all observations
            min_mean_base_quality: 20       # Minimal average base quality of bases that support alternative allele

    report:
        calling:
            method: gatk

        summary:
            method: custom
variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: vardict                     # Supported values: vardict
        hard_filter:                        # Variants that do not pass any of these filters would NOT be present in the VCF file
            min_nonref_allele_freq: 0.05    # Minimal proportion of reads with alternative allele against all observations
            min_alternate_count: 2          # Minimal number of reads with alternative allele
            min_map_quality: 15             # Minimal average mapping quality of reads with alternative allele
        soft_filter:                        # Failing these filters would be indicated in the FILTER field of the VCF file
            min_map_quality: 20             # Minimal average mapping quality of reads with alternative allele
            read_depth: 10                  # Minimal number of reads with alternative allele
            min_nonref_allele_freq: 0.20    # Minimal proportion of reads with alternative allele against all observations
            min_mean_base_quality: 20       # Minimal average base quality of bases that support alternative allele


report:
    multiqc: True

Planned improvements¶

Call variants only in pre-defined regions (BED file)
Aggregate quality statistics of preprocess and mapping with the MultiQC
Annotation of variants
Filtering of variants based on external annotations
Interpretation of variants

Call variants¶

Purpose¶

Required inputs¶

Optional inputs¶

Generated outputs¶

Example¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page