Configuration of a pipeline

A SnakeLines pipeline is defined solely by a configuration file. Configurations typically consist of several steps: first, define the set of samples and reference genomes to analyse; next, define where to store summary reports of the analysis; finally, adjust parameters of the individual rules in the pipeline. The configuration file therefore defines the whole analysis pipeline.

See the example at the end of this chapter.

Define the set of samples

Fastq files stored in the reads/original directory can be analysed together. To analyse only a subset of samples, the configuration must start with the samples: attribute. The attribute may contain several sample categories, each with its own reference genome and target panel. For example, you may specify that all samples with names starting with "one" are to be analysed against the hg19 genome with the "one" panel, while sureselect6 samples use a different genome with a different panel.

samples:                  # List of sample categories to be analysed
   # Category with one panel
   - name: one.*          # Regex expression of sample names to be analysed (reads/original/one.*_R1.fastq.gz)
     reference: hg19      # Reference genome for reads in the category (reference/hg19/hg19.fa)
     panel: one           # Bed file with target region coordinates (reference/hg19/annotation/one/regions.bed)

   # Another category with sureselect6 panel
   - name: sureselect6.*  # Regex expression of sample names to be analysed (reads/original/sureselect6.*_R1.fastq.gz)
     reference: hg38      # Reference genome for reads in the category (reference/hg38/hg38.fa)
     panel: sureselect6   # Bed file with target region coordinates (reference/hg38/annotation/sureselect6/regions.bed)

Specifying the mode of sequencing

SnakeLines can analyse sequencing reads from both Illumina and Nanopore sequencers, and can also analyse single-end Illumina reads. The configuration file has to contain the platform attribute to specify the source of the reads, so that SnakeLines uses the appropriate rules. When analysing Illumina reads, the read pairedness has to be further specified in the sequencing entry.
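
For example, a paired-end Illumina run could be declared with the following two top-level entries. The sequencing: paired_end value is taken from the example at the end of this chapter; the platform value illumina is an assumed illustration, so check the default configuration file for the exact supported values.

platform: illumina                  # Sequencing technology that produced the reads (assumed value; see the default configuration for supported platforms)
sequencing: paired_end              # Read pairedness of Illumina reads (as in the example at the end of this chapter)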

Using pre-analysed files

SnakeLines is designed for sequencing reads that are stored in gzipped fastq files. SnakeLines assumes that they are stored in the reads/original directory and have .fastq.gz suffixes, with paired reads additionally specifying their orientation in the following format:

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

In some cases, the original fastq files may have been analysed outside the SnakeLines framework previously, and so may no longer be available. A typical example is mapping of the reads to the reference genome outside SnakeLines, while keeping only the resulting BAM file. In such cases, the user may start an analysis, such as quality control or variant calling, from the BAM files without the need for the original fastq files. The user must, however, specify where the input files are stored, using the location attribute of the samples entry. The location must correspond to a standardised SnakeLines directory; in this case, the directory where BAM files would be generated by the implemented mappers, e.g. mapping/{reference}/original.

For example, the following configuration would generate quality control reports for all BAM files in the mapping/hg38/original directory.

samples:                            # List of sample categories to be analysed
    - name: one.*
      reference: hg38
      location: mapping/hg38/original/{}.bam

mapping:
    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - original

Report directory

Using Snakemake generally leads to a deep hierarchy of intermediate directories and files, of which typically only a few final files and reports are used for interpretation. SnakeLines therefore copies the relevant files to a separate report directory in the last step of a pipeline. The user may specify the desired output directory for reports.

report_dir: report/public/01-exome  # Generated reports and essential output files would be stored there

Email reporting

SnakeLines can send an email with a configurable short report after the completion of all tasks. To enable this functionality, the following block should be specified in the configuration file. Emails are sent either through a Gmail account (if email/setup/gmail is specified) or through the native Linux sendmail command (if not). In the case of Gmail, you have to provide the email address of the Gmail account and its password in plain text, so we recommend using an email address created specifically for this purpose. More information is inline in the configuration below, as well as in the default configuration file.

email:                              # Setup email client (will not send emails if not specified)
    setup:                          # Setup the sending
        sendto:                     # Receiver address(es)
            - sampleemail@gmail.com
        gmail:                      # Setup gmail account for sending (the emails will appear to come from this address) - if not provided, SnakeLines tries to send through the Linux "sendmail" command
            login_name: "snakelines.mailclient@gmail.com"   # gmail address for sending emails
            login_pass: "hesielko"                          # gmail password for this address
    onsuccess:                      # Setup emails to send if the analysis succeeds
        send: True                  # Send only if true
        list_files: False           # Include list of all generated files
        list_copied: False          # Include list of all copied files
    onerror:                        # Setup emails to send if the analysis fails
        send: True                  # Send only if true
        list_files: True            # Include list of files that should have been generated

Adjust rule parameters

Each step of the analysis has its own configuration that is used to parametrise the designated rule. The first attribute is method, which may be easily swapped for another supported one by changing the configuration. The remaining attributes are the set of parameters that would be passed to the designated rule.

reads:                           # Prepare reads and quality reports for downstream analysis
   preprocess:                   # Pre-process of reads, eliminate sequencing artifacts, contamination ...
      trimmed:                   # Remove low quality parts of reads
         method: trimmomatic     # Supported values: trimmomatic
         temporary: False        # If True, generated files would be removed after successful analysis
         crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
         quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
         headcrop: 20            # Number of bases to remove from the start of read
         minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

This example snippet would be roughly translated to

method_config = {'temporary': False,
                 'crop': 500,
                 'quality': 20,
                 'headcrop': 20,
                 'minlen': 35}

# Configuration would be passed to the included trimmomatic rule
include: 'rules/reads/preprocess/trimmed/trimmomatic.snake'

Skipping preprocess steps

Parts of the preprocess and postprocess pipelines may be easily skipped by removing the corresponding configuration parts. The following snippet would first trim reads, then decontaminate them, and finally deduplicate them.

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

Omitting the configuration for the decontamination step would restrict read preprocessing to the trimming and deduplication steps only.

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

Example configuration

Example for a basic variant calling pipeline.

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: example.*                # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv                 # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-variant # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bowtie2             # Supported values: bowtie2
        params: --very-sensitive    # Additional parameters for the method
        only_concordant: False      # Keep only read pairs with both mates mapped concordantly
        temporary: True

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        sorted:                     # Order reads according to their genomic position
            method: samtools        # Supported values: samtools
            temporary: True         # If True, generated files would be removed after successful analysis

        read_group:                 # Include sample name, flow cell, barcode and lanes to BAM header
            method: custom          # Supported values: custom
            temporary: True         # If True, generated files would be removed after successful analysis

        deduplicated:               # Mark duplicated reads (PCR duplicated)
            method: picard          # Supported values: picard
            temporary: True         # If True, generated files would be removed after successful analysis

        filtered:                     # Eliminate reads that do not meet conditions
            method: bamtools          # Supported values: bamtools
            min_map_quality: 20       # Minimal quality of mapping
            drop_improper_pairs: True # Eliminate reads that do not pass paired-end resolution


    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - filtered

variant:                                    # Identify variation in reads given reference genome
    caller:                                 # Method for variant identification
        method: vardict                     # Supported values: vardict
        hard_filter:                        # Variants that do not pass any of these filters would NOT be present in the VCF file
            min_nonref_allele_freq: 0.05    # Minimal proportion of reads with alternative allele against all observations
            min_alternate_count: 2          # Minimal number of reads with alternative allele
            min_map_quality: 15             # Minimal average mapping quality of reads with alternative allele
        soft_filter:                        # Failing these filters would be indicated in the FILTER field of the VCF file
            min_map_quality: 20             # Minimal average mapping quality of reads with alternative allele
            read_depth: 10                  # Minimal number of reads with alternative allele
            min_nonref_allele_freq: 0.20    # Minimal proportion of reads with alternative allele against all observations
            min_mean_base_quality: 20       # Minimal average base quality of bases that support alternative allele

    report:
        calling:
            method: gatk

        summary:
            method: custom

report:
    multiqc: True