Configuration of a pipeline
===========================

A SnakeLines pipeline is defined solely by a configuration file.
A configuration typically consists of several steps.
First, define the set of samples and reference genomes to analyse.
Next, define where to store summary reports of the analysis.
Finally, adjust the parameters of the rules in the pipeline.
Together, these settings define the whole analysis pipeline.

See the `example <#example-configuration>`_ at the end of this chapter.


Define set of samples
---------------------

Fastq files stored in the reads/original directory can be analysed together.
To analyse only a subset of samples, the configuration must start with the `samples:` attribute.
The attribute may contain several sample categories, each with its own reference genome and target panel.
For example, you may specify that all samples with names starting with `one` are analysed against the hg19 genome with the `one` panel.
For `sureselect6` samples, you may use a different genome and panel.

.. code-block:: yaml

   samples:                  # List of sample categories to be analysed
      # Category with one panel
      - name: one.*          # Regex expression of sample names to be analysed (reads/original/one.*_R1.fastq.gz)
        reference: hg19      # Reference genome for reads in the category (reference/hg19/hg19.fa)
        panel: one           # Bed file with target region coordinates (reference/hg19/annotation/one/regions.bed)

      # Another category with sureselect6 panel
      - name: sureselect6.*  # Regex expression of sample names to be analysed (reads/original/sureselect6.*_R1.fastq.gz)
        reference: hg38      # Reference genome for reads in the category (reference/hg38/hg38.fa)
        panel: sureselect6   # Bed file with target region coordinates (reference/hg38/annotation/sureselect6/regions.bed)
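
How a category pattern selects samples can be sketched in a few lines of Python.
This is only an illustration, not the SnakeLines implementation: the helper `assign_category` is hypothetical,
and the sketch assumes that the `name` pattern must match the whole sample name and that the first matching category wins.

.. code-block:: python

   import re

   # Categories as they might look after loading the YAML above
   samples = [
       {'name': 'one.*', 'reference': 'hg19', 'panel': 'one'},
       {'name': 'sureselect6.*', 'reference': 'hg38', 'panel': 'sureselect6'},
   ]

   def assign_category(sample_name, categories):
       """Return the first category whose pattern matches the whole sample name."""
       for category in categories:
           if re.fullmatch(category['name'], sample_name):
               return category
       return None  # Sample is not covered by any category

   # Hypothetical sample names, derived from reads/original/<sample>_R1.fastq.gz
   for sample in ['one_a', 'sureselect6_b', 'other_c']:
       category = assign_category(sample, samples)
       print(sample, category['reference'] if category else 'not analysed')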

Using pre-analysed files
~~~~~~~~~~~~~~~~~~~~~~~~

SnakeLines is designed for paired-end sequencing reads that are stored in pairs of corresponding gzipped fastq files.
SnakeLines assumes that they are stored in the reads/original directory and have _R1.fastq.gz and _R2.fastq.gz suffixes, e.g.

::

   |-- reads/original
           |-- <sample_1>_R1.fastq.gz
           |-- <sample_1>_R2.fastq.gz
           |-- <sample_2>_R1.fastq.gz
           |-- <sample_2>_R2.fastq.gz
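
The naming convention above can be checked with a short Python sketch.
The helper `paired_samples` is hypothetical and works on a plain list of file names;
it only illustrates the expected suffixes and is not part of SnakeLines.

.. code-block:: python

   import re
   from collections import defaultdict

   def paired_samples(filenames):
       """Group fastq file names by sample and keep samples with both mates."""
       mates = defaultdict(set)
       for filename in filenames:
           match = re.fullmatch(r'(.+)_(R[12])\.fastq\.gz', filename)
           if match:
               sample, mate = match.groups()
               mates[sample].add(mate)
       # Only samples with both _R1 and _R2 files form a complete pair
       return sorted(sample for sample, found in mates.items() if found == {'R1', 'R2'})

   # Hypothetical content of the reads/original directory
   files = ['a_R1.fastq.gz', 'a_R2.fastq.gz', 'b_R1.fastq.gz', 'notes.txt']
   print(paired_samples(files))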

In some cases, the original fastq files have already been processed outside the SnakeLines framework and are no longer available.
A typical example is mapping the reads to a reference genome outside SnakeLines while keeping only the resulting BAM file.
In such cases, the user may start the analysis, such as quality control or variant calling, from the BAM files without the original fastq files.
The user must, however, specify where the input files are stored using the `location` attribute of a `samples` category.
The location must correspond to a standardised SnakeLines directory.
In this case, it is the directory where BAM files would be generated by the implemented mappers, e.g. mapping/{reference}/original.

For example, the following configuration would generate quality control reports for the BAM files matching `one.*` in the mapping/hg38/original directory.

.. code-block:: yaml

   samples:                            # List of sample categories to be analysed
       - name: one.*
         reference: hg38
         location: mapping/hg38/original/{}.bam

   mapping:
       report:                         # Summary reports of mapping process and results
           quality_report:             # HTML summary with quality of mappings
               method: qualimap        # Supported values: qualimap
               map_types:              # List of post-process steps for quality reports
                   - original
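
The role of the `location` template can be illustrated with a small Python sketch.
It assumes that the `{}` placeholder stands for the sample name; the helper `sample_from_path` is hypothetical and is not taken from the SnakeLines code.

.. code-block:: python

   import re

   category = {'name': 'one.*', 'reference': 'hg38',
               'location': 'mapping/hg38/original/{}.bam'}

   def sample_from_path(path, category):
       """Return the sample name if the path fits the location template and name pattern."""
       prefix, suffix = category['location'].split('{}')
       pattern = re.escape(prefix) + '(' + category['name'] + ')' + re.escape(suffix)
       match = re.fullmatch(pattern, path)
       return match.group(1) if match else None

   # The template expands to a concrete path for a given sample name
   print(category['location'].format('one_a'))

   # Conversely, a path can be checked against the template
   print(sample_from_path('mapping/hg38/original/one_a.bam', category))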

Report directory
----------------

Using Snakemake generally leads to a wide hierarchy of intermediate directories and files.
Typically, only a few final files and reports are used for interpretation.
SnakeLines therefore copies the relevant files to a separate `report` directory in the last step of a pipeline.
The user may define where to put the output reports.

.. code-block:: yaml

   report_dir: report/public/01-exome  # Generated reports and essential output files would be stored there


Email reporting
---------------

SnakeLines can send an email with a short, slightly configurable report after all tasks have completed.
To enable this functionality, specify the following block in the configuration file.
Emails are sent either through a `gmail` account (if email/setup/gmail is specified) or through the native Linux `sendmail` command (if not).
For gmail, you have to provide the address of the gmail account and its password in plain text,
so we recommend using an email address dedicated to this purpose. More information is given inline in the configuration below and in the default configuration file.

.. code-block:: yaml

    email:                              # Setup email client (will not send emails if not specified)
        setup:                          # Setup the sending
            sendto:                     # Receiver address(es)
                - sampleemail@gmail.com
            gmail:                      # Setup gmail account for sending (the emails will appear to come from this address) - if not provided, emails are sent through the linux "sendmail" command
                login_name: "snakelines.mailclient@gmail.com"   # gmail address for sending emails
                login_pass: "hesielko"                          # gmail password for this address
        onsuccess:                      # Setup emails to send if the analysis succeeds
            send: True                  # Send only if true
            list_files: False           # Include list of all generated files
            list_copied: False          # Include list of all copied files
        onerror:                        # Setup emails to send if the analysis failed
            send: True                  # Send only if true
            list_files: True            # Include list of files that should have been generated
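
The decisions encoded in this block can be sketched in Python as follows.
This is a rough illustration under stated assumptions, not the SnakeLines implementation:
the helpers `choose_client` and `build_message`, as well as the subject and body texts, are hypothetical, and nothing is actually sent.

.. code-block:: python

   from email.message import EmailMessage

   # The configuration above as a dictionary (password omitted on purpose)
   email = {
       'setup': {'sendto': ['sampleemail@gmail.com'],
                 'gmail': {'login_name': 'snakelines.mailclient@gmail.com'}},
       'onsuccess': {'send': True, 'list_files': False, 'list_copied': False},
       'onerror': {'send': True, 'list_files': True},
   }

   def choose_client(email_config):
       """Use gmail if email/setup/gmail is specified, sendmail otherwise."""
       return 'gmail' if 'gmail' in email_config.get('setup', {}) else 'sendmail'

   def build_message(email_config, succeeded, files):
       """Compose a report message, or return None if sending is disabled."""
       section = email_config['onsuccess' if succeeded else 'onerror']
       if not section.get('send', False):
           return None
       message = EmailMessage()
       message['To'] = ', '.join(email_config['setup']['sendto'])
       message['Subject'] = 'Analysis finished' if succeeded else 'Analysis failed'
       body = [message['Subject']]
       if section.get('list_files', False):
           body.extend(files)  # Append the file list only when requested
       message.set_content('\n'.join(body))
       return message

   print(choose_client(email))
   print(build_message(email, False, ['mapping/x.bam']).get_content())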


Adjust rules parameters
-----------------------

Each step of the analysis has its own configuration that is used to parametrize the designated rule.
The first attribute is the method, which may easily be swapped for another supported one by changing the configuration.
The remaining attributes are parameters that are passed to the designated rule.


.. code-block:: yaml

   reads:                           # Prepare reads and quality reports for downstream analysis
      preprocess:                   # Pre-process of reads, eliminate sequencing artifacts, contamination ...
         trimmed:                   # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

This example snippet would roughly translate to:


.. code-block:: python

   method_config = {'temporary': False,
                    'crop': 500,
                    'quality': 20,
                    'headcrop': 20,
                    'minlen': 35}

   # Configuration would be passed to the included trimmomatic rule
   # (the include statement is Snakefile syntax, not plain Python)
   include: 'rules/reads/preprocess/trimmed/trimmomatic.snake'
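
The translation from the nested configuration to a rule file and its parameters can be sketched in plain Python.
The helper `split_step` is hypothetical; it only mirrors the path pattern visible in the snippet above and assumes that every attribute except `method` is a rule parameter.

.. code-block:: python

   config = {'reads': {'preprocess': {'trimmed': {
       'method': 'trimmomatic', 'temporary': False, 'crop': 500,
       'quality': 20, 'headcrop': 20, 'minlen': 35}}}}

   def split_step(step_config, step_path):
       """Separate the method name from its parameters and build the rule path."""
       method_config = {key: value for key, value in step_config.items()
                        if key != 'method'}
       rule_file = 'rules/{}/{}.snake'.format(step_path, step_config['method'])
       return rule_file, method_config

   rule_file, method_config = split_step(config['reads']['preprocess']['trimmed'],
                                         'reads/preprocess/trimmed')
   print(rule_file)
   print(method_config)

Swapping the method then changes only the rule file that is included, while the remaining attributes are handed over unchanged.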

Skipping preprocess steps
-------------------------

Parts of the preprocess and postprocess steps may easily be skipped by removing the corresponding parts of the configuration.
This snippet would first trim reads, then decontaminate them, and finally deduplicate them.

.. code-block:: yaml

   reads:                              # Prepare reads and quality reports for downstream analysis
       preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

           trimmed:                    # Remove low quality parts of reads
               method: trimmomatic     # Supported values: trimmomatic
               temporary: False        # If True, generated files would be removed after successful analysis
               crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
               quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
               headcrop: 20            # Number of bases to remove from the start of read
               minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

           decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
               method: bowtie2         # Supported values: bowtie2
               temporary: False        # If True, generated files would be removed after successful analysis
               references:             # List of reference genomes
                   - mhv
               keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

           deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
               method: fastuniq        # Supported values: fastuniq
               temporary: False        # If True, generated files would be removed after successful analysis


Omitting the configuration for the decontamination step restricts read preprocessing to the trimming and deduplication steps only.

.. code-block:: yaml

   reads:                              # Prepare reads and quality reports for downstream analysis
       preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

           trimmed:                    # Remove low quality parts of reads
               method: trimmomatic     # Supported values: trimmomatic
               temporary: False        # If True, generated files would be removed after successful analysis
               crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
               quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
               headcrop: 20            # Number of bases to remove from the start of read
               minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

           deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
               method: fastuniq        # Supported values: fastuniq
               temporary: False        # If True, generated files would be removed after successful analysis
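
The effect of removing a block can be sketched in Python.
The helper `enabled_steps` is hypothetical, and the sketch assumes that the steps run in the order in which they are listed in the configuration, as in the examples above.

.. code-block:: python

   def enabled_steps(config):
       """Return the preprocess steps present in the configuration, in listed order."""
       return list(config.get('reads', {}).get('preprocess', {}))

   full = {'reads': {'preprocess': {
       'trimmed': {'method': 'trimmomatic'},
       'decontaminated': {'method': 'bowtie2'},
       'deduplicated': {'method': 'fastuniq'}}}}

   # Removing the block of a step is all that is needed to skip it
   reduced = {'reads': {'preprocess': {
       step: params for step, params in full['reads']['preprocess'].items()
       if step != 'decontaminated'}}}

   print(enabled_steps(full))
   print(enabled_steps(reduced))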


Example configuration
---------------------

Example configuration for a basic variant calling pipeline.

.. literalinclude:: ../../example/mhv/config_variant_calling.yaml
   :language: yaml