Preprocess sequencing paired-end reads

Remove reads, or parts of reads, that are artifacts of the laboratory sequencing process. Such artifacts may distort downstream analysis and lead to incorrect conclusions. Depending on the provided configuration, preprocessing may include:

- Removal of adapters and low-quality parts of reads
- Removal of PCR duplicates
- Selection of a fixed number of reads from each sample to ensure consistency across samples
- Removal of contamination from a known genome
- Selection of only those fragments that originate from a known genome
- Merging of paired reads based on the overlap of their sequences
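
As an illustration of one of these steps, removal of PCR duplicates amounts to keeping a single copy of each fragment whose forward and reverse sequences are both identical. The sketch below is a minimal, assumed model of that idea in plain Python; `deduplicate_pairs` and the example reads are hypothetical and are not part of SnakeLines or of the deduplication tool it calls.

```python
def deduplicate_pairs(pairs):
    """Keep the first occurrence of each (forward, reverse) sequence pair."""
    seen = set()
    unique = []
    for name, forward_seq, reverse_seq in pairs:
        key = (forward_seq, reverse_seq)
        if key in seen:
            continue  # same fragment already kept, treat as a PCR duplicate
        seen.add(key)
        unique.append((name, forward_seq, reverse_seq))
    return unique


# Hypothetical example reads: (name, forward sequence, reverse sequence)
pairs = [
    ("read1", "ACGTAC", "TTGACA"),
    ("read2", "ACGTAC", "TTGACA"),  # both mates identical to read1
    ("read3", "ACGTAC", "GGGACA"),  # same forward mate, different reverse mate
]
print([name for name, _, _ in deduplicate_pairs(pairs)])  # ['read1', 'read3']
```

Note that `read3` is kept: a pair counts as a duplicate only when both mates match.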

Purpose

- Remove sequencing artifacts
- Clean up sequencing data for downstream analysis
- Avoid misinterpreting analysis results because of laboratory-induced or technical bias

Required inputs

Paired-end reads from an Illumina sequencer in gzipped FASTQ format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
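
The naming convention above can be checked before running the pipeline. The following sketch groups file names into R1/R2 pairs and keeps only samples whose name matches a sample pattern such as `example.*`. It is an assumption-laden illustration: `pair_samples` is a hypothetical helper, and whether SnakeLines matches the pattern against the whole sample name (as done here with `fullmatch`) is not stated in the source.

```python
import re


def pair_samples(filenames, sample_pattern):
    """Group <sample>_R1.fastq.gz and <sample>_R2.fastq.gz files by sample name."""
    file_regex = re.compile(r"(?P<sample>.+)_(?P<mate>R[12])\.fastq\.gz")
    name_regex = re.compile(sample_pattern)
    samples = {}
    for filename in filenames:
        match = file_regex.fullmatch(filename)
        if match is None:
            continue  # not a paired-end FASTQ file name
        if name_regex.fullmatch(match["sample"]) is None:
            continue  # sample name not selected by the pattern (assumed full match)
        samples.setdefault(match["sample"], {})[match["mate"]] = filename
    # Report only samples for which both mates are present
    return {
        sample: (mates["R1"], mates["R2"])
        for sample, mates in sorted(samples.items())
        if len(mates) == 2
    }


# Hypothetical directory listing of reads/original
files = [
    "example_A_R1.fastq.gz",
    "example_A_R2.fastq.gz",
    "example_B_R1.fastq.gz",  # R2 file is missing
    "other_R1.fastq.gz",
    "other_R2.fastq.gz",
]
print(pair_samples(files, "example.*"))
```

Here only `example_A` is reported: `example_B` lacks its R2 file and `other` does not match the pattern.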

Generated outputs

Preprocessed reads in gzipped FASTQ format.
- each sample is represented by two gzipped FASTQ files

Quality reports for the final and intermediate reads, to assess the effect of the individual preprocessing steps.

Example

How to run the example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_preprocess.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: example.*               # Regular expression matching names of samples to analyse (reads/original/example.*_R1.fastq.gz)

report_dir: report/public/01-preprocess    # Generated reports and essential output files are stored here
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files are removed after successful analysis
            crop: 500               # Maximal number of bases to keep in a read. Longer reads are truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside a sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of each read
            minlen: 35              # Minimal length of a trimmed read. Shorter reads are removed.

        decontaminated:             # Eliminate fragments from a known artificial source, e.g. human contamination
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files are removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicates)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files are removed after successful analysis

        subsampled:                 # Randomly select a subset of reads
            method: seqtk           # Supported values: seqtk
            n_reads: 10             # Number of reads to select

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated
                - subsampled
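
The `trimmed` settings above can be read as a sequence of operations on the per-base quality scores of a single read. The sketch below is a simplified model, not Trimmomatic's actual algorithm: the function `trim_qualities` is hypothetical, and the order of the steps (crop, headcrop, sliding window, minlen) is an assumption, because Trimmomatic applies steps in the order given on its command line and that order is not stated here.

```python
def trim_qualities(qualities, crop=500, headcrop=20, quality=20, window=5, minlen=35):
    """Return the kept quality scores of one read, or None if the read is dropped."""
    kept = qualities[:crop]   # crop: truncate reads longer than `crop` bases
    kept = kept[headcrop:]    # headcrop: remove bases from the start of the read
    # Sliding window: cut the read at the first window whose mean quality is too low
    for start in range(max(len(kept) - window + 1, 0)):
        if sum(kept[start:start + window]) / window < quality:
            kept = kept[:start]
            break
    # minlen: drop reads that became too short
    return kept if len(kept) >= minlen else None


# Hypothetical read: 70 good bases (Phred 35) followed by 30 poor bases (Phred 10)
long_read = [35] * 70 + [10] * 30
# Hypothetical short read: 40 good bases, too short once 20 bases are removed
short_read = [35] * 40

trimmed = trim_qualities(long_read)
print(len(trimmed) if trimmed is not None else None)
print(trim_qualities(short_read))
```

In this model the first read loses its 20 leading bases and its low-quality tail, while the second read falls below `minlen` after head cropping and is discarded.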

Planned improvements

Aggregate quality statistics across multiple samples and processing steps with MultiQC

Included pipelines

Read quality report
