Preprocess sequencing reads
Remove reads, or parts of reads, that are artifacts of the laboratory sequencing process. Such artifacts can distort downstream analysis and lead to incorrect conclusions. Depending on the provided configuration, preprocessing may include:
- Removal of adapters and low-quality parts of reads
- Removal of PCR duplicates
- Selection of a fixed number of reads from each sample to ensure consistency across samples
- Removal of contamination from a known genome
- Selection of only those fragments that come from a known genome
- Merging of paired reads based on the overlap of their sequences
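To make a few of these steps concrete, the sketch below applies quality trimming, duplicate removal, and subsampling to reads held in memory. It is a minimal illustration and not how the pipeline implements these steps: the function names, the Phred+33 encoding, the threshold of 20, and the exact-match definition of a duplicate are all assumptions made for the example.

```python
import random

PHRED_OFFSET = 33  # assumption: qualities use the Phred+33 encoding


def trim_low_quality_tail(seq, qual, min_phred=20):
    """Trim bases from the 3' end until the last base reaches min_phred."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_phred:
        end -= 1
    return seq[:end], qual[:end]


def drop_exact_duplicates(pairs):
    """Keep the first occurrence of each (R1 sequence, R2 sequence) pair."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def subsample(pairs, n, seed=1):
    """Pick n pairs at random; a fixed seed keeps the choice reproducible."""
    if len(pairs) <= n:
        return list(pairs)
    return random.Random(seed).sample(pairs, n)


# Hypothetical toy data: each tuple is (R1 sequence, R2 sequence).
pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("GGCC", "AATT"), ("CCAA", "GGTT")]
unique_pairs = drop_exact_duplicates(pairs)
picked = subsample(unique_pairs, 2)
trimmed = trim_low_quality_tail("ACGTAC", "IIII##")
```

Real duplicate removal and trimming tools handle far more cases (sequencing errors, adapters, sliding quality windows); the sketch only shows the basic idea behind each step.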
Purpose
- Remove sequencing artifacts
- Clean up sequencing data for downstream analysis
- Avoid misinterpreting analysis results because of laboratory-induced or technical bias
Required inputs
- Sequenced reads in gzipped FASTQ format.
  - Each sample is represented by two gzipped FASTQ files (R1 and R2).
  - These are the standard output files of paired-end sequencing.
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
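Before running the pipeline, it can help to confirm that every sample really has both files. The following is a minimal sketch under the naming scheme shown above; the function name `find_paired_samples` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def find_paired_samples(read_dir):
    """Map each sample name to its (R1, R2) files; fail if an R2 file is missing."""
    read_dir = Path(read_dir)
    samples = {}
    for r1 in sorted(read_dir.glob("*" + R1_SUFFIX)):
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = read_dir / (sample + R2_SUFFIX)
        if not r2.exists():
            raise FileNotFoundError(f"Missing R2 file for sample {sample}: {r2}")
        samples[sample] = (r1, r2)
    return samples
```

Calling `find_paired_samples("reads/original")` would return one entry per sample. Note that the sketch only checks that each R1 file has an R2 partner, not the reverse.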
Generated outputs
- Refined reads in gzipped FASTQ format.
  - Each sample is represented by two gzipped FASTQ files.
- Quality reports for the final and intermediate reads, used to assess the effect of individual preprocessing steps
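As a quick sanity check alongside the quality reports, the number of reads before and after a step can be compared directly. The sketch below is only an illustration: it assumes well-formed FASTQ with exactly four lines per record, and the function names are hypothetical. The location of the output files is not specified here, so any paths passed in are up to the reader.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def retained_fraction(before_path, after_path):
    """Fraction of reads that survived a step (0.0 if the input was empty)."""
    before = count_fastq_reads(before_path)
    after = count_fastq_reads(after_path)
    return after / before if before else 0.0
```

A retained fraction far below expectation for a given step may indicate a misconfigured threshold or a problem with the input data.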
Example
How to run the example:
cd /usr/local/snakelines/example/mhv
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_preprocess.yaml
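The same invocation can be scripted, for instance to do a dry run first. The sketch below is one possible wrapper, not part of the pipeline: the function names are hypothetical, and it assumes `snakemake` is on the `PATH`. The `-n` flag asks Snakemake to list the planned jobs without executing them.

```python
import subprocess


def build_snakemake_command(snakefile, configfile, dry_run=False):
    """Build the argument list for the snakemake call shown above."""
    cmd = ["snakemake", "--snakefile", snakefile, "--configfile", configfile]
    if dry_run:
        cmd.append("-n")  # dry run: show planned jobs, execute nothing
    return cmd


def run_example(workdir="/usr/local/snakelines/example/mhv", dry_run=True):
    """Run the example from its directory; raises if snakemake exits non-zero."""
    cmd = build_snakemake_command(
        "../../snakelines.snake", "config_preprocess.yaml", dry_run=dry_run
    )
    return subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `run_example()` performs a dry run by default; `run_example(dry_run=False)` executes the pipeline.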
Example configuration:
Planned improvements
- Aggregate quality statistics of multiple samples and processing steps with MultiQC