Preprocess sequencing paired-end reads¶
Remove parts or whole reads that are artifacts of laboratory sequencing process. They may blur a downstream analysis, and so lead to incorrect conclusions in their interpretations. According to provided configuration, preprocess may include:
- Removal of low quality parts of reads or adapters
- Removal of PCR duplicates
- Select a fixed number of reads from each sample to ensure consistency
- Removal of contamination from a known genome
- Selecting only fragments from a known genome
- Merging paired reads based on their read sequence overlap
Purpose¶
- Remove sequencing artifacts
- Clean-up sequencing data for downstream analysis
- Avoid false interpretation of data analysis results due to laboratory-induced or technical bias
Required inputs¶
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
Generated outputs¶
- Refined reads in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- Quality reports for resulting and intermediate reads to assess effect of individual preprocess steps
Example¶
How to run example:
cd /usr/local/snakelines/example/genomic
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_preprocess.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: example.* # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
report_dir: report/public/01-preprocess # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
decontaminated: # Eliminate fragments from known artificial source, e.g. contamination by human
method: bowtie2 # Supported values: bowtie2
temporary: False # If True, generated files would be removed after successful analysis
references: # List of reference genomes
- mhv
keep: True # Keep reads mapped to references (True) or remove them as contamination (False)
deduplicated: # Remove fragments with the same sequence (PCR duplicated)
method: fastuniq # Supported values: fastuniq
temporary: False # If True, generated files would be removed after successful analysis
subsampled: # Randomly select subset of reads
method: seqtk # Supported values: seqtk
n_reads: 10 # Number of reads to select
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
- decontaminated
- deduplicated
- subsampled
Planned improvements¶
- Aggregate quality statistics of multiple samples and processing steps with the MultiQC