Download reference and map paired-end reads¶
At first, download sequences from NCBI according to selected genbank ids and prepare reference database. Then proceed with the standard mapping pipeline. Align sequenced paired-end reads to a reference genome to find the most probable genomic positions of origin. Reference genome may represent DNA of a single organism, multiple organisms, transcripts…
The reference part of the configuration may be similarly used for other types of analysis, such as variant calling, methylation…
Purpose¶
- Download genomic sequences and prepare reference database
- Identify the source of sequenced material
- Assess composition of sequenced material
- Removal of a known contamination
- Purification of sequenced reads - select only reads from the target organism
- Essential step for several downstream analysis types - variant calling, transcriptomics …
Required inputs¶
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
Generated outputs¶
- Reference sequence database with taxonomies
- Mapped reads in sorted, indexed BAM format with marked duplicates
- Reports to assess mapping quality
- individual report for each sample
- summary report for comparison of multiple samples
Example¶
How to run example:
cd /usr/local/snakelines/example/genomic
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_download_reference_and_mapping.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: example.* # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
reference: mhv_ncbi # Reference genome for reads in the category (reference/mhv/mhv.fa)
report_dir: report/public/01-viral # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reference:
download:
method: entrez # Supported values: entrez
email: FILLME@SOMEMAIL.COM # Inform NCBI who you are to contact you in case of excessive use.
mhv_ncbi: # List of genbank ids to download, one list for each reference database
- U97553.2
- AF127083.1
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
deduplicated: # Remove fragments with the same sequence (PCR duplicated)
method: fastuniq # Supported values: fastuniq
temporary: False # If True, generated files would be removed after successful analysis
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
- deduplicated
mapping: # Find the most similar genomic region to reads in reference (mapping process)
mapper: # Method for mapping
method: bowtie2
params: --very-sensitive # Additional parameters
only_concordant: False # Keep only reads with consistently mapped reads from both paired-ends
index: # Generate .bai index for mapped reads in .bam files
method: samtools # Supported values: samtools
postprocess: # Successive steps to refine mapped reads
original: # Reads retrieved from mapping process
temporary: True # If True, generated files would be removed after successful analysis
sorted:
method: samtools
temporary: False
report: # Summary reports of mapping process and results
quality_report: # HTML summary with quality of mappings
method: qualimap # Supported values: qualimap
map_types: # List of post-process steps for quality reports
- sorted
Planned improvements¶
- Aggregate quality statistics of preprocess and mapping with the MultiQC
- Realignment postprocess step to refine alignment in indel regions