Download reference and map reads¶

At first, download sequences from NCBI according to selected genbank ids and prepare reference database. Then proceed with the standard mapping pipeline. Align sequenced reads to a reference genome to find the most probable genomic positions of origin. Reference genome may represent DNA of a single organism, multiple organisms, transcripts…

The reference part of the configuration may be similarly used for other types of analysis, such as variant calling, methylation…

Purpose¶

Download genomic sequences and prepare reference database
Identify the source of sequenced material
Assess composition of sequenced material
Removal of a known contamination
Purification of sequenced reads - select only reads from the target organism
Essential step for several downstream analysis types - variant calling, transcriptomics …

Required inputs¶

Sequenced reads in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

Generated outputs¶

Reference sequence database with taxonomies
Mapped reads in sorted, indexed BAM format with marked duplicates
Reports to assess mapping quality
- individual report for each sample
- summary report for comparison of multiple samples

Example¶

How to run example:

cd /usr/local/snakelines/example/mhv

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_download_reference_and_mapping.yaml

Example configuration:

Planned improvements¶

Aggregate quality statistics of preprocess and mapping with the MultiQC
Realignment postprocess step to refine alignment in indel regions

Download reference and map reads¶

Purpose¶

Required inputs¶

Generated outputs¶

Example¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page