Download reference and map paired-end reads¶

At first, download sequences from NCBI according to selected genbank ids and prepare reference database. Then proceed with the standard mapping pipeline. Align sequenced paired-end reads to a reference genome to find the most probable genomic positions of origin. Reference genome may represent DNA of a single organism, multiple organisms, transcripts…

The reference part of the configuration may be similarly used for other types of analysis, such as variant calling, methylation…

Purpose¶

Download genomic sequences and prepare reference database
Identify the source of sequenced material
Assess composition of sequenced material
Removal of a known contamination
Purification of sequenced reads - select only reads from the target organism
Essential step for several downstream analysis types - variant calling, transcriptomics …

Required inputs¶

Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

Generated outputs¶

Reference sequence database with taxonomies
Mapped reads in sorted, indexed BAM format with marked duplicates
Reports to assess mapping quality
- individual report for each sample
- summary report for comparison of multiple samples

Example¶

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_download_reference_and_mapping.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: example.*                # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv_ncbi            # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-viral   # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reference:
    download:
        method: entrez               # Supported values: entrez
        email: FILLME@SOMEMAIL.COM   # Inform NCBI who you are to contact you in case of excessive use.
        mhv_ncbi:                    # List of genbank ids to download, one list for each reference database
            - U97553.2
            - AF127083.1

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...
        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - deduplicated

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bowtie2
        params: --very-sensitive    # Additional parameters
        only_concordant: False      # Keep only reads with consistently mapped reads from both paired-ends

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        original:                  # Reads retrieved from mapping process
            temporary: True         # If True, generated files would be removed after successful analysis
        sorted:
            method: samtools
            temporary: False

    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - sorted

Planned improvements¶

Aggregate quality statistics of preprocess and mapping with the MultiQC
Realignment postprocess step to refine alignment in indel regions

Download reference and map paired-end reads¶

Purpose¶

Required inputs¶

Generated outputs¶

Example¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page