Download reference and map paired-end reads

At first, download sequences from NCBI according to selected genbank ids and prepare reference database. Then proceed with the standard mapping pipeline. Align sequenced paired-end reads to a reference genome to find the most probable genomic positions of origin. Reference genome may represent DNA of a single organism, multiple organisms, transcripts…

The reference part of the configuration may be similarly used for other types of analysis, such as variant calling, methylation…

Purpose

  • Download genomic sequences and prepare reference database
  • Identify the source of sequenced material
  • Assess composition of sequenced material
  • Removal of a known contamination
  • Purification of sequenced reads - select only reads from the target organism
  • Essential step for several downstream analysis types - variant calling, transcriptomics …

Required inputs

  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

Generated outputs

  • Reference sequence database with taxonomies
  • Mapped reads in sorted, indexed BAM format with marked duplicates
  • Reports to assess mapping quality
    • individual report for each sample
    • summary report for comparison of multiple samples

Example

How to run example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_download_reference_and_mapping.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                             # List of sample categories to be analysed
    - name: example.*                # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv_ncbi            # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-viral   # Generated reports and essential output files would be stored there
threads: 16                          # Number of threads to use in analysis

reference:
    download:
        method: entrez               # Supported values: entrez
        email: FILLME@SOMEMAIL.COM   # Inform NCBI who you are to contact you in case of excessive use.
        mhv_ncbi:                    # List of genbank ids to download, one list for each reference database
            - U97553.2
            - AF127083.1

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...
        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - deduplicated

mapping:                            # Find the most similar genomic region to reads in reference (mapping process)
    mapper:                         # Method for mapping
        method: bowtie2
        params: --very-sensitive    # Additional parameters
        only_concordant: False      # Keep only reads with consistently mapped reads from both paired-ends

    index:                          # Generate .bai index for mapped reads in .bam files
        method: samtools            # Supported values: samtools

    postprocess:                    # Successive steps to refine mapped reads
        original:                  # Reads retrieved from mapping process
            temporary: True         # If True, generated files would be removed after successful analysis
        sorted:
            method: samtools
            temporary: False

    report:                         # Summary reports of mapping process and results
        quality_report:             # HTML summary with quality of mappings
            method: qualimap        # Supported values: qualimap
            map_types:              # List of post-process steps for quality reports
                - sorted

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Realignment postprocess step to refine alignment in indel regions