Assemble reads

Join reads with overlaps into larger continuous genomic sequences, contigs.

Purpose

  • Determine genomic sequence of sequenced organism
  • Reduce the huge number of small read sequences into larger, more manageable genomic contigs

Required inputs

  • Sequenced reads in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

Generated outputs

  • Assembled genomic sequences (contigs) in fasta format
  • Reports to assess quality of assembly
  • Graph visualisation of assembly to visually assess its complexity

Example

How to run example:

cd /usr/local/snakelines/example/mhv

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_assembly.yaml

Example configuration:

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Connect contigs into scaffolds based on known genomic sequence of related organism
  • Aggregate quast results of individual samples into summary report