Assemble reads¶

Join overlapping paired-end reads into longer contiguous genomic sequences (contigs).

Purpose¶

- Determine the genomic sequence of the sequenced organism
- Reduce the huge number of short reads to a smaller set of longer, more manageable genomic contigs

Required inputs¶

Paired-end reads from an Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
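
Because each sample must be represented by both an `_R1` and an `_R2` file, it can help to check the input directory before starting the pipeline. The following is a minimal sketch of such a check, not part of SnakeLines itself; the helper name `find_unpaired_samples` is hypothetical, and the file naming is assumed to follow the layout shown above.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def find_unpaired_samples(reads_dir):
    """Return sorted names of samples that lack either the R1 or the R2 file.

    Hypothetical helper: assumes files are named <sample>_R1.fastq.gz and
    <sample>_R2.fastq.gz directly inside reads_dir.
    """
    reads_dir = Path(reads_dir)
    # Strip the suffix to recover the sample name from each file name
    r1 = {p.name[: -len(R1_SUFFIX)] for p in reads_dir.glob("*" + R1_SUFFIX)}
    r2 = {p.name[: -len(R2_SUFFIX)] for p in reads_dir.glob("*" + R2_SUFFIX)}
    # Symmetric difference: samples present in only one of the two sets
    return sorted(r1 ^ r2)
```

Calling `find_unpaired_samples("reads/original")` returns an empty list when every sample has both files.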

Generated outputs¶

- Assembled genomic sequences (contigs) in fasta format
- Reports to assess the quality of the assembly
- Graph visualisation of the assembly to visually assess its complexity

Example¶

How to run the example:

cd /usr/local/snakelines/example/genomic

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_assembly.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: example.*               # Regular expression matching sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv                # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-assembly   # Generated reports and essential output files are stored here
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files are removed after successful analysis
            crop: 500               # Maximal number of bases to keep in a read. Longer reads are truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of a read
            minlen: 35              # Minimal length of a trimmed read. Shorter reads are removed.

        decontaminated:             # Eliminate fragments from known contaminating sources, e.g. human DNA
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files are removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicates)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files are removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated

assembly:                           # Join reads into longer sequences (contigs) based on their overlaps
    assembler:                      # Method for joining reads
        method: spades              # Supported values: spades, unicycler, megahit
        mode: standard              # Supported values: standard, meta, plasmid, rna, iontorrent
        careful: True               # Reduces the number of mismatches and short indels at the cost of longer runtime. Cannot be combined with the meta mode.

    report:                         # Summary reports for assembly process and results
        quality_report:             # Quality of assembled contigs
            method: quast           # Supported values: quast

        assembly_graph:             # Visualisation of overlaps between assembled contigs
            method: bandage         # Supported values: bandage
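
Two details of the configuration above are easy to get wrong: the `name` entry is a regular expression matched against file names in `reads/original`, and `careful: True` cannot be combined with `mode: meta`. The sketch below illustrates both rules on plain Python data. It is an illustration only, not how SnakeLines implements them; the function names `select_samples` and `check_assembler` are hypothetical, and the full-match behaviour of the pattern is an assumption.

```python
import re


def select_samples(pattern, file_names):
    """Return sample names whose <name>_R1.fastq.gz file matches the pattern.

    Assumption: the configured pattern must match the whole sample name.
    """
    rx = re.compile(r"(%s)_R1\.fastq\.gz" % pattern)
    selected = []
    for file_name in file_names:
        match = rx.fullmatch(file_name)
        if match:
            selected.append(match.group(1))
    return sorted(selected)


def check_assembler(assembler):
    """Return a list of problems found in the `assembly: assembler:` settings."""
    problems = []
    # The configuration comment states that careful and meta are incompatible
    if assembler.get("mode") == "meta" and assembler.get("careful"):
        problems.append("careful: True cannot be combined with mode: meta")
    return problems
```

With the example configuration, `select_samples("example.*", ...)` picks every sample whose name starts with `example`, and `check_assembler({"method": "spades", "mode": "standard", "careful": True})` reports no problems.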

Planned improvements¶

- Aggregate quality statistics of preprocessing and mapping with MultiQC
- Connect contigs into scaffolds based on the known genomic sequence of a related organism
- Aggregate quast results of individual samples into a summary report
