Assemble reads¶
Join overlapping paired-end reads into longer contiguous genomic sequences (contigs).
Purpose¶
- Determine the genomic sequence of the sequenced organism
- Reduce the large number of short reads to a smaller set of longer, more manageable contigs
Required inputs¶
- Sequenced paired-end reads from an Illumina sequencer in gzipped FASTQ format.
    - Each sample is represented by two gzipped FASTQ files (R1 and R2).
    - These are the standard output files of paired-end sequencing.
|-- reads/original
    |-- <sample_1>_R1.fastq.gz
    |-- <sample_1>_R2.fastq.gz
    |-- <sample_2>_R1.fastq.gz
    |-- <sample_2>_R2.fastq.gz
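The layout above can be checked before a run. The following is a minimal sketch, not part of SnakeLines: the function name `find_read_pairs` and its `name_pattern` argument are hypothetical, and the sketch assumes that files follow the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming shown above. It groups files by sample and reports samples that lack one of the two mates.

```python
import re
from pathlib import Path


def find_read_pairs(read_dir, name_pattern=".*"):
    """Return ({sample: (R1 path, R2 path)}, [samples missing a mate]).

    read_dir     -- directory holding <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
    name_pattern -- regular expression that the whole sample name must match
                    (hypothetical helper argument, not a SnakeLines option)
    """
    suffix = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")
    mates = {}
    for path in sorted(Path(read_dir).glob("*.fastq.gz")):
        match = suffix.match(path.name)
        if match is None or not re.fullmatch(name_pattern, match["sample"]):
            continue  # not a paired-end read file, or sample not selected
        mates.setdefault(match["sample"], {})[match["mate"]] = path

    # Keep only samples with both mates; report the rest separately.
    pairs = {s: (m["1"], m["2"]) for s, m in mates.items() if len(m) == 2}
    unpaired = sorted(s for s, m in mates.items() if len(m) != 2)
    return pairs, unpaired
```

Calling `find_read_pairs("reads/original", "example.*")` would return the complete pairs for samples whose names match `example.*`, together with a list of matching samples that have only an R1 or only an R2 file.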
Generated outputs¶
- Assembled genomic sequences (contigs) in FASTA format
- Reports for assessing the quality of the assembly
- Graph visualisation of the assembly for assessing its complexity
Example¶
How to run the example:
cd /usr/local/snakelines/example/genomic
snakemake \
    --snakefile ../../snakelines.snake \
    --configfile config_assembly.yaml \
    --use-conda
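When the example is launched from a script rather than typed by hand, the same command can be built as an argument list. The sketch below is an illustration only: the helper names `snakemake_command` and `run_example` are hypothetical, and the optional dry run relies on Snakemake's `--dry-run` flag, which lists the planned jobs without executing them.

```python
import subprocess


def snakemake_command(snakefile, configfile, dry_run=False):
    """Build the argument list for a SnakeLines run (hypothetical helper)."""
    command = [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--use-conda",
    ]
    if dry_run:
        command.append("--dry-run")  # list the planned jobs without executing them
    return command


def run_example(workdir="/usr/local/snakelines/example/genomic"):
    """Preview the example with a dry run, then execute it."""
    for dry_run in (True, False):
        command = snakemake_command(
            "../../snakelines.snake", "config_assembly.yaml", dry_run
        )
        # check=True raises CalledProcessError if snakemake exits with an error
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `run_example()` first previews the workflow and only starts the real run if the dry run succeeds.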
Example configuration:
sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: example.*               # Regex expression of sample names to be analysed (reads/original/example.*_R1.fastq.gz)
      reference: mhv                # Reference genome for reads in the category (reference/mhv/mhv.fa)

report_dir: report/public/01-assembly   # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        decontaminated:             # Eliminate fragments from known artificial source, e.g. contamination by human
            method: bowtie2         # Supported values: bowtie2
            temporary: False        # If True, generated files would be removed after successful analysis
            references:             # List of reference genomes
                - mhv
            keep: True              # Keep reads mapped to references (True) or remove them as contamination (False)

        deduplicated:               # Remove fragments with the same sequence (PCR duplicated)
            method: fastuniq        # Supported values: fastuniq
            temporary: False        # If True, generated files would be removed after successful analysis

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed
                - decontaminated
                - deduplicated

assembly:                           # Join reads into longer sequences (contigs) based on their overlaps
    assembler:                      # Method for joining reads
        method: spades              # Supported values: spades, unicycler, megahit
        mode: standard              # Supported values: standard, meta, plasmid, rna, iontorrent
        careful: True               # Can not be combined with the meta mode. Reduce number of mismatches and short indels, longer runtime

    report:                         # Summary reports for assembly process and results
        quality_report:             # Quality of assembled contigs
            method: quast           # Supported values: quast

        assembly_graph:             # Visualisation of overlaps between assembled contigs
            method: bandage         # Supported values: bandage
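The assembler options in the configuration come with constraints stated in its comments: a fixed set of supported methods and modes, and `careful` not being combinable with the `meta` mode. The following sketch shows how such a configuration fragment could be checked before starting a run. It is not SnakeLines code: the function `check_assembler` is hypothetical, it works on a plain dictionary mirroring the `assembly: assembler:` section, and the fallback to `standard` when `mode` is missing is an assumption.

```python
# Supported values as listed in the comments of the example configuration.
SUPPORTED_METHODS = {"spades", "unicycler", "megahit"}
SUPPORTED_MODES = {"standard", "meta", "plasmid", "rna", "iontorrent"}


def check_assembler(assembler):
    """Return a list of problems found in an assembler config mapping."""
    problems = []
    method = assembler.get("method")
    mode = assembler.get("mode", "standard")  # assumed default, not confirmed
    if method not in SUPPORTED_METHODS:
        problems.append(f"unsupported method: {method!r}")
    if mode not in SUPPORTED_MODES:
        problems.append(f"unsupported mode: {mode!r}")
    if assembler.get("careful") and mode == "meta":
        problems.append("careful cannot be combined with the meta mode")
    return problems


# The example configuration passes; switching to the meta mode does not.
print(check_assembler({"method": "spades", "mode": "standard", "careful": True}))
# → []
print(check_assembler({"method": "spades", "mode": "meta", "careful": True}))
# → ['careful cannot be combined with the meta mode']
```

Returning a list of messages rather than raising on the first problem lets all configuration issues be reported at once.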