Analyse microbial composition¶
Determine composition of organisms (typically microbial) in sequenced samples. Data analysis pipeline expect data from shotgun sequencing of a single target gene, for example 16S. Visually assess changes in compositions between several samples.
Purpose¶
- Identify which organisms are present in sequenced samples
- Compare an abundance of organisms with each other in a single sample
- Compare abundances of organisms between samples
Required inputs¶
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
- .tax file with taxonomic description of each organism in reference fasta
- TSV format
- no header
- each row represents a single target gene from the reference fasta
- the first column represents the identifier of the gene (the first word after > symbol)
- the second column represents taxonomy of the organism
- for example
KF494428.1.1396 Bacteria;Epsilonbacteraeota;Campylobacteria;Campylobacterales;Thiovulaceae;Sulfuricurvum;Sulfuricurvum sp. EW1;;;;;;;;
AF506248.1.1375 Bacteria;Cyanobacteria;Oxyphotobacteria;Nostocales;Nostocaceae;Nostoc PCC-73102;Nostoc sp. 'Nephroma expallidum cyanobiont 23';;;;;;;;
EF603722.1.1487 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Muribaculaceae;;;;;;;;;;
...
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
In case, you are using Metaxa2 classifier, you also need reference genomes of target genes in fasta format.
|-- reference/<reference>
|-- <reference>.fa
|-- <reference>.tax
Generated outputs¶
- Summary table with the number of reads per organism in sequenced sample
- Pie plot with hierarchical taxonomical composition of the samples
- Barplots for visual comparison of organisms abundance across sequenced samples
Example¶
Metaxa2 classifier¶
How to run example:
You need to provide reference sequences for this example to run. Sequences of the 16S gene (store them as reference/silva-16S/silva-16S.fa) may be downloaded from the Silva database. Sequences of the ITS gene (store them as reference/unite/unite.fa) may be downloaded from the Unite database.
cd /usr/local/snakelines/example/metagenomics
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_metagenomics.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: .*-16S # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
reference: silva-16S # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)
- name: .*-ITS # Another category of samples, with names that end with "-ITS"
reference: unite # Reference genome for reads in the category (reference/unite/unite.fa)
report_dir: report/public/01-classify # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
classification: # Identify genomic source of sequenced reads
read_based: # Find homologue sequences by comparing reads to reference sequences
method: metaxa2 # Supported values: metaxa2
confidence: 0.8 # Reliability cutoff for taxonomic classification
report: # Summary reports of classification results
taxonomic_counts: # Number of reads mapped to each taxonomic unit
pieplot: # Visualisation in pie plot form
method: krona # Supported values: krona
count_table: # Summary table with number of reads per taxonomic unit
method: custom # Supported values: custom
tax_levels: # List of taxonomic levels for which tables would be generated
- class
- genus
barplot: # Visualisation in bar plot form
method: custom # Supported values: custom
formats: # Output format of the resulting images
- png
- svg
tax_levels: # List of taxonomic levels for which plots would be generated
- class
- genus
alpha_diversity: # Alpha diversity computation
method: custom # Supported values: custom
min_reads: 10 # Minimal number of reads to keep a sample (for downsampling only)
tax_levels: # List of taxonomic levels for which alpha diversities would be generated
- class
- genus
RDP classifier¶
How to run example:
RDP classifier has already prebuilt databases that may be used, specifically 16srrna, fungallsu, fungalits_warcup, fungalits_unite
cd /usr/local/snakelines/example/metagenomics
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_metagenomics_rdp.yaml
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: .*-16S # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
reference: 16srrna # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
prebuilt: True # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared
- name: .*-ITS # Regex expression of sample names to be analysed (reads/original/.*-ITS_R1.fastq.gz)
reference: fungalits_unite # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
prebuilt: True # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared
report_dir: report/public/01-classify_rdp # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
subsampled:
method: seqtk
n_reads: 100
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
joined: # Join paired reads into single end reads based on sequence overlap
method: pear # Supported values: pear
n_ambiguous_bases: 20 # Number of N bases between two reads without overlap
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
classification: # Identify genomic source of sequenced reads
read_based: # Find homologue sequences by comparing reads to reference sequences
method: rdp # Supported values: metaxa2, rdp
confidence: 0.8 # Reliability cutoff for taxonomic classification
report: # Summary reports of classification results
taxonomic_counts: # Number of reads mapped to each taxonomic unit
pieplot: # Visualisation in pie plot form
method: krona # Supported values: krona
count_table: # Summary table with number of reads per taxonomic unit
method: custom # Supported values: custom
tax_levels: # List of taxonomic levels for which tables would be generated
- class
- genus
barplot: # Visualisation in bar plot form
method: custom # Supported values: custom
formats: # Output format of the resulting images
- png
- svg
tax_levels: # List of taxonomic levels for which plots would be generated
- class
- genus