Analyse microbial composition¶

Determine composition of organisms (typically microbial) in sequenced samples. Data analysis pipeline expect data from shotgun sequencing of a single target gene, for example 16S. Visually assess changes in compositions between several samples.

Purpose¶

Identify which organisms are present in sequenced samples
Compare an abundance of organisms with each other in a single sample
Compare abundances of organisms between samples

Required inputs¶

Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
.tax file with taxonomic description of each organism in reference fasta
- TSV format
- no header
- each row represents a single target gene from the reference fasta
- the first column represents the identifier of the gene (the first word after > symbol)
- the second column represents taxonomy of the organism
- for example

KF494428.1.1396 Bacteria;Epsilonbacteraeota;Campylobacteria;Campylobacterales;Thiovulaceae;Sulfuricurvum;Sulfuricurvum sp. EW1;;;;;;;;
AF506248.1.1375 Bacteria;Cyanobacteria;Oxyphotobacteria;Nostocales;Nostocaceae;Nostoc PCC-73102;Nostoc sp. 'Nephroma expallidum cyanobiont 23';;;;;;;;
EF603722.1.1487 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Muribaculaceae;;;;;;;;;;
...

|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

In case, you are using Metaxa2 classifier, you also need reference genomes of target genes in fasta format.

|-- reference/<reference>
        |-- <reference>.fa
        |-- <reference>.tax

Generated outputs¶

Summary table with the number of reads per organism in sequenced sample
Pie plot with hierarchical taxonomical composition of the samples
Barplots for visual comparison of organisms abundance across sequenced samples

Example¶

Metaxa2 classifier¶

How to run example:

You need to provide reference sequences for this example to run. Sequences of the 16S gene (store them as reference/silva-16S/silva-16S.fa) may be downloaded from the Silva database. Sequences of the ITS gene (store them as reference/unite/unite.fa) may be downloaded from the Unite database.

cd /usr/local/snakelines/example/metagenomics

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_metagenomics.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*-16S                  # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: silva-16S          # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)
    - name: .*-ITS                  # Another category of samples, with names that end with "-ITS"
      reference: unite              # Reference genome for reads in the category (reference/unite/unite.fa)

report_dir: report/public/01-classify      # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    read_based:                     # Find homologue sequences by comparing reads to reference sequences
        method: metaxa2             # Supported values: metaxa2
        confidence: 0.8             # Reliability cutoff for taxonomic classification

    report:                         # Summary reports of classification results
        taxonomic_counts:           # Number of reads mapped to each taxonomic unit
            pieplot:                # Visualisation in pie plot form
                method: krona       # Supported values: krona

            count_table:            # Summary table with number of reads per taxonomic unit
                method: custom      # Supported values: custom
                tax_levels:         # List of taxonomic levels for which tables would be generated
                    - class
                    - genus

            barplot:                # Visualisation in bar plot form
                method: custom      # Supported values: custom
                formats:            # Output format of the resulting images
                    - png
                    - svg
                tax_levels:         # List of taxonomic levels for which plots would be generated
                    - class
                    - genus

            alpha_diversity:        # Alpha diversity computation
                method: custom      # Supported values: custom
                min_reads: 10     # Minimal number of reads to keep a sample (for downsampling only)
                tax_levels:         # List of taxonomic levels for which alpha diversities would be generated
                    - class
                    - genus

RDP classifier¶

How to run example:

RDP classifier has already prebuilt databases that may be used, specifically 16srrna, fungallsu, fungalits_warcup, fungalits_unite

cd /usr/local/snakelines/example/metagenomics

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_metagenomics_rdp.yaml

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*-16S                  # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: 16srrna            # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
      prebuilt: True                # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared
    - name: .*-ITS                  # Regex expression of sample names to be analysed (reads/original/.*-ITS_R1.fastq.gz)
      reference: fungalits_unite    # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
      prebuilt: True                # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared

report_dir: report/public/01-classify_rdp  # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        subsampled:
            method: seqtk
            n_reads: 100

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        joined:                     # Join paired reads into single end reads based on sequence overlap
            method: pear            # Supported values: pear
            n_ambiguous_bases: 20   # Number of N bases between two reads without overlap

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    read_based:                     # Find homologue sequences by comparing reads to reference sequences
        method: rdp                 # Supported values: metaxa2, rdp
        confidence: 0.8             # Reliability cutoff for taxonomic classification

    report:                         # Summary reports of classification results
        taxonomic_counts:           # Number of reads mapped to each taxonomic unit
            pieplot:                # Visualisation in pie plot form
                method: krona       # Supported values: krona

            count_table:            # Summary table with number of reads per taxonomic unit
                method: custom      # Supported values: custom
                tax_levels:         # List of taxonomic levels for which tables would be generated
                    - class
                    - genus

            barplot:                # Visualisation in bar plot form
                method: custom      # Supported values: custom
                formats:            # Output format of the resulting images
                    - png
                    - svg
                tax_levels:         # List of taxonomic levels for which plots would be generated
                    - class
                    - genus

Planned improvements¶

Aggregate quality statistics of preprocess and mapping with the MultiQC
Add toy reference genome to example
Qiime2 based analysis

Analyse microbial composition¶

Purpose¶

Required inputs¶

Generated outputs¶

Example¶

Metaxa2 classifier¶

RDP classifier¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page