Analyse microbial composition

Determine composition of organisms (typically microbial) in sequenced samples. Data analysis pipeline expect data from shotgun sequencing of a single target gene, for example 16S. Visually assess changes in compositions between several samples.

Purpose

  • Identify which organisms are present in sequenced samples
  • Compare an abundance of organisms with each other in a single sample
  • Compare abundances of organisms between samples

Required inputs

  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
  • .tax file with taxonomic description of each organism in reference fasta
    • TSV format
    • no header
    • each row represents a single target gene from the reference fasta
    • the first column represents the identifier of the gene (the first word after > symbol)
    • the second column represents taxonomy of the organism
    • for example
KF494428.1.1396 Bacteria;Epsilonbacteraeota;Campylobacteria;Campylobacterales;Thiovulaceae;Sulfuricurvum;Sulfuricurvum sp. EW1;;;;;;;;
AF506248.1.1375 Bacteria;Cyanobacteria;Oxyphotobacteria;Nostocales;Nostocaceae;Nostoc PCC-73102;Nostoc sp. 'Nephroma expallidum cyanobiont 23';;;;;;;;
EF603722.1.1487 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Muribaculaceae;;;;;;;;;;
...
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz

In case, you are using Metaxa2 classifier, you also need reference genomes of target genes in fasta format.

|-- reference/<reference>
        |-- <reference>.fa
        |-- <reference>.tax

Generated outputs

  • Summary table with the number of reads per organism in sequenced sample
  • Pie plot with hierarchical taxonomical composition of the samples
  • Barplots for visual comparison of organisms abundance across sequenced samples

Example

Metaxa2 classifier

How to run example:

You need to provide reference sequences for this example to run. Sequences of the 16S gene (store them as reference/silva-16S/silva-16S.fa) may be downloaded from the Silva database. Sequences of the ITS gene (store them as reference/unite/unite.fa) may be downloaded from the Unite database.

cd /usr/local/snakelines/example/metagenomics

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_metagenomics.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*-16S                  # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: silva-16S          # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)
    - name: .*-ITS                  # Another category of samples, with names that end with "-ITS"
      reference: unite              # Reference genome for reads in the category (reference/unite/unite.fa)

report_dir: report/public/01-classify      # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    read_based:                     # Find homologue sequences by comparing reads to reference sequences
        method: metaxa2             # Supported values: metaxa2
        confidence: 0.8             # Reliability cutoff for taxonomic classification

    report:                         # Summary reports of classification results
        taxonomic_counts:           # Number of reads mapped to each taxonomic unit
            pieplot:                # Visualisation in pie plot form
                method: krona       # Supported values: krona

            count_table:            # Summary table with number of reads per taxonomic unit
                method: custom      # Supported values: custom
                tax_levels:         # List of taxonomic levels for which tables would be generated
                    - class
                    - genus

            barplot:                # Visualisation in bar plot form
                method: custom      # Supported values: custom
                formats:            # Output format of the resulting images
                    - png
                    - svg
                tax_levels:         # List of taxonomic levels for which plots would be generated
                    - class
                    - genus

            alpha_diversity:        # Alpha diversity computation
                method: custom      # Supported values: custom
                min_reads: 10     # Minimal number of reads to keep a sample (for downsampling only)
                tax_levels:         # List of taxonomic levels for which alpha diversities would be generated
                    - class
                    - genus

RDP classifier

How to run example:

RDP classifier has already prebuilt databases that may be used, specifically 16srrna, fungallsu, fungalits_warcup, fungalits_unite

cd /usr/local/snakelines/example/metagenomics

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_metagenomics_rdp.yaml

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*-16S                  # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: 16srrna            # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
      prebuilt: True                # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared
    - name: .*-ITS                  # Regex expression of sample names to be analysed (reads/original/.*-ITS_R1.fastq.gz)
      reference: fungalits_unite    # RDP classifier Supported values: 16srrna, fungallsu, fungalits_unite, fungalits_warcup
      prebuilt: True                # Reference sequence reference/{reference}/{reference}.fa does not exist, but all required indices are already prepared

report_dir: report/public/01-classify_rdp  # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        subsampled:
            method: seqtk
            n_reads: 100

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

        joined:                     # Join paired reads into single end reads based on sequence overlap
            method: pear            # Supported values: pear
            n_ambiguous_bases: 20   # Number of N bases between two reads without overlap

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    read_based:                     # Find homologue sequences by comparing reads to reference sequences
        method: rdp                 # Supported values: metaxa2, rdp
        confidence: 0.8             # Reliability cutoff for taxonomic classification

    report:                         # Summary reports of classification results
        taxonomic_counts:           # Number of reads mapped to each taxonomic unit
            pieplot:                # Visualisation in pie plot form
                method: krona       # Supported values: krona

            count_table:            # Summary table with number of reads per taxonomic unit
                method: custom      # Supported values: custom
                tax_levels:         # List of taxonomic levels for which tables would be generated
                    - class
                    - genus

            barplot:                # Visualisation in bar plot form
                method: custom      # Supported values: custom
                formats:            # Output format of the resulting images
                    - png
                    - svg
                tax_levels:         # List of taxonomic levels for which plots would be generated
                    - class
                    - genus

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Add toy reference genome to example
  • Qiime2 based analysis