Analyse gene expression

Determine level of expression of individual transcripts from sequenced RNA. Compare expressions between samples from two conditions to identify transcripts with changed expression.

Purpose

  • Identify which transcripts are expressed
  • Compare expression of transcripts with each other in a single sample
  • Compare changes of expression between samples with different conditions, e.g.
    • environment
    • tissues
    • phenotypic trait
  • Assess impact of a gene on biochemical pathways (compare wild type with sample without the gene)

Required inputs

  • Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
    • each sample is represented by two gzipped fastq files
    • standard output files of paired-end sequencing
  • Transcripts of reference genome in fasta format
  • Sample metadata file in TSV format
    • each row describe a single sequenced sample
    • the first, ‘sample-id’ column represents names of fastq files without _R[12].fastq.gz part, e.g. sample_1, sample_2
    • other columns represent attributes of sequenced samples that may be used to separate samples into categories for comparison, e.g. environment, tissue
|-- description
        |-- sample-metadata.csv
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.transcripts.fa

Generated outputs

  • Summary table of number of transcripts per sample, normalised by read count and length of transcripts
  • Graphical, 2D PCA comparison of samples based on their expression to visually assess relationships between them
  • Set of transcripts with significant set of expression between selected conditions

Example

How to run example:

cd /usr/local/snakelines/example/rnaseq

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_transcriptomics.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*                      # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: pombe              # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)

report_dir: report/public/01-rnaseq # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    transcripts:                    # Find homologue sequences by comparing reads to reference sequences
        method: salmon              # Supported values: salmon
        library:                    # According to notation in https://salmon.readthedocs.io/en/latest/library_type.html
            orientation: inward     # Relative orientation of reads in DNA fragments - inward/outward/matching
            strandedness: stranded  # Preserve sequencing DNA strand of fragment? - stranded/unstranded
            direction: reverse      # Orientation of the first read in a fragment - forward/reverse
        count_type: mapped_reads    # Type of counts to report - mapped_reads/tpm (transcripts per kilobase million)

    differential_analysis:          # Find transcripts with significant change across two sample groups
        method: edger               # Supported values: edger
        group_by: culture           # Attribute name (in the metadata file header) that split samples into two groups for comparison
        batch: seqrun               # Attribute name (in the metadata file header) that split samples into groups with similar batch effect for correction
        filter_significant:                # Filter transcripts with significant change in expression
            method: custom                 # Supported values: custom
            max_fdr: 0.05                  # Maximal value of fold discovery change for transcript to be reported
            min_fold_change: 1.5           # Minimal value of fold change for transcript to be reported
            reproducible_expression: True  # At least one read must be mapped to transcript in all samples from over-expressed group to be reported

    report:
        transcripts:

            count_table:               # Summary table with number of reads per transcript
                method: custom         # Supported values: custom

            html_table:                # Summary HTML table with results of differential expressions
                method: custom         # Supported values: custom
                annotation:            # Annotate with attributes from external sources
                    - source: pombase  # Annotate with attributes from reference/{reference}/annotation.transcripts/pombas/attributes.tsv
                      attributes:      # Annotate with listed attributes only
                        - link         # Annotate with attribute link
                    - source: go       # Annotate with attributes from reference/{reference}/annotation.transcripts/go/attributes.tsv
                                       # Annotate with all attributes since explicit attributes are not defined

            revigo:                    # GO annotation terms in format suitable for visualisation on the ReviGO website (http://revigo.irb.hr/)
                method: custom         # Supported values: custom


            pca:
                method: sklearn     # Supported values: sklearn
                formats:            # Output format of the resulting images
                    - png
                    - svg

Planned improvements

  • Aggregate quality statistics of preprocess and mapping with the MultiQC
  • Add mapped based approach with resulting coverage tracks