Analyse gene expression¶

Determine level of expression of individual transcripts from sequenced RNA. Compare expressions between samples from two conditions to identify transcripts with changed expression.

Purpose¶

Identify which transcripts are expressed
Compare expression of transcripts with each other in a single sample
Compare changes of expression between samples with different conditions, e.g.
- environment
- tissues
- phenotypic trait
Assess impact of a gene on biochemical pathways (compare wild type with sample without the gene)

Required inputs¶

Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
Transcripts of reference genome in fasta format
Sample metadata file in TSV format
- each row describe a single sequenced sample
- the first, ‘sample-id’ column represents names of fastq files without _R[12].fastq.gz part, e.g. sample_1, sample_2
- other columns represent attributes of sequenced samples that may be used to separate samples into categories for comparison, e.g. environment, tissue

|-- description
        |-- sample-metadata.csv
|-- reads/original
        |-- <sample_1>_R1.fastq.gz
        |-- <sample_1>_R2.fastq.gz
        |-- <sample_2>_R1.fastq.gz
        |-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
        |-- <reference>.transcripts.fa

Generated outputs¶

Summary table of number of transcripts per sample, normalised by read count and length of transcripts
Graphical, 2D PCA comparison of samples based on their expression to visually assess relationships between them
Set of transcripts with significant set of expression between selected conditions

Example¶

How to run example:

cd /usr/local/snakelines/example/rnaseq

snakemake \
   --snakefile ../../snakelines.snake \
   --configfile config_transcriptomics.yaml \
   --use-conda

Example configuration:

sequencing: paired_end
samples:                            # List of sample categories to be analysed
    - name: .*                      # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
      reference: pombe              # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)

report_dir: report/public/01-rnaseq # Generated reports and essential output files would be stored there
threads: 16                         # Number of threads to use in analysis

reads:                              # Prepare reads and quality reports for downstream analysis
    preprocess:                     # Pre-process of reads, eliminate sequencing artifacts, contamination ...

        trimmed:                    # Remove low quality parts of reads
            method: trimmomatic     # Supported values: trimmomatic
            temporary: False        # If True, generated files would be removed after successful analysis
            crop: 500               # Maximal number of bases in read to keep. Longer reads would be truncated.
            quality: 20             # Minimal average quality of read bases to keep (inside sliding window of length 5)
            headcrop: 20            # Number of bases to remove from the start of read
            minlen: 35              # Minimal length of trimmed read. Shorter reads would be removed.

    report:                         # Summary reports of read characteristics to assess their quality
        quality_report:             # HTML summary report of read quality
            method: fastqc          # Supported values: fastqc
            read_types:             # List of preprocess steps for quality reports
                - original
                - trimmed

classification:                     # Identify genomic source of sequenced reads
    transcripts:                    # Find homologue sequences by comparing reads to reference sequences
        method: salmon              # Supported values: salmon
        library:                    # According to notation in https://salmon.readthedocs.io/en/latest/library_type.html
            orientation: inward     # Relative orientation of reads in DNA fragments - inward/outward/matching
            strandedness: stranded  # Preserve sequencing DNA strand of fragment? - stranded/unstranded
            direction: reverse      # Orientation of the first read in a fragment - forward/reverse
        count_type: mapped_reads    # Type of counts to report - mapped_reads/tpm (transcripts per kilobase million)

    differential_analysis:          # Find transcripts with significant change across two sample groups
        method: edger               # Supported values: edger
        group_by: culture           # Attribute name (in the metadata file header) that split samples into two groups for comparison
        batch: seqrun               # Attribute name (in the metadata file header) that split samples into groups with similar batch effect for correction
        filter_significant:                # Filter transcripts with significant change in expression
            method: custom                 # Supported values: custom
            max_fdr: 0.05                  # Maximal value of fold discovery change for transcript to be reported
            min_fold_change: 1.5           # Minimal value of fold change for transcript to be reported
            reproducible_expression: True  # At least one read must be mapped to transcript in all samples from over-expressed group to be reported

    report:
        transcripts:

            count_table:               # Summary table with number of reads per transcript
                method: custom         # Supported values: custom

            html_table:                # Summary HTML table with results of differential expressions
                method: custom         # Supported values: custom
                annotation:            # Annotate with attributes from external sources
                    - source: pombase  # Annotate with attributes from reference/{reference}/annotation.transcripts/pombas/attributes.tsv
                      attributes:      # Annotate with listed attributes only
                        - link         # Annotate with attribute link
                    - source: go       # Annotate with attributes from reference/{reference}/annotation.transcripts/go/attributes.tsv
                                       # Annotate with all attributes since explicit attributes are not defined

            revigo:                    # GO annotation terms in format suitable for visualisation on the ReviGO website (http://revigo.irb.hr/)
                method: custom         # Supported values: custom


            pca:
                method: sklearn     # Supported values: sklearn
                formats:            # Output format of the resulting images
                    - png
                    - svg

Planned improvements¶

Aggregate quality statistics of preprocess and mapping with the MultiQC
Add mapped based approach with resulting coverage tracks

Analyse gene expression¶

Purpose¶

Required inputs¶

Generated outputs¶

Example¶

Planned improvements¶

Included pipelines¶

Table of Contents

Previous topic

Next topic

This Page