Analyse gene expression¶
Determine level of expression of individual transcripts from sequenced RNA. Compare expressions between samples from two conditions to identify transcripts with changed expression.
Purpose¶
- Identify which transcripts are expressed
- Compare expression of transcripts with each other in a single sample
- Compare changes of expression between samples with different conditions, e.g.
- environment
- tissues
- phenotypic trait
- Assess impact of a gene on biochemical pathways (compare wild type with sample without the gene)
Required inputs¶
- Sequenced paired-end reads from Illumina sequencer in gzipped fastq format.
- each sample is represented by two gzipped fastq files
- standard output files of paired-end sequencing
- Transcripts of reference genome in fasta format
- Sample metadata file in TSV format
- each row describe a single sequenced sample
- the first, ‘sample-id’ column represents names of fastq files without _R[12].fastq.gz part, e.g. sample_1, sample_2
- other columns represent attributes of sequenced samples that may be used to separate samples into categories for comparison, e.g. environment, tissue
|-- description
|-- sample-metadata.csv
|-- reads/original
|-- <sample_1>_R1.fastq.gz
|-- <sample_1>_R2.fastq.gz
|-- <sample_2>_R1.fastq.gz
|-- <sample_2>_R2.fastq.gz
|-- reference/<reference>
|-- <reference>.transcripts.fa
Generated outputs¶
- Summary table of number of transcripts per sample, normalised by read count and length of transcripts
- Graphical, 2D PCA comparison of samples based on their expression to visually assess relationships between them
- Set of transcripts with significant set of expression between selected conditions
Example¶
How to run example:
cd /usr/local/snakelines/example/rnaseq
snakemake \
--snakefile ../../snakelines.snake \
--configfile config_transcriptomics.yaml \
--use-conda
Example configuration:
sequencing: paired_end
samples: # List of sample categories to be analysed
- name: .* # Regex expression of sample names to be analysed (reads/original/.*-16S_R1.fastq.gz)
reference: pombe # Reference genome for reads in the category (reference/silva-16S/silva-16S.fa)
report_dir: report/public/01-rnaseq # Generated reports and essential output files would be stored there
threads: 16 # Number of threads to use in analysis
reads: # Prepare reads and quality reports for downstream analysis
preprocess: # Pre-process of reads, eliminate sequencing artifacts, contamination ...
trimmed: # Remove low quality parts of reads
method: trimmomatic # Supported values: trimmomatic
temporary: False # If True, generated files would be removed after successful analysis
crop: 500 # Maximal number of bases in read to keep. Longer reads would be truncated.
quality: 20 # Minimal average quality of read bases to keep (inside sliding window of length 5)
headcrop: 20 # Number of bases to remove from the start of read
minlen: 35 # Minimal length of trimmed read. Shorter reads would be removed.
report: # Summary reports of read characteristics to assess their quality
quality_report: # HTML summary report of read quality
method: fastqc # Supported values: fastqc
read_types: # List of preprocess steps for quality reports
- original
- trimmed
classification: # Identify genomic source of sequenced reads
transcripts: # Find homologue sequences by comparing reads to reference sequences
method: salmon # Supported values: salmon
library: # According to notation in https://salmon.readthedocs.io/en/latest/library_type.html
orientation: inward # Relative orientation of reads in DNA fragments - inward/outward/matching
strandedness: stranded # Preserve sequencing DNA strand of fragment? - stranded/unstranded
direction: reverse # Orientation of the first read in a fragment - forward/reverse
count_type: mapped_reads # Type of counts to report - mapped_reads/tpm (transcripts per kilobase million)
differential_analysis: # Find transcripts with significant change across two sample groups
method: edger # Supported values: edger
group_by: culture # Attribute name (in the metadata file header) that split samples into two groups for comparison
batch: seqrun # Attribute name (in the metadata file header) that split samples into groups with similar batch effect for correction
filter_significant: # Filter transcripts with significant change in expression
method: custom # Supported values: custom
max_fdr: 0.05 # Maximal value of fold discovery change for transcript to be reported
min_fold_change: 1.5 # Minimal value of fold change for transcript to be reported
reproducible_expression: True # At least one read must be mapped to transcript in all samples from over-expressed group to be reported
report:
transcripts:
count_table: # Summary table with number of reads per transcript
method: custom # Supported values: custom
html_table: # Summary HTML table with results of differential expressions
method: custom # Supported values: custom
annotation: # Annotate with attributes from external sources
- source: pombase # Annotate with attributes from reference/{reference}/annotation.transcripts/pombas/attributes.tsv
attributes: # Annotate with listed attributes only
- link # Annotate with attribute link
- source: go # Annotate with attributes from reference/{reference}/annotation.transcripts/go/attributes.tsv
# Annotate with all attributes since explicit attributes are not defined
revigo: # GO annotation terms in format suitable for visualisation on the ReviGO website (http://revigo.irb.hr/)
method: custom # Supported values: custom
pca:
method: sklearn # Supported values: sklearn
formats: # Output format of the resulting images
- png
- svg