Map paired-end reads to a reference genome
==========================================

Align sequenced paired-end reads to a reference genome to find the most probable genomic positions of origin.
The reference genome may represent the DNA of a single organism, multiple organisms, transcripts...

Purpose
-------

* Identify the source of sequenced material
* Assess the composition of sequenced material
* Remove known contamination
* Purify sequenced reads - select only reads from the target organism
* Essential step for several downstream analysis types - variant calling, transcriptomics ...

Required inputs
---------------

* Sequenced paired-end reads from an Illumina sequencer in gzipped fastq format.

  * each sample is represented by two gzipped fastq files
  * standard output files of paired-end sequencing

* Reference genome in fasta format

::

   |-- reads/original
           |-- _R1.fastq.gz
           |-- _R2.fastq.gz
           |-- _R1.fastq.gz
           |-- _R2.fastq.gz
   |-- reference/
           |-- .fa

Generated outputs
-----------------

* Mapped reads in sorted, indexed BAM format with marked duplicates
* Reports to assess mapping quality

  * individual report for each sample
  * summary report for comparison of multiple samples

Example
-------

How to run the example:

.. code-block:: bash

   cd /usr/local/snakelines/example/genomic

   snakemake \
      --snakefile ../../snakelines.snake \
      --configfile config_mapping.yaml \
      --use-conda

Example configuration:

.. literalinclude:: ../../example/genomic/config_mapping.yaml
   :language: yaml

Planned improvements
--------------------

* Aggregate quality statistics from preprocessing and mapping with MultiQC
* Realignment postprocessing step to refine alignments in indel regions

Included pipelines
------------------

.. toctree::
   :maxdepth: 2

   /pipelines/quality_report
   /pipelines/preprocess_paired_end
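
The required inputs expect every sample to have both an ``_R1`` and an ``_R2`` gzipped fastq file in ``reads/original``. As a minimal sketch (not part of the pipeline itself), the following Python helper lists which samples in a directory are complete pairs and which are missing a mate. The function name ``find_paired_samples`` and the sample names in the demo are hypothetical and only serve as illustration.

.. code-block:: python

   import tempfile
   from pathlib import Path

   SUFFIX = ".fastq.gz"


   def find_paired_samples(reads_dir):
       """Return (complete, incomplete) lists of sample names.

       A sample is complete when both <sample>_R1.fastq.gz and
       <sample>_R2.fastq.gz are present in reads_dir.
       """
       mates = {}
       for path in Path(reads_dir).glob("*_R[12]" + SUFFIX):
           # Strip the suffix, then split off the trailing R1/R2 tag.
           sample, mate = path.name[: -len(SUFFIX)].rsplit("_", 1)
           mates.setdefault(sample, set()).add(mate)

       complete = sorted(s for s, m in mates.items() if m == {"R1", "R2"})
       incomplete = sorted(s for s, m in mates.items() if m != {"R1", "R2"})
       return complete, incomplete


   if __name__ == "__main__":
       # Demo on empty placeholder files; the sample names are made up.
       with tempfile.TemporaryDirectory() as tmp:
           for name in ("sampleA_R1.fastq.gz",
                        "sampleA_R2.fastq.gz",
                        "sampleB_R1.fastq.gz"):
               (Path(tmp) / name).touch()
           complete, incomplete = find_paired_samples(tmp)
           print("complete pairs:", complete)
           print("missing a mate:", incomplete)

Running such a check on ``reads/original`` before starting ``snakemake`` may catch a missing mate file early, although it only inspects file names and does not validate file contents.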