Useful bash aliases

Snakemake is very flexible in how it executes workflows; see its documentation for a detailed description. The aliases below cover multi-threaded local execution, running on an SGE cluster, and visualization of the individual pipeline steps.

  • dsnake - shows the rules to be executed without starting the analysis (--dryrun)
  • vsnake - displays the rules to be executed and their dependencies as a graph
  • snake - runs the standard SnakeLines pipeline in the current working directory
  • qsnake - distributes jobs across the cluster; use for multi-threaded rules
  • fsnake - distributes jobs across the cluster; use for single-threaded rules

Running SnakeLines

To load the aliases at login, add them to your ~/.bash_aliases file.

vim ~/.bash_aliases

Add the aliases for running pipelines locally. Set the USE_CONDA variable to true to run Snakemake with conda; CONDA_DIR is the location where tools downloaded from the Anaconda repository are stored.

THREADS_LOCAL=4     # Number of used threads, when running pipelines locally
SNAKELINES_DIR=/data/snakelines/$USER   # Path to snakelines source files
USE_CONDA=true
CONDA_DIR=/data/snakelines/snakemake_repos
if [ "$USE_CONDA" = true ] ; then
  alias basesnake='snakemake -d `pwd` --jobname {rulename}.{jobid} --reason --printshellcmds --snakefile $SNAKELINES_DIR/snakelines.snake --use-conda --conda-prefix=$CONDA_DIR'
else
  alias basesnake='snakemake -d `pwd` --jobname {rulename}.{jobid} --reason --printshellcmds --snakefile $SNAKELINES_DIR/snakelines.snake'
fi

alias snake='basesnake --config threads=$THREADS_LOCAL --cores $THREADS_LOCAL'
alias dsnake='basesnake --config threads=$THREADS_LOCAL --dryrun'

function vsnake {
   dsnake "$@" --rulegraph | dot | display
}
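
The two basesnake definitions above differ only in the conda options. One way to keep that difference in a single place is to compute the extra options separately. The following is a minimal sketch, not part of SnakeLines: conda_opts is a hypothetical helper name, and the variable values are copied from the configuration above.

```shell
USE_CONDA=true
CONDA_DIR=/data/snakelines/snakemake_repos

# Hypothetical helper: prints the extra snakemake options implied by USE_CONDA.
# Prints nothing when conda is disabled.
conda_opts() {
    if [ "$USE_CONDA" = true ]; then
        echo "--use-conda --conda-prefix=$CONDA_DIR"
    fi
}

conda_opts
```

With USE_CONDA=true this prints the two conda options; with any other value it prints an empty result, so the output could be appended to a single snakemake command line.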

You may also run pipelines on a computational cluster:

# SGE job scheduling system specific
SGE_PRIORITY=0   # Priority of tasks on cluster
SGE_THREADS=16   # Number of used threads, when running pipelines on cluster
SGE_NODES=10     # Number of computational nodes in the cluster
SGE_CORES=160    # Total number of cpus on all nodes in the cluster
SGE_QUEUE_MAIN=main.q   # Default queue for submitted jobs
SGE_CLUSTER_PARAMS="qsub -cwd -o log/ -e log/ -p $SGE_PRIORITY -r yes" # Additional parameters for the SGE scheduler

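# Usage: clustersnake THREADS JOBS QUEUE [snakemake options...]
function clustersnake {
    # Each run gets its own log directory named after the start time
    NOW=$(date "+%y-%m-%d_%H-%M-%S")
    LOGDIR=log/$NOW
    mkdir -p "$LOGDIR"
    # Repoint log/last to the directory of this run
    rm -f log/last
    ln -f -s "$NOW" log/last
    basesnake --config threads=$1 --cluster "$SGE_CLUSTER_PARAMS -q $3 -l thr=$1 -o $LOGDIR -e $LOGDIR" --jobs $2 "${@:4}"
}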

alias qsnake='clustersnake 16 10 $SGE_QUEUE_MAIN'
alias fsnake='clustersnake 1 160 $SGE_QUEUE_MAIN'
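
To check what a cluster alias would submit without touching the scheduler, the command line can be assembled and printed instead of executed. The sketch below is a hypothetical helper (clustersnake_preview is not part of SnakeLines) that follows the argument order of clustersnake: threads, jobs, queue, then any further Snakemake options. The log directory is shown as a placeholder rather than a real timestamp.

```shell
# Values copied from the configuration above
SGE_PRIORITY=0
SGE_CLUSTER_PARAMS="qsub -cwd -o log/ -e log/ -p $SGE_PRIORITY -r yes"

# Hypothetical helper (not part of SnakeLines): prints the call that
# clustersnake would make instead of running it.
# Usage: clustersnake_preview THREADS JOBS QUEUE [snakemake options...]
clustersnake_preview() {
    local threads=$1 jobs=$2 queue=$3
    shift 3
    local logdir="log/<timestamp>"   # stands in for the per-run log directory
    echo "basesnake --config threads=$threads" \
         "--cluster \"$SGE_CLUSTER_PARAMS -q $queue -l thr=$threads -o $logdir -e $logdir\"" \
         "--jobs $jobs $*"
}

clustersnake_preview 16 10 main.q --configfile config_variant_calling.yaml
```

The first argument appears twice in the result, once as the threads config value passed to the pipeline and once as the thr resource requested from the scheduler, which is why qsnake and fsnake differ only in their numeric arguments.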

The aliases will be ready the next time you log into the system. If you do not want to log in again, you can reload them in the current shell:

source ~/.bash_aliases

Logging

When a pipeline runs on the computational cluster through the qsnake or fsnake alias, logs are stored in the log/ directory. Each pipeline run has its own subdirectory named after the time of execution, and last is a symbolic link to the most recent run. For example:

log/
   |-- 18-10-08_16-19-48/
   |-- 18-10-08_17-16-23/
   |-- last -> 18-10-08_17-16-23/
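
The layout above results from two steps: creating a directory named after the current time, then repointing the last symbolic link at it. A minimal sketch that can be tried safely in a scratch directory follows; new_log_dir is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: creates a timestamped run directory under log/
# and repoints the log/last symlink to it.
new_log_dir() {
    local now
    now=$(date "+%y-%m-%d_%H-%M-%S")
    mkdir -p "log/$now"
    rm -f log/last
    ln -s "$now" log/last      # relative link, resolved inside log/
    echo "log/$now"
}

cd "$(mktemp -d)"              # scratch directory, for demonstration only
run_dir=$(new_log_dir)
readlink log/last              # prints the directory name of the latest run
```

Because the link target is relative, log/last keeps working if the whole project directory is moved or copied.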

Summary logs of the last run may be examined using the log function. It must be called from inside the project directory, because it uses the current path to find the logs to display.

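function log {
   # Select which files to show: e = error stream (default), o = standard output, a = all
   TYPE=e
   if [ "$1" == "o" ]; then
      TYPE=o
   elif [ "$1" == "a" ]; then
      TYPE=
   fi

   # The project root is taken as the first three components of the current path,
   # for example /data/projects/example
   cat `python -c 'import os; cwd = os.getcwd(); print("/".join(cwd.split("/")[:4]))'`/log/last/*.$TYPE* | less
}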

The command log e (or log without arguments) displays only error messages, log o displays messages from the standard output stream, and log a displays both.
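
The log function locates the project by keeping only the first three components of the current path, so it assumes that projects live three levels below the filesystem root, as in /data/projects/example. The same truncation can be sketched without Python using cut; project_root is a hypothetical helper name, not part of SnakeLines.

```shell
# Hypothetical helper: keeps the leading "/" plus the first three path components.
# Mirrors "/".join(cwd.split("/")[:4]) from the log function.
# Uses the current directory when no path is given.
project_root() {
    printf '%s\n' "${1:-$PWD}" | cut -d/ -f1-4
}

project_root /data/projects/example/mapping/sample_1   # prints /data/projects/example
```

If your projects sit at a different depth, the field range passed to cut (and the slice in the Python one-liner) would need to change accordingly.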

Example run

Assuming your input files are in a SnakeLines-compatible project structure, you may start the analysis using these aliases:

# Start a screen session, so the analysis does not terminate if the connection fails
screen

# Always run in the project root directory
cd /data/projects/example

# First try a dry run to check that the pipeline is configured correctly
dsnake --configfile config_variant_calling.yaml

# Optionally visualise the pipeline - practical only with a small number of samples
vsnake --configfile config_variant_calling.yaml

# Run a test analysis with one small sample
snake --configfile config_variant_calling.yaml

# Distribute tasks for all samples on cluster
## For multi-threaded analysis
qsnake --configfile config_variant_calling.yaml
## For single-threaded analysis
fsnake --configfile config_variant_calling.yaml

Changing SGE queue

The SGE engine supports organising computational nodes into groups called queues. You may specify which queue to use as the third argument of the clustersnake command. Alternatively, you may prepare your own aliases for each cluster queue.

For example, assume a computational cluster with 8 nodes organised into the following queues:

  • main.q - nodes 1, 2, 3, 4, 5, 6, 7, 8
  • pat.q - nodes 1, 2, 3, 4
  • mat.q - nodes 5, 6, 7, 8

Aliases for SnakeLines calls may be specified as:

SGE_QUEUE_MAIN=main.q
alias qsnake='clustersnake 16 10 $SGE_QUEUE_MAIN'
alias fsnake='clustersnake 1 160 $SGE_QUEUE_MAIN'

# Cluster specific
SGE_QUEUE_MAT='mat.q'
SGE_QUEUE_PAT='pat.q'

alias qsnake.pat='clustersnake 16 4 $SGE_QUEUE_PAT'
alias fsnake.pat='clustersnake 1 64 $SGE_QUEUE_PAT'
alias qsnake.mat='clustersnake 16 4 $SGE_QUEUE_MAT'
alias fsnake.mat='clustersnake 1 64 $SGE_QUEUE_MAT'

Calling these aliases distributes the tasks as follows:

# Distribute tasks on all cluster nodes - 1-8
qsnake --configfile config_variant_calling.yaml

# Distribute tasks on 4 cluster nodes - 1-4
qsnake.pat --configfile config_variant_calling.yaml

# Distribute tasks on 4 cluster nodes - 5-8
qsnake.mat --configfile config_variant_calling.yaml
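
The job counts in the per-queue aliases above follow from the size of the queue: multi-threaded runs get one 16-thread job per node, and single-threaded runs get one job per core. A sketch of that arithmetic follows, assuming 16 cores per node (inferred from 64 single-threaded jobs on 4 nodes; adjust for your hardware). queue_jobs is a hypothetical helper name.

```shell
CORES_PER_NODE=16   # assumption: inferred from 64 single-threaded jobs on 4 nodes

# Hypothetical helper: prints the number of concurrent jobs for a queue with
# the given number of nodes when each job uses the given number of threads.
# Usage: queue_jobs NODES THREADS
queue_jobs() {
    local nodes=$1 threads=$2
    echo $(( nodes * CORES_PER_NODE / threads ))
}

queue_jobs 4 16    # multi-threaded jobs on pat.q or mat.q, prints 4
queue_jobs 4 1     # single-threaded jobs on pat.q or mat.q, prints 64
```

The printed values are what the aliases pass as the second argument of clustersnake.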