Useful bash aliases
===================

Snakemake is very flexible in workflow execution; see its documentation for a detailed description.
Here you may find some useful aliases for multi-threaded execution, running on an SGE cluster, or visualization of individual steps of the pipeline.

* dsnake - shows rules to be executed, but does not start the analysis (``--dryrun``)
* vsnake - visually examines rules to be executed and their dependencies
* snake - runs the standard SnakeLines pipeline in the current working directory
* qsnake - distributes snake jobs on the cluster - use for multi-threaded rules
* fsnake - distributes snake jobs on the cluster - use for single-threaded rules

Running SnakeLines
------------------

Edit your ~/.bash_aliases file to load the aliases when you log into the system.

.. code-block:: bash

   vim ~/.bash_aliases

Add aliases for running pipelines locally.
The `USE_CONDA` variable switches Conda support on or off, and `CONDA_DIR` sets the location where tools downloaded from the Anaconda repository are stored.

.. code-block:: bash

   THREADS_LOCAL=4                         # Number of used threads, when running pipelines locally
   SNAKELINES_DIR=/data/snakelines/$USER   # Path to snakelines source files
   USE_CONDA=true
   CONDA_DIR=/data/snakelines/snakemake_repos

   if [ "$USE_CONDA" = true ] ; then
       # The Conda options must be inside the quotes to become part of the alias
       alias basesnake='snakemake -d `pwd` --jobname {rulename}.{jobid} --reason --printshellcmds --snakefile $SNAKELINES_DIR/snakelines.snake --use-conda --conda-prefix=$CONDA_DIR'
   else
       alias basesnake='snakemake -d `pwd` --jobname {rulename}.{jobid} --reason --printshellcmds --snakefile $SNAKELINES_DIR/snakelines.snake'
   fi

   alias snake='basesnake --config threads=$THREADS_LOCAL --cores $THREADS_LOCAL'
   alias dsnake='basesnake --config threads=$THREADS_LOCAL --dryrun'

   # Render the rule graph of a dry run and show it on screen
   function vsnake {
       dsnake "$@" --rulegraph | dot | display
   }

You may also run scripts on a computational cluster.

.. code-block:: bash

   # SGE job scheduling system specific
   SGE_PRIORITY=0          # Priority of tasks on cluster
   SGE_THREADS=16          # Number of used threads, when running pipelines on cluster
   SGE_NODES=10            # Number of computational nodes in the cluster
   SGE_CORES=160           # Total number of cpus on all nodes in the cluster
   SGE_QUEUE_MAIN=main.q
   # Additional parameters for the SGE scheduler
   SGE_CLUSTER_PARAMS="qsub -cwd -o log/ -e log/ -p $SGE_PRIORITY -r yes"

   # Usage: clustersnake THREADS JOBS QUEUE [snakemake options...]
   function clustersnake {
       NOW=`date "+%y-%m-%d_%H-%M-%S"`
       LOGDIR=log/$NOW

       # Create a log directory for this run and point log/last at it
       mkdir -p $LOGDIR
       rm -f log/last
       ln -f -s $NOW log/last

       basesnake --config threads=$1 --cluster "$SGE_CLUSTER_PARAMS -q $3 -l thr=$1 -o $LOGDIR -e $LOGDIR" --jobs $2 "${@:4}"
   }

   alias qsnake='clustersnake 16 10 $SGE_QUEUE_MAIN'
   alias fsnake='clustersnake 1 160 $SGE_QUEUE_MAIN'

The aliases will be available the next time you log into the system.
If you do not want to log in again, you may reload the aliases:

.. code-block:: bash

   source ~/.bash_aliases

Logging
-------

Logs are stored in the log/ directory when the pipeline runs on the computational cluster using the qsnake or fsnake alias.
Each pipeline run has its own directory named after the time of execution.
The last/ entry is a symbolic link to the most recently executed analysis.
For example:

::

   log/
     |-- 18-10-08_16-19-48/
     |-- 18-10-08_17-16-23/
     |-- last -> 18-10-08_17-16-23/

Summary logs of the last run may be examined using the log function.
You must be inside the project directory, so that the function can find which logs to display.

.. code-block:: bash

   function log {
       # Select the log type: e = error stream (default), o = standard output, a = all
       TYPE=e
       if [ "$1" == "o" ]; then
           TYPE=o
       elif [ "$1" == "a" ]; then
           TYPE=
       fi

       # The project root is taken as the first three components of the current path,
       # for example /data/projects/example
       cat `python -c 'import os; cwd = os.getcwd(); print("/".join(cwd.split("/")[:4]))'`/log/last/*.$TYPE* | less
   }

The command ``log e`` (or plain ``log``) displays only error messages, ``log o`` only messages from the standard output stream, and ``log a`` both.

Example run
-----------

Assuming your input files are in the SnakeLines compatible project structure, you may start the analysis using these aliases:

.. code-block:: bash

   # Start a screen session, so the analysis does not terminate if the connection fails
   screen

   # Always run in the project root directory
   cd /data/projects/example

   # First try a dry run to check that the pipeline is correct
   dsnake --configfile config_variant_calling.yaml

   # Optionally visualise the pipeline - practical only with a small number of samples
   vsnake --configfile config_variant_calling.yaml

   # Run a test analysis with one small sample
   snake --configfile config_variant_calling.yaml

   # Distribute tasks for all samples on the cluster
   ## For multi-threaded analysis
   qsnake --configfile config_variant_calling.yaml

   ## For single-threaded analysis
   fsnake --configfile config_variant_calling.yaml

Changing SGE queue
------------------

The SGE engine supports organising computational nodes into groups called queues.
You may specify which queue you want to use as the third argument of the `clustersnake` command.
Alternatively, you may prepare your own aliases for each cluster queue.
For example, assume a computational cluster with 8 computational nodes organised into groups:

* main.q - nodes 1, 2, 3, 4, 5, 6, 7, 8
* pat.q - nodes 1, 2, 3, 4
* mat.q - nodes 5, 6, 7, 8

Aliases for SnakeLines calls may be specified as:

.. code-block:: bash

   SGE_QUEUE_MAIN=main.q
   alias qsnake='clustersnake 16 10 $SGE_QUEUE_MAIN'
   alias fsnake='clustersnake 1 160 $SGE_QUEUE_MAIN'

   # Cluster specific
   SGE_QUEUE_MAT='mat.q'
   SGE_QUEUE_PAT='pat.q'

   alias qsnake.pat='clustersnake 16 4 $SGE_QUEUE_PAT'
   alias fsnake.pat='clustersnake 1 64 $SGE_QUEUE_PAT'
   alias qsnake.mat='clustersnake 16 4 $SGE_QUEUE_MAT'
   alias fsnake.mat='clustersnake 1 64 $SGE_QUEUE_MAT'

The aliases may then be called as follows:

.. code-block:: bash

   # Distribute tasks on all cluster nodes - 1-8
   qsnake --configfile config_variant_calling.yaml

   # Distribute tasks on 4 cluster nodes - 1-4
   qsnake.pat --configfile config_variant_calling.yaml

   # Distribute tasks on 4 cluster nodes - 5-8
   qsnake.mat --configfile config_variant_calling.yaml
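
The job counts passed to `clustersnake` in the per-queue aliases appear to follow from the size of the queue.
Assuming 16 cores per node (an assumption inferred from ``SGE_CORES=160`` and ``SGE_NODES=10``), a queue of 4 nodes fits 4 jobs of 16 threads each, or 64 single-threaded jobs.
A minimal sketch of that arithmetic, using a hypothetical helper ``queue_jobs`` that is not part of SnakeLines:

.. code-block:: bash

   # Hypothetical helper, not part of SnakeLines.
   # Usage: queue_jobs NODES CORES_PER_NODE THREADS
   # Prints how many jobs of THREADS threads fit into the queue at once.
   function queue_jobs {
       local nodes=$1 cores_per_node=$2 threads=$3
       echo $(( nodes * cores_per_node / threads ))
   }

   queue_jobs 4 16 16   # multi-threaded rules on pat.q or mat.q, prints 4
   queue_jobs 4 16 1    # single-threaded rules on pat.q or mat.q, prints 64

The printed values may be used as the second argument of `clustersnake` when defining aliases for a new queue; adjust the number of cores per node to match your own cluster.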