[program,review] Why not snakemake?

Motivation: “Why not snakemake?”

My feelings about Snakemake went through a complex journey as my skills and views developed. From a hobbyist's point of view, I thoroughly hated it, for several reasons:

Poor portability/conda-dependent

You cannot execute *.smk scripts with the commonly found Python interpreter. The problem resembles that of SCons, whose SConscript files require scons to execute, so scons itself must be installed first.

This means Snakemake suffers from a portability curse. At the time of this blog, Snakemake does not offer a binary on its GitHub release page https://github.com/snakemake/snakemake/releases/tag/v5.22.1, which means you have to rely on conda or pip for installation: pip is fragile, and conda can be confusing. For a pipeline management tool, it is an utter shame not to offer a compiled binary.

Customised syntax means a lot of learning

Different from SCons, Snakemake introduces many keywords that must be followed by an indented block, like rule: and onstart:, which look like the native Python keywords if: and while True:. Keywords are useful building blocks that enrich a program. The price, however, is an increase in control-flow complexity and in learning overhead. Common Python packages offer objects to be imported, which could be functions or classes, but a keyword is not one of them. Given that Python 2 users could hate Python 3 syntax, no wonder a Python user could hate Snakemake syntax: it is basically a new language to be learned, with far fewer tutorials than Python.

Using Python means using brackets when specifying objects and colons when writing control flow. Snakemake, however, specifies objects using colons, and that is why it feels unidiomatic.
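To make the contrast concrete, here is a hypothetical sketch (not any real package's API) of what the "importable objects" style would look like in plain Python: rules registered via an ordinary decorator instead of new block keywords.

```python
# Hypothetical sketch: a rule registry built from importable Python objects,
# instead of block keywords like "rule:" or "onstart:".
RULES = {}

def rule(output, inputs=()):
    """Register a build step producing `output` from `inputs`."""
    def register(fn):
        RULES[output] = {"inputs": tuple(inputs), "action": fn}
        return fn
    return register

@rule("result", inputs=["inputA", "inputB"])
def make_result():
    return "creating result"
```

Everything here is ordinary Python syntax, so any Python interpreter, linter, or debugger handles it; that is exactly the trade-off the colon-based DSL gives up.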

Too many features

This is not strictly a minus point, but the smk CLI interface is a bit difficult to navigate:

usage: snakemake [-h] [--dry-run] [--profile PROFILE]
                 [--cache [RULE [RULE ...]]] [--snakefile FILE] [--cores [N]]
                 [--local-cores N] [--resources [NAME=INT [NAME=INT ...]]]
                 [--set-threads [RULE=THREADS [RULE=THREADS ...]]]
                 [--default-resources [NAME=INT [NAME=INT ...]]]
                 [--config [KEY=VALUE [KEY=VALUE ...]]]
                 [--configfile FILE [FILE ...]]
                 [--envvars VARNAME [VARNAME ...]] [--directory DIR] [--touch]
                 [--keep-going] [--force] [--forceall]
                 [--forcerun [TARGET [TARGET ...]]]
                 [--prioritize TARGET [TARGET ...]]
                 [--batch RULE=BATCH/BATCHES] [--until TARGET [TARGET ...]]
                 [--omit-from TARGET [TARGET ...]] [--rerun-incomplete]
                 [--shadow-prefix DIR] [--report [FILE]]
                 [--report-stylesheet CSSFILE] [--edit-notebook TARGET]
                 [--notebook-listen IP:PORT] [--lint [{text,json}]]
                 [--export-cwl FILE] [--list] [--list-target-rules] [--dag]
                 [--rulegraph] [--filegraph] [--d3dag] [--summary]
                 [--detailed-summary] [--archive FILE]
                 [--cleanup-metadata FILE [FILE ...]] [--cleanup-shadow]
                 [--skip-script-cleanup] [--unlock] [--list-version-changes]
                 [--list-code-changes] [--list-input-changes]
                 [--list-params-changes] [--list-untracked]
                 [--delete-all-output] [--delete-temp-output]
                 [--bash-completion] [--keep-incomplete] [--version]
                 [--reason] [--gui [PORT]] [--printshellcmds] [--debug-dag]
                 [--stats FILE] [--nocolor] [--quiet] [--print-compilation]
                 [--verbose] [--force-use-threads] [--allow-ambiguity]
                 [--nolock] [--ignore-incomplete] [--latency-wait SECONDS]
                 [--wait-for-files [FILE [FILE ...]]] [--notemp]
                 [--keep-remote] [--keep-target-files]
                 [--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
                 [--max-jobs-per-second MAX_JOBS_PER_SECOND]
                 [--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
                 [-T RESTART_TIMES] [--attempt ATTEMPT]
                 [--wrapper-prefix WRAPPER_PREFIX]
                 [--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob}]
                 [--default-remote-prefix DEFAULT_REMOTE_PREFIX]
                 [--no-shared-fs] [--greediness GREEDINESS] [--no-hooks]
                 [--overwrite-shellcmd OVERWRITE_SHELLCMD] [--debug]
                 [--runtime-profile FILE] [--mode {0,1,2}]
                 [--show-failed-logs] [--log-handler-script FILE]
                 [--log-service {none,slack}]
                 [--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
                 [--cluster-config FILE] [--immediate-submit]
                 [--jobscript SCRIPT] [--jobname NAME]
                 [--cluster-status CLUSTER_STATUS] [--drmaa-log-dir DIR]
                 [--kubernetes [NAMESPACE]] [--container-image IMAGE]
                 [--tibanna] [--tibanna-sfn TIBANNA_SFN]
                 [--precommand PRECOMMAND]
                 [--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]]
                 [--google-lifesciences]
                 [--google-lifesciences-regions GOOGLE_LIFESCIENCES_REGIONS [GOOGLE_LIFESCIENCES_REGIONS ...]]
                 [--google-lifesciences-location GOOGLE_LIFESCIENCES_LOCATION]
                 [--google-lifesciences-keep-cache] [--use-conda]
                 [--list-conda-envs] [--conda-prefix DIR]
                 [--conda-cleanup-envs]
                 [--conda-cleanup-pkgs [{tarballs,cache}]]
                 [--conda-create-envs-only] [--conda-frontend {conda,mamba}]
                 [--use-singularity] [--singularity-prefix DIR]
                 [--singularity-args ARGS] [--use-envmodules]
                 [target [target ...]]

There may be more reasons that I forgot to mention, but the two above were more than enough to keep me away from the smk world as a hobbyist. Overall, snakemake is just an optional tool to help with your objective.

Why snakemake is useful

Nevertheless, smk does deserve some compliments.

DAG specification in a makefile fashion, with easy visualisation

This is very useful for controlling dependency between files.

rule result:
    input: "inputA", "inputB"
    shell: 'echo "creating result"'
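Under the hood, a makefile-style declaration like this is just a dependency graph: targets are nodes, inputs are edges, and the scheduler runs a topological sort. A minimal Python sketch of that idea, using only the standard library (this is an illustration of the concept, not Snakemake's actual internals):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Target -> set of prerequisites, mirroring the rule above.
deps = {"result": {"inputA", "inputB"}}

# A topological order guarantees every input is produced before
# the target that consumes it.
order = list(TopologicalSorter(deps).static_order())
```

The `--dag`, `--rulegraph`, and `--filegraph` flags in the usage listing above render exactly this graph for visual inspection.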

Compatibility with many cluster management tools

This means easier integration into existing platforms (provided a vanilla implementation does not throw an error).

Performance profiling

Useful for measuring how long a pipeline takes to run and for optimising it further.
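A rough sketch of what such profiling boils down to: timing each step with the standard library and collecting the results (Snakemake's own --runtime-profile output, shown in the usage listing above, is of course much richer than this toy version).

```python
import time

def profile_steps(steps):
    """Run each (name, fn) pair and record its wall-clock duration in seconds."""
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# Two toy steps standing in for real pipeline rules.
timings = profile_steps([
    ("align", lambda: sum(range(10_000))),
    ("count", lambda: sum(range(1_000))),
])
```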

Conclusion

Snakemake is a pipeline management tool that helps you structure your pipelines. But be prepared to invest some learning time before you can start skating with it.

Example from snakePipes

https://raw.githubusercontent.com/maxplanck-ie/snakepipes/master/snakePipes/workflows/mRNA-seq/Snakefile

import os
import snakePipes.common_functions as cf


### snakemake_workflows initialization ########################################
maindir = os.path.dirname(os.path.dirname(workflow.basedir))

# load conda ENVs (path is relative to "shared/rules" directory)
globals().update(cf.set_env_yamls())

# load config file
globals().update(cf.load_configfile(workflow.overwrite_configfiles[0], config["verbose"]))
# load organism-specific data, i.e. genome indices, annotation, etc.
globals().update(cf.load_organism_data(genome, maindir, config["verbose"]))
# return the pipeline version in the log
cf.get_version()

# do workflow specific stuff now
include: os.path.join(workflow.basedir, "internals.snakefile")

### include modules of other snakefiles ########################################
################################################################################

cf.namesOKinR(samples)

# deeptools cmds
include: os.path.join(maindir, "shared", "tools" , "deeptools_cmds.snakefile")

## filtered annotation (GTF)
include: os.path.join(maindir, "shared", "rules", "filterGTF.snakefile")

## bamCoverage_RPKM
include: os.path.join(maindir, "shared", "rules", "deepTools_RNA.snakefile")


if not fromBAM:
    ## FASTQ: either downsample FASTQ files or create symlinks to input files
    include: os.path.join(maindir, "shared", "rules", "FASTQ.snakefile")

    ## FastQC
    if fastqc:
        include: os.path.join(maindir, "shared", "rules", "FastQC.snakefile")

    ## Trim
    if trim:
        include: os.path.join(maindir, "shared", "rules", "trimming.snakefile")

    ## sambamba
    include:os.path.join(maindir, "shared", "rules", "sambamba.snakefile")

    #umi_tools
    include: os.path.join(maindir, "shared", "rules", "umi_tools.snakefile")

    ## Allele-specific JOBS
    if "allelic-mapping" in mode:
        # Updated global vars if mode = "allelic-mapping"
        if allele_mode == 'create_and_map':
            star_index_allelic = 'snp_genome/star_Nmasked/Genome'
            if len(strains) == 1:
                allele_hybrid = 'single'
                SNPFile = "snp_genome/all_SNPs_" + strains[0] + "_" + genome + ".txt.gz"
            elif len(strains) == 2:
                allele_hybrid = 'dual'
                SNPFile = "snp_genome/all_" + strains[1] + "_SNPs_" + strains[0] + "_reference.based_on_" + genome + ".txt"

            include: os.path.join(maindir, "shared", "rules", "masked_genomeIndex.snakefile")
        elif allele_mode == 'map_only':
            star_index_allelic = NMaskedIndex
            SNPFile = SNPfile
        ## mapping rules
        include: os.path.join(maindir, "shared", "rules", "RNA_mapping_allelic.snakefile")
        ## SNPsplit
        include: os.path.join(maindir, "shared", "rules", "SNPsplit.snakefile")
        # deepTools QC
        include: os.path.join(maindir, "shared", "rules", "deepTools_RNA_allelic.snakefile")
    else:
        # HISAT2/STAR
        include: os.path.join(maindir, "shared", "rules", "RNA_mapping.snakefile")

    ## Salmon
    if "alignment-free" in mode:
        include: os.path.join(maindir, "shared", "rules", "Salmon.snakefile")
        ## Sleuth (on Salmon)
        if sampleSheet:
            include: os.path.join(maindir, "shared", "rules", "sleuth.snakefile")

else:
    fastqc=False
    trim=False
    downsample=None

    include: os.path.join(maindir, "shared", "rules", "LinkBam.snakefile")

if "allelic-mapping" in mode:
    ## featureCounts_allelic
    include: os.path.join(maindir, "shared", "rules", "featureCounts_allelic.snakefile")
else:
    ## non-allelic featureCounts
    include: os.path.join(maindir, "shared", "rules", "featureCounts.snakefile")

##Genomic_contamination
include: os.path.join(maindir, "shared", "rules", "GenomicContamination.snakefile")

## QC report
include:os.path.join(maindir, "shared", "rules", "RNA-seq_qc_report.snakefile")

## DESeq2
if sampleSheet:
    include: os.path.join(maindir, "shared", "rules", "DESeq2.snakefile")

## MultiQC
include: os.path.join(maindir, "shared", "rules", "multiQC.snakefile")

### conditional/optional rules #################################################
################################################################################
def run_FastQC(fastqc):
    readsUse = reads
    if not pairedEnd:
        readsUse = [reads[0]]
    if fastqc:
        return( expand("FastQC/{sample}{read}_fastqc.html", sample = samples, read = readsUse) )
    else:
        return([])


def run_Trimming(trim, fastqc):
    readsUse = reads
    if not pairedEnd:
        readsUse = [reads[0]]
    if trim and fastqc:
        return( expand(fastq_dir+"/{sample}{read}.fastq.gz", sample = samples, read = readsUse) +
                expand("FastQC_trimmed/{sample}{read}_fastqc.html", sample = samples, read = readsUse) )
    elif trim:
        return( expand(fastq_dir+"/{sample}{read}.fastq.gz", sample = samples, read = readsUse) )
    else:
        return([])


def run_alignment_free():
    if "alignment-free" in mode:
        file_list = [
        expand("Salmon/{sample}.quant.sf", sample=samples),
        "Salmon/TPM.transcripts.tsv",
        "Salmon/counts.transcripts.tsv",
        expand("Salmon/{sample}/abundance.h5", sample=samples) ]

        if sampleSheet:
            sample_name = os.path.splitext(os.path.basename(sampleSheet))[0]
            file_list.append( ["DESeq2_Salmon_{}/DESeq2.session_info.txt".format(sample_name)])
        if sampleSheet and cf.check_replicates(sampleSheet):
            file_list.append( ["sleuth_Salmon_{}/so.rds".format(sample_name)] )

        return(file_list)
    else:
        return([])


def run_alignment():
    if "alignment" in mode:
        file_list = [
        "Sambamba/flagstat_report_all.tsv",
        expand("filtered_bam/{sample}.filtered.bam", sample = samples),
        expand("filtered_bam/{sample}.filtered.bam.bai", sample = samples),
        expand("featureCounts/{sample}.counts.txt", sample=samples),
        "featureCounts/counts.tsv",
        "QC_report/QC_report_all.tsv"
        ]
        if sampleSheet:
            sample_name = os.path.splitext(os.path.basename(sampleSheet))[0]
            file_list.append( ["DESeq2_{}/DESeq2.session_info.txt".format(sample_name)] )
        return(file_list)
    else:
        return([])

def make_nmasked_genome():
    if allele_mode == 'create_and_map':
        genome1 = "snp_genome/" + strains[0] + '_SNP_filtering_report.txt'
        file_list = [
                genome1,
                SNPFile,
                star_index_allelic,
                ]
        return(file_list)
    else:
        return([])

def run_allelesp_mapping():
    if "allelic-mapping" in mode:
        allele_suffix = ['allele_flagged', 'genome1', 'genome2', 'unassigned']
        file_list = [
        expand("allelic_bams/{sample}.{suffix}.sorted.bam", sample = samples,
                        suffix = allele_suffix),
        expand("allelic_bams/{sample}.{suffix}.sorted.bam.bai", sample = samples,
                        suffix = allele_suffix),
        expand("bamCoverage/allele_specific/{sample}.{suffix}.RPKM.bw", sample = samples,
                        suffix = ['genome1', 'genome2']),
        expand("featureCounts/{sample}.allelic_counts.txt", sample=samples),
        "featureCounts/counts_allelic.tsv"
        ]
        if sampleSheet:
            sample_name = os.path.splitext(os.path.basename(sampleSheet))[0]
            file_list.append( ["DESeq2_{}/DESeq2.session_info.txt".format(sample_name)] )
        return(file_list)
    else:
        return([])

def run_deepTools_qc():
    if "deepTools_qc" in mode:
        file_list = [
        expand("bamCoverage/{sample}.RPKM.bw", sample = samples),
        expand("bamCoverage/{sample}.coverage.bw", sample = samples),
        expand("bamCoverage/{sample}.uniqueMappings.fwd.bw", sample = samples),
        expand("bamCoverage/{sample}.uniqueMappings.rev.bw", sample = samples),
        "deepTools_qc/plotEnrichment/plotEnrichment.tsv",
        expand("deepTools_qc/estimateReadFiltering/{sample}_filtering_estimation.txt",sample = samples)]
        if pairedEnd:
            file_list.append("deepTools_qc/bamPEFragmentSize/fragmentSize.metric.tsv")
        if len(samples)>1:
            file_list.append( ["deepTools_qc/multiBigwigSummary/coverage.bed.npz",
                              "deepTools_qc/plotCorrelation/correlation.pearson.bed_coverage.tsv",
                              "deepTools_qc/plotCorrelation/correlation.spearman.bed_coverage.tsv",
                              "deepTools_qc/plotPCA/PCA.bed_coverage.tsv"] )
        if 'allelic-mapping' in mode:
            file_list.append(["deepTools_qc/plotEnrichment/plotEnrichment_allelic.tsv"])
            if len(samples)>1:
                file_list.append( ["deepTools_qc/multiBigwigSummary/coverage_allelic.bed.npz",
                                   "deepTools_qc/plotCorrelation/correlation.pearson.bed_coverage_allelic.tsv",
                                   "deepTools_qc/plotCorrelation/correlation.spearman.bed_coverage_allelic.tsv",
                                   "deepTools_qc/plotPCA/PCA.bed_coverage_allelic.tsv"] )
        return(file_list)
    else:
        return([])

def run_GenomicContamination():
    if dnaContam:
       file_list = ["GenomicContamination/genomic_contamination_featurecount_report.tsv"]
       return (file_list)
    else:
       return([])

### execute before  starts #####################################################
################################################################################
onstart:
    if "verbose" in config and config["verbose"] and not fromBAM:
        print()
        print("--- Workflow parameters --------------------------------------------------------")
        print("mode:", mode)
        print("samples:", samples)
        print("paired:", pairedEnd)
        print("read extension:", reads)
        print("fastq dir:", fastq_dir)
        print("filter GTF:", filterGTF)
        print("FeatureCounts library type:", libraryType)
        print("Sample sheet:", sampleSheet)
        print("-" * 80, "\n")

        print("--- Environment ----------------------------------------------------------------")
        print("$TMPDIR: ", os.getenv('TMPDIR', ""))
        print("$HOSTNAME: ", os.getenv('HOSTNAME', ""))
        print("-" * 80, "\n")

    elif "verbose" in config and config["verbose"] and fromBAM:
        print("--- Workflow parameters --------------------------------------------------------")
        print("samples:" + str(samples))
        print("input dir:" + indir)
        print("paired:", pairedEnd)
        print("-" * 80)  # , "\n"

        print("--- Environment ----------------------------------------------------------------")
        print("$TMPDIR: " + os.getenv('TMPDIR', ""))
        print("$HOSTNAME: " + os.getenv('HOSTNAME', ""))
        print("-" * 80,)

    if toolsVersion:
        usedEnvs = [CONDA_SHARED_ENV, CONDA_RNASEQ_ENV]
        cf.writeTools(usedEnvs, outdir, "mRNA-seq", maindir)

### main rule ##################################################################
################################################################################

if not fromBAM:
    localrules: FASTQ1, FASTQ2, Salmon_wasabi, sleuth_Salmon


    rule all:
        input:
            run_FastQC(fastqc),
            run_Trimming(trim, fastqc),
            run_alignment_free(),            # Salmon
            run_alignment(),                 # classical mapping + counting
            run_allelesp_mapping(),        # allelic-mapping
            make_nmasked_genome(),
            run_deepTools_qc(),
            run_GenomicContamination(),
            "multiQC/multiqc_report.html"

else:
    localrules: Salmon_wasabi

    rule all:
        input:
            run_alignment(),
            run_deepTools_qc(),
            run_GenomicContamination(),
            "multiQC/multiqc_report.html"

### execute after  finished ####################################################
################################################################################
onsuccess:
    cf.cleanLogs(outdir, cluster_config)
    if "verbose" in config and config["verbose"]:
        print("\n--- RNA-seq workflow finished successfully! ------------------------------------\n")

onerror:
    print("\n !!! ERROR in RNA-seq workflow! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n")
