! snakemake --version
8.18.2
! cat Snakefile
OUTDIR = "first_directory"
SNDDIR = "second_directory"
SMP = None

def get_file_names(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    global SMP
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def get_second_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    SMP2, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMP2)

rule all:
    input:
        get_second_files,
        expand("list_of_files_{SAMPLE}.txt",SAMPLE=SMP)

checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """

rule list_all_files:
    input:
        expand(os.path.join(SNDDIR, "{SAMPLE}.tsv"), SAMPLE=SMP)
    output:
        out = "list_of_files_{SAMPLE}.txt"
    shell:
        """
        echo {input} > {output.out}
        """
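In this Snakefile, the {SAMPLE} wildcard in rule all and in rule list_all_files is filled from the global variable SMP. SMP is initialized to None, and the expand() calls that use it are evaluated when the Snakefile is parsed, before the checkpoint has run, so they produce targets for the literal sample name None. SMP is reassigned only later, inside get_file_names, once the checkpoint make_five_files has finished. At that point it holds the list of file names found in the OUTDIR directory, but the targets that were already expanded do not change.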
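The effect of filling a pattern while SMP is still None can be reproduced in plain Python. The sketch below is not Snakemake code: expand_like is a hypothetical stand-in for expand(), used only to illustrate that a value read at definition time does not change when the variable is reassigned later.

```python
def expand_like(pattern, key, values):
    """Hypothetical stand-in for expand(): fill one {key} placeholder per value."""
    # A single non-list value (including None) is treated as a one-element list.
    if not isinstance(values, (list, tuple)):
        values = [values]
    return [pattern.replace("{" + key + "}", str(v)) for v in values]


SMP = None  # placeholder value, as at the top of the Snakefile

# Evaluated immediately, while SMP is still None.
targets_at_parse_time = expand_like("list_of_files_{SAMPLE}.txt", "SAMPLE", SMP)

# A later reassignment (as done inside get_file_names) does not touch
# the list that was already built.
SMP = ["10536", "16393"]

print(targets_at_parse_time)  # ['list_of_files_None.txt']
```

Under this assumption, the only way to see the updated names is to build the list inside a function that is called after the checkpoint has finished.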
!snakemake -c 1 --debug-dag
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
candidate job all
    wildcards:
candidate job make_five_files
    wildcards:
selected job make_five_files
    wildcards:
candidate job list_all_files
    wildcards: SAMPLE=None
candidate job copy_files
    wildcards: SAMPLE=None
candidate job make_five_files
    wildcards:
selected job make_five_files
    wildcards:
selected job copy_files
    wildcards: SAMPLE=None
selected job list_all_files
    wildcards: SAMPLE=None
selected job all
    wildcards:
candidate job copy_files
    wildcards: SAMPLE=None
selected job copy_files
    wildcards: SAMPLE=None
candidate job all
    wildcards:
candidate job copy_files
    wildcards: SAMPLE=10536
selected job copy_files
    wildcards: SAMPLE=10536
candidate job copy_files
    wildcards: SAMPLE=16393
selected job copy_files
    wildcards: SAMPLE=16393
candidate job copy_files
    wildcards: SAMPLE=16815
selected job copy_files
    wildcards: SAMPLE=16815
candidate job copy_files
    wildcards: SAMPLE=2323
selected job copy_files
    wildcards: SAMPLE=2323
candidate job copy_files
    wildcards: SAMPLE=6362
selected job copy_files
    wildcards: SAMPLE=6362
candidate job list_all_files
    wildcards: SAMPLE=None
selected job list_all_files
    wildcards: SAMPLE=None
selected job all
    wildcards:
Nothing to be done (all requested files are present and up to date).
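The log shows that once the output of make_five_files is available, copy_files is resolved with the real sample names (for example, 10536 and 16393). However, copy_files and list_all_files are also selected with SAMPLE=None, which is not correct behavior: the directory listings below contain the stray files list_of_files_None.txt and second_directory/None.tsv.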
!ls
SM2.ipynb Snakefile first_directory list_of_files_None.txt second_directory
!ls first_directory/
10536.txt 16393.txt 16815.txt 2323.txt 6362.txt
!ls second_directory/
10536.tsv 16393.tsv 16815.tsv 2323.tsv 6362.tsv None.tsv
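The next version replaces the global variable with a function that returns the list of file names found in the checkpoint output directory. The function is passed to expand() in rule all, and the OUTDIR directory is also listed as an input of rule all.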
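The value of such a function is that it reads the directory each time it is called, so a call made after the files exist returns the real names. A minimal sketch in plain Python, outside Snakemake, assuming a temporary directory and illustrative file names (list_sample_names is a hypothetical helper, not part of the Snakefile):

```python
import os
import tempfile


def list_sample_names(directory):
    """Return the base names (without extension) of the .txt files in `directory`."""
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(".txt")
    )


with tempfile.TemporaryDirectory() as tmp:
    # First call: the directory exists but is still empty.
    before = list_sample_names(tmp)

    # Create two .txt files and one unrelated file (names are illustrative).
    for name in ("10576.txt", "23600.txt", "notes.log"):
        open(os.path.join(tmp, name), "w").close()

    # Second call: the same function now sees the new files.
    after = list_sample_names(tmp)

print(before)  # []
print(after)   # ['10576', '23600']
```

The sketch uses os.path.splitext rather than split(".")[0], so a file name containing extra dots would keep everything before the final extension. This is a design choice of the sketch, not of the Snakefile.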
! cat Snakefile
import os

OUTDIR = "first_directory"

def get_txt_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    print([file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')])
    return [file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')]

rule all:
    input:
        OUTDIR,
        expand(f"{OUTDIR}/"+"{SMP}.doc", SMP=get_txt_files)

checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        "{SMP}.txt"
    output:
        out = "{SMP}.tsv"
    shell:
        """
        touch {output.out}
        """

rule copy2:
    input:
        "{SMP}.tsv"
    output:
        "{SMP}.doc"
    shell:
        """
        touch {output}
        """
! snakemake -c 1 -F
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                count
---------------  -------
all                    1
make_five_files        1
total                  2

Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localcheckpoint make_five_files:
    output: first_directory
    jobid: 1
    reason: Forced execution
    resources: tmpdir=/tmp
DAG of jobs will be updated after completion.
[Thu Aug 22 17:43:06 2024]
Finished job 1.
1 of 2 steps (50%) done
['10576', '23600', '21071', '24382', '17800']
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/10576.txt
    output: first_directory/10576.tsv
    jobid: 5
    reason: Forced execution
    wildcards: SMP=first_directory/10576
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 5.
2 of 12 steps (17%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy2:
    input: first_directory/10576.tsv
    output: first_directory/10576.doc
    jobid: 4
    reason: Forced execution
    wildcards: SMP=first_directory/10576
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 4.
3 of 12 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/23600.txt
    output: first_directory/23600.tsv
    jobid: 7
    reason: Forced execution
    wildcards: SMP=first_directory/23600
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 7.
4 of 12 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy2:
    input: first_directory/23600.tsv
    output: first_directory/23600.doc
    jobid: 6
    reason: Forced execution
    wildcards: SMP=first_directory/23600
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 6.
5 of 12 steps (42%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/21071.txt
    output: first_directory/21071.tsv
    jobid: 9
    reason: Forced execution
    wildcards: SMP=first_directory/21071
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 9.
6 of 12 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy2:
    input: first_directory/21071.tsv
    output: first_directory/21071.doc
    jobid: 8
    reason: Forced execution
    wildcards: SMP=first_directory/21071
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 8.
7 of 12 steps (58%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/17800.txt
    output: first_directory/17800.tsv
    jobid: 13
    reason: Forced execution
    wildcards: SMP=first_directory/17800
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 13.
8 of 12 steps (67%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy2:
    input: first_directory/17800.tsv
    output: first_directory/17800.doc
    jobid: 12
    reason: Forced execution
    wildcards: SMP=first_directory/17800
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 12.
9 of 12 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/24382.txt
    output: first_directory/24382.tsv
    jobid: 11
    reason: Forced execution
    wildcards: SMP=first_directory/24382
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 11.
10 of 12 steps (83%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy2:
    input: first_directory/24382.tsv
    output: first_directory/24382.doc
    jobid: 10
    reason: Forced execution
    wildcards: SMP=first_directory/24382
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 10.
11 of 12 steps (92%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule all:
    input: first_directory, first_directory/10576.doc, first_directory/23600.doc, first_directory/21071.doc, first_directory/24382.doc, first_directory/17800.doc
    jobid: 0
    reason: Forced execution
    resources: tmpdir=/tmp
[Thu Aug 22 17:43:06 2024]
Finished job 0.
12 of 12 steps (100%) done
Complete log: .snakemake/log/2024-08-22T174306.319827.snakemake.log
! ls
first_directory Snakefile Untitled.ipynb
! ls first_directory/
10576.doc 17800.doc 21071.doc 23600.doc 24382.doc 10576.tsv 17800.tsv 21071.tsv 23600.tsv 24382.tsv 10576.txt 17800.txt 21071.txt 23600.txt 24382.txt
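Next, the derived files are created in a separate directory, whose name is set by the SNDDIR variable at the top of the Snakefile. The input and output patterns of copy_files and copy2 now carry an explicit directory prefix, so the SMP wildcard matches only the bare sample name (for example, SMP=5114) rather than a path such as first_directory/10576, as it did in the previous run.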
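Why the wildcard value depends on the prefix can be approximated with a regular expression. The sketch below assumes that a wildcard without constraints matches one or more arbitrary characters, including slashes. match_wildcard is a hypothetical helper that handles a single wildcard only, and is not Snakemake's own matching code.

```python
import re


def match_wildcard(pattern, path, name="SMP"):
    """Hypothetical helper: match `path` against a pattern holding one {name} wildcard."""
    prefix, suffix = pattern.split("{" + name + "}")
    # The wildcard becomes a named group that accepts any non-empty string.
    regex = re.escape(prefix) + "(?P<" + name + ">.+)" + re.escape(suffix)
    found = re.fullmatch(regex, path)
    return found.group(name) if found else None


# Pattern without a directory prefix: the directory ends up inside the wildcard.
bare = match_wildcard("{SMP}.doc", "first_directory/10576.doc")

# Pattern with a fixed prefix: only the sample name is captured.
prefixed = match_wildcard("second_directory/{SMP}.doc", "second_directory/5114.doc")

print(bare)      # first_directory/10576
print(prefixed)  # 5114
```

The two results correspond to the wildcard values reported in the logs of the two runs (SMP=first_directory/10576 and SMP=5114).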
! cat Snakefile
import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"

def get_txt_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    print([file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')])
    return [file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')]

rule all:
    input:
        OUTDIR,
        expand(f"{SNDDIR}/"+"{SMP}.doc", SMP=get_txt_files)

checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        f"{OUTDIR}/"+"{SMP}.txt"
    output:
        out = f"{SNDDIR}/"+"{SMP}.tsv"
    shell:
        """
        touch {output.out}
        """

rule copy2:
    input:
        f"{SNDDIR}/"+"{SMP}.tsv"
    output:
        f"{SNDDIR}/"+"{SMP}.doc"
    shell:
        """
        touch {output}
        """
! snakemake -c 1 -F
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                count
---------------  -------
all                    1
make_five_files        1
total                  2

Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:10 2024]
localcheckpoint make_five_files:
    output: first_directory
    jobid: 1
    reason: Forced execution
    resources: tmpdir=/tmp
DAG of jobs will be updated after completion.
[Thu Aug 22 17:52:11 2024]
Finished job 1.
1 of 2 steps (50%) done
['5114', '11823', '32502', '6262', '21833']
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/5114.txt
    output: second_directory/5114.tsv
    jobid: 5
    reason: Forced execution
    wildcards: SMP=5114
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 5.
2 of 12 steps (17%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy2:
    input: second_directory/5114.tsv
    output: second_directory/5114.doc
    jobid: 4
    reason: Forced execution
    wildcards: SMP=5114
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 4.
3 of 12 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/11823.txt
    output: second_directory/11823.tsv
    jobid: 7
    reason: Forced execution
    wildcards: SMP=11823
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 7.
4 of 12 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy2:
    input: second_directory/11823.tsv
    output: second_directory/11823.doc
    jobid: 6
    reason: Forced execution
    wildcards: SMP=11823
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 6.
5 of 12 steps (42%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/32502.txt
    output: second_directory/32502.tsv
    jobid: 9
    reason: Forced execution
    wildcards: SMP=32502
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 9.
6 of 12 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy2:
    input: second_directory/32502.tsv
    output: second_directory/32502.doc
    jobid: 8
    reason: Forced execution
    wildcards: SMP=32502
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 8.
7 of 12 steps (58%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/21833.txt
    output: second_directory/21833.tsv
    jobid: 13
    reason: Forced execution
    wildcards: SMP=21833
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 13.
8 of 12 steps (67%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy2:
    input: second_directory/21833.tsv
    output: second_directory/21833.doc
    jobid: 12
    reason: Forced execution
    wildcards: SMP=21833
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 12.
9 of 12 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/6262.txt
    output: second_directory/6262.tsv
    jobid: 11
    reason: Forced execution
    wildcards: SMP=6262
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 11.
10 of 12 steps (83%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy2:
    input: second_directory/6262.tsv
    output: second_directory/6262.doc
    jobid: 10
    reason: Forced execution
    wildcards: SMP=6262
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 10.
11 of 12 steps (92%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule all:
    input: first_directory, second_directory/5114.doc, second_directory/11823.doc, second_directory/32502.doc, second_directory/6262.doc, second_directory/21833.doc
    jobid: 0
    reason: Forced execution
    resources: tmpdir=/tmp
[Thu Aug 22 17:52:11 2024]
Finished job 0.
12 of 12 steps (100%) done
Complete log: .snakemake/log/2024-08-22T175210.669728.snakemake.log
! ls
first_directory second_directory Snakefile Untitled.ipynb
! ls first_directory/
11823.txt 21833.txt 32502.txt 5114.txt 6262.txt
! ls second_directory/
11823.doc 21833.doc 32502.doc 5114.doc 6262.doc 11823.tsv 21833.tsv 32502.tsv 5114.tsv 6262.tsv