Snakemake & Sun Grid Engine


The pipeline of this project is computationally expensive, and relies on many different hyperparameters or drop-in algorithms. This makes it the perfect candidate for grid-based parallelization, where different hyperparameter-combinations are executed on different cluster nodes.

The image above depicts the dependency structure of the pipeline, with hyperparameters writen in Caps, and an example configuration for each. As can be seen, the pipeline can be separated into multiple steps, and oftentimes, a hyperparameter is only needed in a later step. This means, that many different hyperparameter-combinations of a later step can be executed independently and in parallel, as soon as the earlier step is done. A good way to write such programs is by using a make-like structure, where every step produces a file as output, which is then used as the input for the respective next step. This allows the result of intermediate steps to be used by subsequent steps. In my thesis, I used Snakemake for this.

Now, our institute has a computing cluster, however the grid management software is the decades-old Sun Grid EngineThe fact that it’s called Oracle Grid Engine since 2011 speaks for the age of this system!. Snakemake is made to run on such clusters, however the support does not reach that far back. Further, the grid has a very restrictive wall-time that kills processes that run too long, including the scheduler. Also I wanted a watcher, that allows to inspect the progress of the individual steps. All of this required heavy customization of the Snakemake-Grid-Profile to deal with the peculiarities. The result of my work is open-sourced at https://github.com/cstenkamp/Snakemake-IKW-SGE-Profile (with some documentation here). Next to this, a lot of work went into making all the steps of the pipeline both heavily parallelized and interruptable, allowing steps to be killed by the grid-watcher and continue working as soon as they are scheduled the next time - the code for this is for example here and used eg. here in the thesiscode-repository.

Scheduler-output

This field depicts how the scheduler can be inspected at runtime - next to also receiving for example telegram-notifications if exceptions occurs or the pipeline ran through:

Scheduler: Task-Id 736317, active for 0:03:47 on ramsauer
Active Jobs:
  Job 736319: does preprocess_descriptions_notranslate for 0:03:17 on phobos with 1 procs.
      Creates siddata2022/de_debug_False/mfauhtcsldp_onlyorig_minwords80/pp_descriptions.json
      Progress: Lemmatizing Descriptions    :  50%|█████     | 10843/21643 [01:26<01:03, 169.88it/s]
  Job 736322: does create_dissim_mat for 0:05:32 on rigel with 3 procs.
      Creates siddata2022/de_debug_False/mfauhcsd2_onlyorig_minwords80/embedding_tfidf/dissim_mat.json
      Progress: Creating dissimilarity matrix:   3%|▎         | 6/200 [00:47<20:55,  6.47s/it]
  Job 736323: does create_candidate_svm for 0:02:45 on beam with 6 procs.
      Creates siddata2022/de_debug_False/mfauhtcsldp_onlyorig_minwords80/embedding_tfidf/mds_200d/tfidf_ppmi_quadratic/featureaxes.json
      Progress: Creating Candidate SVMs [5 procs]:   0%|          | 0/6912 [00:00<?, ?it/s]
  Job 736324: does create_dissim_mat for 0:01:17 on cippy02 with 3 procs.
      Creates siddata2022/de_debug_False/mfauhcsd2_onlyorig_minwords80/embedding_ppmi/dissim_mat.json
      Progress: Creating dissimilarity matrix:   3%|▎         | 6/200 [00:43<20:16,  6.27s/it]
  Job 736327: does create_dissim_mat for 0:01:02 on cippy18 with 3 procs.
      Creates siddata2022/de_debug_False/mfhcsd2_onlyorig_minwords80/embedding_tfidf/dissim_mat.json
      Progress: Creating dissimilarity matrix:   2%|▏         | 3/200 [00:20<22:40,  6.91s/it]
  Job 736329: does create_dissim_mat.83.sh for 0:00:17 on vr6 with 3 procs.
  Job 736330: does create_dissim_mat.100.sh for 0:00:17 on bunda with 3 procs.
  Job 736332: does extract_candidate_terms for 0:00:02 on ocular with 1 procs.
      Creates siddata2022/de_debug_False/mfhcsd2_onlyorig_minwords80/candidate_terms_tfidf.json
Scheduled Jobs:
  Job 736333: will do create_candidate_svm.151.sh. scheduled for 0:26:14
  Job 736334: will do create_candidate_svm.136.sh. scheduled for 0:25:15
Finished Jobs:
  Job 736318: did preprocess_descriptions_notranslate at 00:35:57 on dodo, creating siddata2022/de_debug_False/mfhcsd2_onlyorig_minwords80/pp_descriptions.json
  Job 736321: did preprocess_descriptions_notranslate at 00:36:12 on altair, creating siddata2022/de_debug_False/mfauhcsd2_onlyorig_minwords80/pp_descriptions.json
Failed Jobs:
  Job 736320: failed at preprocess_descriptions_notranslate
  Job 736328: failed at extract_candidate_terms.43.sh