Note: In this tutorial, we will write a simple PyRosetta script to disk, and demonstrate how to run it in parallel using GNU parallel. This Jupyter notebook uses parallelization and is not meant to be executed within a Google Colab environment.
Please see setup instructions in Chapter 16.00
import logging
logging.basicConfig(level=logging.INFO)
import os
import sys
if 'google.colab' in sys.modules:
    print("This Jupyter notebook uses parallelization and is therefore not set up for the Google Colab environment.")
    sys.exit(0)
%%file outputs/mutate_residue_from_command_line.py
import argparse
import os
import pyrosetta
import pyrosetta.toolbox
import uuid


__author__ = "My Name"
__email__ = "name@email.com"


def main(target=None, new_res=None):
    """Example function to run my custom PyRosetta script."""

    # Initialize PyRosetta with custom flags
    pyrosetta.init("-ignore_unrecognized_res 1 -renumber_pdb 1 -mute all")

    # Setup pose
    pose = pyrosetta.pose_from_file("inputs/5JG9.clean.pdb")

    # Setup directory structure
    main_dir = os.getcwd()
    unique_dir = os.path.join("outputs", "testing_" + uuid.uuid4().hex)
    if not os.path.isdir(unique_dir):
        os.mkdir(unique_dir)
    os.chdir(unique_dir)

    # Create scorefunction
    scorefxn = pyrosetta.create_score_function("ref2015_cart")

    # PyRosetta design protocol
    keep_chA = pyrosetta.rosetta.protocols.grafting.simple_movers.KeepRegionMover(
        res_start=str(pose.chain_begin(1)), res_end=str(pose.chain_end(1))
    )
    keep_chA.apply(pose)

    mutate = pyrosetta.rosetta.protocols.simple_moves.MutateResidue(
        target=target, new_res=new_res
    )
    mutate.apply(pose)

    mm = pyrosetta.rosetta.core.kinematics.MoveMap()
    mm.set_bb(True)
    mm.set_chi(True)
    min_mover = pyrosetta.rosetta.protocols.minimization_packing.MinMover()
    min_mover.set_movemap(mm)
    min_mover.score_function(scorefxn)
    min_mover.min_type("lbfgs_armijo_nonmonotone")
    min_mover.cartesian(True)
    min_mover.tolerance(0.01)
    min_mover.max_iter(200)
    min_mover.apply(pose)
    total_score = scorefxn(pose)

    # Setup outputs
    pdb_output_filename = "_".join(["5JG9.clean", str(target), str(new_res)]) + ".pdb"
    pyrosetta.dump_pdb(pose, pdb_output_filename)

    # Append scores to scorefile
    pyrosetta.toolbox.py_jobdistributor.output_scorefile(
        pose=pose, pdb_name="5JG9.clean", current_name=pdb_output_filename,
        scorefilepath="score.fasc", scorefxn=scorefxn,
        nstruct=1, native_pose=None, additional_decoy_info=None, json_format=True
    )
    os.chdir(main_dir)


if __name__ == "__main__":
    # Declare parser object for managing input options
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", type=int,
                        help="Target residue to mutate as integer.")
    parser.add_argument("-r", "--new_res", type=str,
                        help="Three letter amino acid code to which to mutate target.")
    args = parser.parse_args()

    # Run protocol
    main(target=args.target, new_res=args.new_res)
Writing outputs/mutate_residue_from_command_line.py
Now we will run this script in parallel in several different ways, to demonstrate different styles of job submission.
On your laptop, you have access to as many cores as are on your machine. On a high-performance computing cluster with SLURM scheduling, you first have to request as many cores as you want to run on in an interactive login session:
qlogin -c 8 --mem=16g
will reserve 8 CPU cores and 16 GB of RAM for you and start your session on a node that has available resources.
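Before picking a job count for GNU parallel, it helps to know how many cores your session actually has. This is a minimal sketch (the helper name usable_cores is hypothetical, not part of the tutorial): it prefers the SLURM allocation when the SLURM_CPUS_PER_TASK environment variable is set, and falls back to the machine's total core count on a laptop.

```python
import os

def usable_cores(default=1):
    """Return the number of cores this session may use.

    Prefers the SLURM allocation (SLURM_CPUS_PER_TASK) when present,
    otherwise falls back to the machine's total core count.
    """
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus is not None:
        return int(slurm_cpus)
    return os.cpu_count() or default

print(usable_cores())
```

The value returned here is what you would pass to GNU parallel's -j flag below.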
Then, we need to write a run file to disk specifying our input parameters:
with open("outputs/run_file_parallel_interactive.txt", "w") as f:
    for target in [2, 4, 6, 8, 10]:
        for new_res in ["ALA", "TRP"]:
            f.write("{0} outputs/mutate_residue_from_command_line.py -t {1} -r {2}\n".format(sys.executable, target, new_res))
Note: it's always a good idea to run just one command first to make sure there aren't any errors in your script!
Note: if you don't specify the correct python executable, activate the correct conda environment in your interactive session first.
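To script that sanity check, you can execute just the first line of the run file before handing the whole list to GNU parallel. This is a minimal sketch (run_first_command is a hypothetical helper, not part of the tutorial):

```python
import shlex
import subprocess

def run_first_command(run_file):
    """Execute only the first command in a run file and return its exit code."""
    with open(run_file) as f:
        first = f.readline().strip()
    # shlex.split handles quoting the same way a shell would
    result = subprocess.run(shlex.split(first))
    return result.returncode

# e.g. run_first_command("outputs/run_file_parallel_interactive.txt")
```

If the exit code is 0, the remaining commands (which differ only in their -t and -r arguments) are likely to run cleanly too.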
Now submit outputs/run_file_parallel_interactive.txt to GNU parallel from the command line in your interactive session:
cat outputs/run_file_parallel_interactive.txt | parallel -j 8 --no-notice &
Note: The parallel executable is usually located at /usr/bin/parallel, but the full path may differ on your computer. For installation info, visit: https://www.gnu.org/software/parallel/
We use GNU parallel again, but this time there is no need to pre-request server allocations. We can submit jobs to the Slurm scheduler directly from within this Jupyter Notebook or from the command line!
Useful background information:
"Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. ... As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work." Read further: https://slurm.schedmd.com/overview.html
With the Slurm scheduler we will use the sbatch command, therefore it may be useful to review the available options: https://slurm.schedmd.com/sbatch.html
First, write a SLURM submission script to disk specifying the job requirements. In this example, our conda environment is called pyrosetta-bootcamp:
with open("outputs/sbatch_parallel.sh", "w") as f:
    f.write("#!/bin/bash \n")  # Bash script
    f.write("#SBATCH -p short \n")  # Specify "short" partition/queue
    f.write("#SBATCH -n 8 \n")  # Specify eight cores
    f.write("#SBATCH -N 1 \n")  # Specify one node
    f.write("#SBATCH --mem=16g \n")  # Specify 16GB RAM over eight cores
    f.write("#SBATCH -o sbatch_parallel.log \n")  # Specify output log filename
    f.write("conda activate pyrosetta-bootcamp \n")  # Activate conda environment
    f.write("cat outputs/run_file_parallel_interactive.txt | /usr/bin/parallel -j 8 --no-notice \n")  # Submit jobs to GNU parallel
Then, submit outputs/sbatch_parallel.sh to the SLURM scheduler with the sbatch command:
if not os.getenv("DEBUG"):
    !sbatch outputs/sbatch_parallel.sh
Submitted batch job 12942806
We can then periodically check on the status of our jobs:
!squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12942806 short sbatch.s klimaj PD 0:00 1 (Priority) 12931193 interacti zsh klimaj R 2:17:02 1 dig1 12931183 interacti jupyter- klimaj R 2:17:37 1 dig5 12935447 interacti zsh klimaj R 58:46 1 dig1
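If you would rather poll job status from Python, the tabular output above can be parsed directly. This is a sketch assuming squeue's default column layout, in which the fifth field (ST) is the job state; count_jobs is a hypothetical helper, not part of the tutorial:

```python
def count_jobs(squeue_output, state=None):
    """Count jobs in squeue's default tabular output.

    If state is given (e.g. "PD" for pending, "R" for running),
    only jobs in that state are counted.
    """
    lines = squeue_output.strip().splitlines()
    count = 0
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        if state is None or fields[4] == state:
            count += 1
    return count
```

You could call this on the captured output of squeue -u $USER (for example via subprocess) inside a sleep loop, and move on once count_jobs(...) reaches zero.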
This time, we submit each job individually using the sbatch command directly on our PyRosetta script.
Warning: do not submit more than ~1000 jobs with this method or risk clogging the SLURM scheduler!
First, copy your python executable and paste it on the first line of the PyRosetta script after #!, followed by #SBATCH directives for each job:
sys.executable
'/home/klimaj/anaconda3/envs/pyrosetta-dev/bin/python'
%%file outputs/mutate_residue_from_sbatch.py
#!/home/klimaj/anaconda3/envs/pyrosetta-bootcamp/bin/python
#SBATCH -p short
#SBATCH -n 1
#SBATCH --mem=2g
#SBATCH -o sbatch.log
import argparse
import os
import pyrosetta
import pyrosetta.toolbox
import uuid


__author__ = "My Name"
__email__ = "name@email.com"


def main(target=None, new_res=None):
    """Example function to run my custom PyRosetta script."""

    # Initialize PyRosetta with custom flags
    pyrosetta.init("-ignore_unrecognized_res 1 -renumber_pdb 1 -mute all")

    # Setup pose
    pose = pyrosetta.pose_from_file("inputs/5JG9.clean.pdb")

    # Setup directory structure
    main_dir = os.getcwd()
    unique_dir = os.path.join("outputs", "testing_" + uuid.uuid4().hex)
    if not os.path.isdir(unique_dir):
        os.mkdir(unique_dir)
    os.chdir(unique_dir)

    # Create scorefunction
    scorefxn = pyrosetta.create_score_function("ref2015_cart")

    # PyRosetta design protocol
    keep_chA = pyrosetta.rosetta.protocols.grafting.simple_movers.KeepRegionMover(
        res_start=str(pose.chain_begin(1)), res_end=str(pose.chain_end(1))
    )
    keep_chA.apply(pose)

    mutate = pyrosetta.rosetta.protocols.simple_moves.MutateResidue(
        target=target, new_res=new_res
    )
    mutate.apply(pose)

    mm = pyrosetta.rosetta.core.kinematics.MoveMap()
    mm.set_bb(True)
    mm.set_chi(True)
    min_mover = pyrosetta.rosetta.protocols.minimization_packing.MinMover()
    min_mover.set_movemap(mm)
    min_mover.score_function(scorefxn)
    min_mover.min_type("lbfgs_armijo_nonmonotone")
    min_mover.cartesian(True)
    min_mover.tolerance(0.01)
    min_mover.max_iter(200)
    min_mover.apply(pose)
    total_score = scorefxn(pose)

    # Setup outputs
    pdb_output_filename = "_".join(["5JG9.clean", str(target), str(new_res)]) + ".pdb"
    pyrosetta.dump_pdb(pose, pdb_output_filename)

    # Append scores to scorefile
    pyrosetta.toolbox.py_jobdistributor.output_scorefile(
        pose=pose, pdb_name="5JG9.clean", current_name=pdb_output_filename,
        scorefilepath="score.fasc", scorefxn=scorefxn,
        nstruct=1, native_pose=None, additional_decoy_info=None, json_format=True
    )
    os.chdir(main_dir)


if __name__ == "__main__":
    # Declare parser object for managing input options
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", type=int,
                        help="Target residue to mutate as integer.")
    parser.add_argument("-r", "--new_res", type=str,
                        help="Three letter amino acid code to which to mutate target.")
    args = parser.parse_args()

    # Run protocol
    main(target=args.target, new_res=args.new_res)
Writing outputs/mutate_residue_from_sbatch.py
Then, loop over your input parameters, submitting the PyRosetta script to the scheduler with the sbatch command:
if not os.getenv("DEBUG"):
    for target in [2, 4, 6, 8, 10]:
        for new_res in ["ALA", "TRP"]:
            !sbatch ./outputs/mutate_residue_from_sbatch.py -t $target -r $new_res
Submitted batch job 12946283 Submitted batch job 12946284 Submitted batch job 12946285 Submitted batch job 12946286 Submitted batch job 12946287 Submitted batch job 12946288 Submitted batch job 12946289 Submitted batch job 12946290 Submitted batch job 12946291 Submitted batch job 12946292
We can then periodically check on the status of our jobs:
!squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12946286 short mutate_r klimaj PD 0:00 1 (Resources) 12946287 short mutate_r klimaj PD 0:00 1 (Priority) 12946288 short mutate_r klimaj PD 0:00 1 (Priority) 12946289 short mutate_r klimaj PD 0:00 1 (Priority) 12946290 short mutate_r klimaj PD 0:00 1 (Priority) 12946291 short mutate_r klimaj PD 0:00 1 (Priority) 12946292 short mutate_r klimaj PD 0:00 1 (Priority) 12931193 interacti zsh klimaj R 2:19:28 1 dig1 12931183 interacti jupyter- klimaj R 2:20:03 1 dig5 12935447 interacti zsh klimaj R 1:01:12 1 dig1 12946283 short mutate_r klimaj R 0:00 1 dig73 12946284 short mutate_r klimaj R 0:00 1 dig87 12946285 short mutate_r klimaj R 0:00 1 dig100
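Once all jobs finish, each unique testing_* directory contains its own score.fasc. A sketch for gathering the results into one ranked list (this assumes json_format=True wrote one JSON object per line, and that the record keys, e.g. "total_score" and "decoy", match what output_scorefile emitted; gather_scores is a hypothetical helper, not part of the tutorial):

```python
import glob
import json
import os

def gather_scores(outputs_dir="outputs", scorefile="score.fasc"):
    """Collect score records from every testing_* subdirectory.

    Each line of a JSON-format scorefile is parsed as one record;
    records are returned sorted by total_score (lowest, i.e. best, first).
    """
    records = []
    pattern = os.path.join(outputs_dir, "testing_*", scorefile)
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return sorted(records, key=lambda r: r.get("total_score", float("inf")))
```

Calling gather_scores() after the sweep above would let you pick out, say, the mutation with the lowest total_score across all ten decoys.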