This notebook contains material from PyRosetta; content is available on Github.

PyRosettaCluster Tutorial 1B. Reproduce simple protocol

PyRosettaCluster Tutorial 1B uses the pyrosetta.distributed.cluster Python module to reproduce a decoy generated by a PyRosetta simulation previously run in PyRosettaCluster Tutorial 1A, using only an input .pdb file and the original user-provided PyRosetta protocol(s).

In PyRosettaCluster Tutorial 1A, you used PyRosettaCluster to apply a PyRosetta protocol to an input .pdb file, and generated several output .pdb files. Each output .pdb file contains information needed to exactly reproduce it.

Warning: This notebook uses pyrosetta.distributed.viewer code, which runs in Jupyter Notebook and might not render if you are using Jupyter Lab.

Note: This Jupyter notebook uses parallelization and is not meant to be executed within a Google Colab environment.

Note: This Jupyter notebook requires the PyRosetta distributed layer, which is obtained either by building PyRosetta with the --serialization flag or by installing PyRosetta from the RosettaCommons conda channel.

Please see Chapter 16.00 for setup instructions.

Note: This Jupyter notebook is intended to be run within Jupyter Lab, but may still be run as a standalone Jupyter notebook.

1. Import packages

In [ ]:

import bz2
import json
import glob
import logging
import os
import pandas as pd
import pyrosetta
import pyrosetta.distributed.io as io
import pyrosetta.distributed.viewer as viewer

from pyrosetta.distributed.cluster import PyRosettaCluster, reproduce

logging.basicConfig(level=logging.INFO)

2. Initialize a compute cluster using `dask`

See Tutorial 1A to review:

Click the "Dask" tab in Jupyter Lab (arrow, left)
Click the "+ NEW" button to launch a new compute cluster (arrow, lower)
Once the cluster has started, click the brackets to "inject client code" for the cluster into your notebook

Inject client code here, then run the cell:

In [ ]:

if not os.getenv("DEBUG"):
    from dask.distributed import Client

    client = Client("tcp://127.0.0.1:40329")
else:
    client = None
client

3. Re-define or import the original user-provided PyRosetta protocol:

The sha1 attribute of PyRosettaCluster ensures that you have committed all of your changes to your git repository before executing the original simulation. The original sha1 attribute of PyRosettaCluster was captured in the output decoy .pdb file, so running the reproduce function ensures that you have checked out the same git SHA1 hash before reproducing the simulation. In this way, my_protocol remains statically captured at the git SHA1 hash from the original simulation. However, you may always update my_protocol, commit your changes to your git repository, and re-run the simulation, because the sha1 attribute of PyRosettaCluster automatically detects the new git SHA1 hash in your git repository.

In [ ]:

if not os.getenv("DEBUG"):
    from additional_scripts.my_protocols import my_protocol
    client.upload_file("additional_scripts/my_protocols.py") # This sends a local file up to all worker nodes.
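
To make the idea behind the sha1 check concrete, the following minimal sketch asks git for the current commit hash and for any uncommitted changes. It is an illustration only: the helper names `is_clean` and `current_git_state` are hypothetical and are not part of PyRosettaCluster, which performs its own checks internally.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ""


def current_git_state(repo_dir: str = ".") -> tuple:
    """Return (sha1, clean) for the git repository at `repo_dir`.

    Hypothetical helper for illustration; requires `git` on the PATH.
    """
    sha1 = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return sha1, is_clean(status)
```

Calling `current_git_state()` from inside your repository before launching a simulation shows which commit would be recorded and whether modified or untracked files are still pending.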

4. Reproduce the original decoy:

The simulation in Tutorial 1A generated four decoys (because nstruct=4 in the original simulation). Let's say we'd like to reproduce the decoy with the lowest energy. First, let's inspect the results with the pandas library:

In [ ]:

if not os.getenv("DEBUG"):
    original_results = glob.glob(os.path.join(os.getcwd(), "outputs_1A", "decoys", "*", "*.pdb.bz2"))

    data = {}
    for original_result in original_results:
        with open(original_result, "rb") as f:
            pdbstring = bz2.decompress(f.read()).decode()
            for line in reversed(pdbstring.split("\n")):
                remark = "REMARK PyRosettaCluster: "
                if line.startswith(remark):
                    data[original_result] = json.loads(line.split(remark)[-1])["scores"]
                    break

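    # Rows are decoy file paths; columns are score terms.
    df = pd.DataFrame.from_dict(data, orient="index")
    print(df)  # A bare `df` inside an `if` block is not rendered by Jupyter.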
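
The parsing step in the cell above can be tried without any PyRosetta output. The sketch below builds a synthetic, bz2-compressed PDB string that ends with a `REMARK PyRosettaCluster: ` line, then recovers the "scores" entry from it. The ATOM line, the score value, and the function name `scores_from_pdbstring` are made up for illustration; real decoy records contain more than this.

```python
import bz2
import json

REMARK = "REMARK PyRosettaCluster: "


def scores_from_pdbstring(pdbstring: str) -> dict:
    """Return the "scores" entry from the last PyRosettaCluster remark line."""
    for line in reversed(pdbstring.splitlines()):
        if line.startswith(REMARK):
            # Everything after the remark prefix is a JSON document.
            return json.loads(line[len(REMARK):])["scores"]
    raise ValueError("No PyRosettaCluster remark line found")


# Synthetic stand-in for a decoy file; the ATOM line and score are made up.
record = {"scores": {"total_score": -123.4}}
fake_pdb = (
    "ATOM      1  N   MET A   1       0.000   0.000   0.000  1.00  0.00           N\n"
    + REMARK + json.dumps(record) + "\n"
)
compressed = bz2.compress(fake_pdb.encode())

scores = scores_from_pdbstring(bz2.decompress(compressed).decode())
print(scores)
```

Searching from the end of the file finds the record quickly, because the remark line is written after the coordinate section.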

Now locate the decoy with the lowest Rosetta total_score to reproduce:

In [ ]:

if not os.getenv("DEBUG"):
    decoy_to_reproduce = df.sort_values(by="total_score", ascending=True).index[0]
    print(decoy_to_reproduce)  # A bare expression inside an `if` block is not rendered by Jupyter.
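
If pandas is not needed for anything else, the same selection can be made directly on the `data` dictionary. This is a small sketch with made-up file names and scores, not output from Tutorial 1A.

```python
# Made-up paths and scores standing in for the `data` dictionary built above.
data = {
    "decoy_a.pdb.bz2": {"total_score": -250.1},
    "decoy_b.pdb.bz2": {"total_score": -262.7},
    "decoy_c.pdb.bz2": {"total_score": -241.9},
}

# Pick the key whose total_score is smallest (most favorable).
lowest = min(data, key=lambda path: data[path]["total_score"])
print(lowest)
```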

5. Launch the reproduction simulation using `reproduce()`:

Reproducing the decoy is accomplished with the reproduce() function of the pyrosetta.distributed.cluster module. This function requires the .pdb or .pdb.bz2 file to reproduce, passed as input_file. Alternatively, a scorefile with full simulation records and a decoy_name may be provided to reproduce() instead of the .pdb or .pdb.bz2 file. The user-provided PyRosetta protocol(s) must be defined or imported and passed to reproduce() as the protocols argument. The user is responsible for supplying the same protocol that was used in the original simulation! Additionally, any supplied instance_kwargs will override the corresponding PyRosettaCluster instance attributes from the input_file or scorefile. This may be useful when, for example, you want to change your cluster configuration while reproducing a decoy.

In [ ]:

if not os.getenv("DEBUG"):
    output_path = os.path.join(os.getcwd(), "outputs_1B")

    reproduce(
        input_file=decoy_to_reproduce,
        input_packed_pose=None, # Optional, if you used the `input_packed_pose` attribute of `PyRosettaCluster` in the original simulation
        client=client, # Optional
        instance_kwargs={"output_path": output_path, "nstruct": 1}, # Specify new output path, and set `nstruct` to 1 to reproduce the decoy only once. 
        protocols=[my_protocol],
    )
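
The two input modes described above (a decoy file, or a scorefile plus a decoy name) can be kept apart with a small wrapper. The sketch below only assembles keyword arguments and does not call PyRosetta. The keyword names `scorefile` and `decoy_name` follow the description above; the helper `build_reproduce_kwargs`, the placeholder protocol, the scorefile path, and the decoy name are all hypothetical, and the validation is this sketch's own convention rather than something `reproduce()` is known to enforce.

```python
def build_reproduce_kwargs(protocols, input_file=None, scorefile=None,
                           decoy_name=None, instance_kwargs=None):
    """Assemble keyword arguments for `reproduce()` (hypothetical helper).

    Exactly one input mode must be chosen: a decoy file, or a scorefile
    together with a decoy name.
    """
    use_file = input_file is not None
    use_scorefile = scorefile is not None or decoy_name is not None
    if use_file == use_scorefile:
        raise ValueError("Pass either input_file, or scorefile with decoy_name")
    if use_scorefile and (scorefile is None or decoy_name is None):
        raise ValueError("scorefile and decoy_name must be given together")
    kwargs = {"protocols": list(protocols)}
    if use_file:
        kwargs["input_file"] = input_file
    else:
        kwargs["scorefile"] = scorefile
        kwargs["decoy_name"] = decoy_name
    if instance_kwargs:
        kwargs["instance_kwargs"] = dict(instance_kwargs)
    return kwargs


def placeholder_protocol(packed_pose, **kwargs):
    """Stand-in for the real protocol, which must match the original run."""
    return packed_pose


kwargs = build_reproduce_kwargs(
    protocols=[placeholder_protocol],
    scorefile="outputs_1A/scores.json",      # hypothetical path
    decoy_name="decoy_name_from_scorefile",  # hypothetical name
    instance_kwargs={"output_path": "outputs_1B", "nstruct": 1},
)
# With PyRosetta installed and the real protocol supplied: reproduce(**kwargs)
```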

6. Visualize the reproduced decoy:

In [ ]:

if not os.getenv("DEBUG"):
    reproduced_results = glob.glob(os.path.join(output_path, "decoys", "*", "*.pdb.bz2"))
    assert len(reproduced_results) == 1
    with open(reproduced_results[0], "rb") as f:
        reproduced_packed_pose = io.pose_from_pdbstring(bz2.decompress(f.read()).decode())

In [ ]:

if not os.getenv("DEBUG"):
    view = viewer.init(reproduced_packed_pose, window_size=(800, 600))
    view.add(viewer.setStyle())
    view.add(viewer.setStyle(colorscheme="whiteCarbon", radius=0.25))
    view.add(viewer.setHydrogenBonds())
    view.add(viewer.setHydrogens(polar_only=True))
    view.add(viewer.setDisulfides(radius=0.25))
    view()

7. Optionally, perform sanity checks to confirm that the reproduced decoy is identical to the original decoy:

PyRosetta trajectories are deterministic given the input random number generator seed(s)!
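
As an analogy only, the same property can be seen with Python's standard `random` module: a trajectory driven by a seeded generator repeats exactly when the seed is reused. This sketch does not use Rosetta's random number generator, and `toy_trajectory` is a hypothetical stand-in for a stochastic protocol.

```python
import random


def toy_trajectory(seed: int, n_steps: int = 5) -> list:
    """Return a sequence of pseudo-random 'moves' from a seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_steps)]


# Reusing the seed reproduces the trajectory; changing it does not.
same_seed = toy_trajectory(seed=1234) == toy_trajectory(seed=1234)
different_seed = toy_trajectory(seed=1234) == toy_trajectory(seed=4321)
print(same_seed, different_seed)
```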

In [ ]:

if not os.getenv("DEBUG"):
    with open(decoy_to_reproduce, "rb") as f:
        original_packed_pose = io.pose_from_pdbstring(bz2.decompress(f.read()).decode())
    original_pose = original_packed_pose.pose
    reproduced_pose = reproduced_packed_pose.pose

Assert that the sequences are identical:

In [ ]:

if not os.getenv("DEBUG"):
    assert original_pose.sequence() == reproduced_pose.sequence()

Assert that the `total_score`s are identical:

In [ ]:

if not os.getenv("DEBUG"):
    scorefxn = pyrosetta.create_score_function("ref2015.wts")
    assert scorefxn(original_pose) == scorefxn(reproduced_pose)

Assert that the C$_{\alpha}$–C$_{\alpha}$ root-mean-square deviation (RMSD) is `0.0` Å:

Note: There is no need to first superimpose the original_pose and reproduced_pose because they were both generated starting from the same input_packed_pose.

In [ ]:

if not os.getenv("DEBUG"):
    assert pyrosetta.rosetta.core.scoring.CA_rmsd(original_pose, reproduced_pose) == 0.0
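
For intuition, an RMSD computed without any superposition is the square root of the mean squared distance between matched atoms, so it is exactly zero only when every coordinate pair coincides. The sketch below computes this on made-up C$_{\alpha}$ coordinates in plain Python. It illustrates the formula only and is not a description of how `CA_rmsd` is implemented in Rosetta; the function name `rmsd_no_superposition` is hypothetical.

```python
import math


def rmsd_no_superposition(coords_a, coords_b) -> float:
    """RMSD between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("Coordinate lists must be non-empty and equally long")
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up C-alpha coordinates in angstroms.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reproduced = list(original)                           # identical copy
shifted = [(x + 1.0, y, z) for x, y, z in original]   # translated by 1 angstrom

print(rmsd_no_superposition(original, reproduced))
print(rmsd_no_superposition(original, shifted))
```

The identical copy gives an RMSD of zero, while the rigidly translated copy gives roughly one angstrom even though its shape is unchanged, which is why superposition matters when two structures do not share a starting frame.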

Congrats!

You have successfully reproduced a PyRosetta simulation using the pyrosetta.distributed.cluster module!
