This largely builds on the notebook Demonstration of PDBsum ligand interface data to dataframe script and extends it to get some more general information about the experimentally-determined structure besides the ligands, to help guide an answer to the Biostars post How to get specific information from a list of PDB ID and save all in a spreadsheet?. As discussed there, because this isn't derived data, there's no particular reason to use PDBsum for extracting this information; the PDBfinder database may be better. However, for getting derived data I already had the pipeline working with PDBsum, and it was just a matter of adapting a couple of scripts to get some details associated with the structures that are also listed at other places, such as the Protein Data Bank, the OCA Database and Browse, or Proteopedia.
Let's make a file listing PDB accession codes, each on a separate line, like we might get sent by someone looking for information on many PDB entries. (This is mostly so this notebook is self-contained.)
pdb_ids_each_online='''1wsv
1eve
1btn
1trn'''
%store pdb_ids_each_online >pdb_ids.txt
Writing 'pdb_ids_each_online' (str) to file 'pdb_ids.txt'.
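The `%store` magic above is IPython-specific. If you are preparing such a list outside a notebook, a plain-Python equivalent (a minimal sketch producing the same `pdb_ids.txt`) would be:

```python
# Plain-Python equivalent of the %store line above, for use outside IPython
pdb_ids_each_online = '''1wsv
1eve
1btn
1trn'''

with open("pdb_ids.txt", "w") as f:
    f.write(pdb_ids_each_online)
```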
Get the latest versions of the related scripts.
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "pdbsum_stats_and_info_adpated_example.py"
if not os.path.isfile(file_needed):
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}
file_needed = "pdb_ids_to_stats_and_info_df.py"
if not os.path.isfile(file_needed):
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbsum-utilities/{file_needed}
Say we are interested in the example PDBsum page from here, specifically the 'Top page' tab, and we'd like to mine out some details.
Quick example of use of pdbsum_stats_and_info_adpated_example.py
to get information on a single PDB entry.
First, along the lines of how you'd use it on the command line. (%run is special to notebooks; you'd replace %run
in the command below with python
or the appropriate call for Python on your own command line.)
%run pdbsum_stats_and_info_adpated_example.py 1eve
# read in the dataframe produced
import pandas as pd
df_made = pd.read_pickle("statsninfo_pickled_df.pkl")
df_made
Details for specified structure read and converted to a dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'statsninfo_pickled_df.pkl'
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1eve | 2.50 Å | 0.188 | NAG-NAG, NAG, E20 |
You could also use it inside a Jupyter notebook by importing the main function and then making a function call supplying the PDB code identifier of interest as an argument.
from pdbsum_stats_and_info_adpated_example import pdbsum_stats_and_info_adpated_example
df_func = pdbsum_stats_and_info_adpated_example("1trn")
df_func
Details for specified structure read and converted to a dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'statsninfo_pickled_df.pkl' Returning a dataframe with the information as well.
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1trn | 2.20 Å | 0.177 | ISP |
Note that this returned a dataframe directly, so we didn't have to read it back into memory as with the command-line version. Also note that it did indeed save a file, so we could take that and later use it elsewhere, bringing the dataframe back into memory without generating it first.
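Because the returned object is a regular Pandas dataframe, individual values can be pulled back out with standard indexing. Here is a minimal sketch; it uses a small stand-in dataframe matching the rendered table above, since with the real script you'd use the dataframe returned by the function call instead.

```python
# Stand-in dataframe matching the rendered table above; with the real
# script you'd use the dataframe returned by the function call instead
import pandas as pd

df_func = pd.DataFrame(
    {"PDB id": ["1trn"], "Resolution": ["2.20 Å"],
     "R value": [0.177], "Ligands": ["ISP"]})

# Pull single values out by row condition plus column label
resolution = df_func.loc[df_func["PDB id"] == "1trn", "Resolution"].item()
ligands = df_func.loc[df_func["PDB id"] == "1trn", "Ligands"].item().split(", ")
```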
This section will demonstrate getting information on several PDB entries provided as a file with each identifier code listed on a separate line.
A file listing the PDB id codes was already saved during the preparation. We can show it here.
!cat pdb_ids.txt
1wsv
1eve
1btn
1trn
Now we can point the script at that list; it will run the retrieval as many times as necessary to collect the information for each PDB code listed.
As above, first we demonstrate this similar to how it would be used on the command line. (%run is special to notebooks; you'd replace %run
in the command below with python
or the appropriate call for Python on your own command line.)
%run pdb_ids_to_stats_and_info_df.py pdb_ids.txt
# read in the dataframe produced
import pandas as pd
df_ids = pd.read_pickle("stats_and_info_for_PDBids_pickled_df.pkl")
df_ids
Parsing details from PDBsum ... Information for specified structures read and converted to a single dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'stats_and_info_for_PDBids_pickled_df.pkl'
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1wsv | 2.60 Å | 0.171 | SO4, THH |
| 1 | 1eve | 2.50 Å | 0.188 | NAG-NAG, NAG, E20 |
| 2 | 1btn | 2.00 Å | 0.205 | I3P |
| 3 | 1trn | 2.20 Å | 0.177 | ISP |
It is probably wise to make sure it works with the demonstration data first, and then replace the PDB codes in pdb_ids.txt,
or upload your own file listing PDB ID codes and point the script at it with one of the options demonstrated here.
You could do that in a notebook by importing the main function of the script and pointing it at the file listing the PDB codes.
from pdb_ids_to_stats_and_info_df import pdb_ids_to_stats_and_info_df
df = pdb_ids_to_stats_and_info_df("pdb_ids.txt")
df
Parsing details from PDBsum ... Information for specified structures read and converted to a single dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'stats_and_info_for_PDBids_pickled_df.pkl' Returning a dataframe with the information as well.
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1wsv | 2.60 Å | 0.171 | SO4, THH |
| 1 | 1eve | 2.50 Å | 0.188 | NAG-NAG, NAG, E20 |
| 2 | 1btn | 2.00 Å | 0.205 | I3P |
| 3 | 1trn | 2.20 Å | 0.177 | ISP |
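With everything in one dataframe, standard Pandas operations can answer follow-up questions, such as which entries meet a resolution cutoff. A sketch, assuming the Resolution column holds strings like '2.60 Å' as in the table above; the stand-in data here matches that table, whereas the real dataframe comes from the function call.

```python
# Sketch: filter such a dataframe for higher-resolution structures.
# Stand-in data matching the table above; the real dataframe comes
# from pdb_ids_to_stats_and_info_df("pdb_ids.txt").
import pandas as pd

df = pd.DataFrame({
    "PDB id": ["1wsv", "1eve", "1btn", "1trn"],
    "Resolution": ["2.60 Å", "2.50 Å", "2.00 Å", "2.20 Å"],
    "R value": [0.171, 0.188, 0.205, 0.177],
    "Ligands": ["SO4, THH", "NAG-NAG, NAG, E20", "I3P", "ISP"]})

# Strip the Å unit and cast to float so numeric comparisons work
df["Resolution (Å)"] = df["Resolution"].str.split().str[0].astype(float)
high_res = df[df["Resolution (Å)"] <= 2.2]
```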
You may want to get a sense of what else you can do with that dataframe by examining the previous notebook in this series, Demonstration of PDBsum ligand interface data to dataframe script.
Next I'll demonstrate how to save it as text for use elsewhere, such as in Excel. Or even make a spreadsheet file in the Excel format directly.
If you do look at the previous notebooks in this series, you'll see that I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, almost everyone is accustomed to dealing with text-based tables or even Excel. In fact, a text-based table, perhaps tab- or comma-delimited, would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be important if you are going to different versions of Python/Pandas as I briefly mentioned parenthetically above.
First, generating a text-based table.
#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file
df.to_csv('pdbsum_info.tsv', sep='\t',index = False) #add `,header=False` to leave off header, too
Because df.to_csv()
defaults to comma as the separator, you can simply use df.to_csv('example.csv', index=False)
for comma-delimited (comma-separated) files.
You can see that worked by looking at the text file made with the next command.
!cat pdbsum_info.tsv
PDB id	Resolution	R value	Ligands
1wsv	2.60 Å	0.171	SO4, THH
1eve	2.50 Å	0.188	NAG-NAG, NAG, E20
1btn	2.00 Å	0.205	I3P
1trn	2.20 Å	0.177	ISP
If you need to go back from a tab-separated table to a dataframe, you can run something like the following cell.
reverted_df = pd.read_csv('pdbsum_info.tsv', sep='\t')
reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too
For a comma-delimited (CSV) file you'd use df = pd.read_csv('example.csv')
because pd.read_csv()
defaults to comma as the separator (the sep
parameter).
You can verify what was read from the text-based table by viewing it with the next line.
reverted_df
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1wsv | 2.60 Å | 0.171 | SO4, THH |
| 1 | 1eve | 2.50 Å | 0.188 | NAG-NAG, NAG, E20 |
| 2 | 1btn | 2.00 Å | 0.205 | I3P |
| 3 | 1trn | 2.20 Å | 0.177 | ISP |
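If you want to be sure nothing is lost in the text detour, you can round-trip a dataframe and compare. A minimal sketch with stand-in data matching the tables above (the filename `roundtrip_check.tsv` is just for illustration):

```python
# Round-trip check: write a small dataframe to TSV, read it back, and
# confirm nothing changed (stand-in data matching the tables above)
import pandas as pd

df_check = pd.DataFrame({
    "PDB id": ["1btn", "1trn"],
    "Resolution": ["2.00 Å", "2.20 Å"],
    "R value": [0.205, 0.177],
    "Ligands": ["I3P", "ISP"]})

df_check.to_csv("roundtrip_check.tsv", sep="\t", index=False)
reverted = pd.read_csv("roundtrip_check.tsv", sep="\t")

print(df_check.equals(reverted))  # True when the round trip is lossless
```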
Generating an Excel spreadsheet from a dataframe.
Because this is a specialized need, a special module is required that I didn't bother installing by default, so it needs to be installed before generating the Excel file. Running the next two cells will handle both the install and the export.
%pip install openpyxl
# save to Excel; `index=False` leaves the dataframe index out of the spreadsheet
df.to_excel('pdbsum_data.xlsx',index = False,) # after openpyxl installed
Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9) Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0) Note: you may need to restart the kernel to use updated packages.
You'll need to download the file to your computer first and then view it locally, as there is no Excel viewer in the Jupyter environment.
Additionally, it is possible to add styles to dataframes, and styles such as shading of cells and coloring of text will be carried over to the Excel document made as well. That is covered elsewhere in resources referenced by other notebooks in this series.
Excel files can be read into Pandas dataframes directly without needing a text-based intermediate first.
# read Excel
df_from_excel = pd.read_excel('pdbsum_data.xlsx',engine='openpyxl') # see https://stackoverflow.com/a/65266270/8508004 where notes xlrd no longer supports xlsx
That can be viewed to convince yourself it worked by running the next command.
df_from_excel
| | PDB id | Resolution | R value | Ligands |
|---|---|---|---|---|
| 0 | 1wsv | 2.60 Å | 0.171 | SO4, THH |
| 1 | 1eve | 2.50 Å | 0.188 | NAG-NAG, NAG, E20 |
| 2 | 1btn | 2.00 Å | 0.205 | I3P |
| 3 | 1trn | 2.20 Å | 0.177 | ISP |
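Since the Biostars question was about collecting everything in a spreadsheet, it may also be worth noting that several dataframes can go into one workbook as separate sheets via pd.ExcelWriter. A sketch, again requiring openpyxl as installed above; the filename, sheet names, and the small stand-in dataframes here are just for illustration:

```python
# Sketch: write several dataframes to one workbook as separate sheets
# with pd.ExcelWriter (requires openpyxl for .xlsx files)
import pandas as pd

stats_df = pd.DataFrame({"PDB id": ["1eve"], "Resolution": ["2.50 Å"]})
ligands_df = pd.DataFrame({"PDB id": ["1eve"], "Ligands": ["NAG-NAG, NAG, E20"]})

with pd.ExcelWriter("pdbsum_multi.xlsx", engine="openpyxl") as writer:
    stats_df.to_excel(writer, sheet_name="stats", index=False)
    ligands_df.to_excel(writer, sheet_name="ligands", index=False)
```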