#!/usr/bin/env python
# coding: utf-8
# # Guide to accessing and reviewing the bendit results
#
# Potentially, a lot of data can be generated when the demo pipeline is run, and steps were taken to make handling all that as easy as possible. However, accessing that collected data may not be obvious to the uninitiated. This notebook covers what the 'realistic' pipeline [here](index.ipynb) produces and how to access it. Furthermore, it touches on how to use Jupyter/Python to conveniently view and further process all the data.
#
# The hope is that this covers what is necessary to use the generated output. If you just plugged your sequences into the demo pipeline, you should be able to follow along to get the bendIt results for all your data. If you are looking to adapt the pipeline, this notebook will give you a flavor of the types of output you may wish to gather, or leave out, in your adapted bendIt pipeline.
#
# ### Navigating this Jupyter notebook page
#
# Because this notebook covers all the possible data types from the bendIt analysis run and various ways to access the information, it is rather long. To make it easier to navigate you can click on the 'Table of Contents' icon ![svg image](data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20xmlns%3Axlink%3D%27http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%27%20version%3D%271.1%27%20width%3D%2724%27%20height%3D%2724%27%20viewBox%3D%270%200%2024%2024%27%3E%3Cpath%20fill%3D%27%23616161%27%20d%3D%27M7%2C5H21V7H7V5M7%2C13V11H21V13H7M4%2C4.5A1.5%2C1.5%200%200%2C1%205.5%2C6A1.5%2C1.5%200%200%2C1%204%2C7.5A1.5%2C1.5%200%200%2C1%202.5%2C6A1.5%2C1.5%200%200%2C1%204%2C4.5M4%2C10.5A1.5%2C1.5%200%200%2C1%205.5%2C12A1.5%2C1.5%200%200%2C1%204%2C13.5A1.5%2C1.5%200%200%2C1%202.5%2C12A1.5%2C1.5%200%200%2C1%204%2C10.5M7%2C19V17H21V19H7M4%2C16.5A1.5%2C1.5%200%200%2C1%205.5%2C18A1.5%2C1.5%200%200%2C1%204%2C19.5A1.5%2C1.5%200%200%2C1%202.5%2C18A1.5%2C1.5%200%200%2C1%204%2C16.5Z%27%20%2F%3E%3C%2Fsvg%3E%0A) on the left sidebar to bring up a table of contents in the left panel.
#
# You can get a sense of how to use the Table of Contents to navigate a notebook from the animation below, despite the fact the animation features another notebook.
#
#
#
#
# When you need to navigate the directory structure or access files in the file browser, switch back to the file browser by clicking on the 'folder' icon
# on the top of the left sidebar.
#
# ## Saving files or bundles of files from the session to local
#
# This notebook includes ways to make various files. You'll need to download those from this **temporary, remote session** to your local computer.
#
# To make this easier once you start to generate a lot of files, the ability to right-click on any directory in the file navigator and choose to save it as an archive has been added. It will resemble the following animation:
#
#
#
# For example, if you unpack the archive as instructed below you may end up making things in `unpack` and want that entire directory. Therefore in the file navigator panel on the left, you'd go to a point in the directory hierarchy above that `unpack` directory and right-click on it and choose `Download as an archive`. You can make a new directory any time using the 'New folder' icon above the left panel and then use drag and drop in the file navigator to move any files you want into it.
#
# The archive defaults to being `.gzip` format. You can alter that format if you'd prefer by going to the JupyterLab menu bar along the top of your browser window and selecting `Settings` > `Advanced Settings Editor`, and then selecting `Archive` in the `Settings` window that comes up in the main window. There you can edit the `User Preferences` on the right to override the system default. For example, maybe you'd prefer `tar.gz`, and so you'd edit the contents of `User Preferences` to be `{"format": "tar.gz"}`. Be sure to click on the disk icon in the upper right to save your new settings.
# ## Accessing the bendIt results: Starting up an active session
#
# If you already have an active Jupyter session going, you can skip this step.
#
# However, if you have a previously saved result from the demo pipeline and you are now returning to it and looking to access the contents of the archive, the best way to get started is to start a session by going [here](https://github.com/fomightez/bendit-binder) and clicking on the `launch bend.it` badge. When the session starts, select the `Accessing_and_reviewing_the_bendit_results.ipynb` notebook from the file navigation panel on the left and work through the steps below.
#
# The uncompressing step below should work on a Unix-style command line. However, some of the advanced steps under the section entitled 'Using Python to review the data or customize the plots' involve modules/packages that aren't necessarily in a standard Python installation. By running in the recommended environment as directed above, the code is more likely to work without hiccups.
# ## Accessing the bendIt results: Uncompress the Archive
#
# This notebook assumes you just ran the pipeline and are trying to access the results. In that case, there are a number of files in the current directory besides the archive. To make things easier, I suggest making a new directory with a simple name, dragging the archive into that folder, and working there. To make things match with the demo, I am going to write out those steps as commands, too. Feel free to run the cells or to do it by hand. If you make a different 'unpack' directory, you'll need to adjust things below accordingly.
#
# The exclamation mark in front of a shell command tells Jupyter to run that command in the shell and not as Python.
# In[1]:
get_ipython().system('mkdir unpack')
get_ipython().system('mv bendit_analysis*.tar.gz unpack/')
# To make things in this notebook work, you'll need to switch the current working directory over to where we are going to unpack the archive. We'll use the Jupyter magic command `%cd` to change the working directory for all subsequent cells in the notebook. If we just used `!cd`, it would only change the directory for that cell.
# In[2]:
get_ipython().run_line_magic('cd', 'unpack')
# Note that you can check the current working directory at any time with `pwd`. (Most shell commands need an exclamation point, but a few were added to Jupyter so they work without it, and `pwd` is one.) Also of note is that this location is completely independent of where the file navigation pane on the left side of this window may be showing.
# In[3]:
pwd
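# As a rough illustration of why `!cd` does not persist, the sketch below uses only the Python standard library and is not part of the pipeline. A shell command runs in a child process, so a `cd` there leaves the notebook's own working directory untouched, whereas `os.chdir` (roughly what `%cd` does for the kernel) changes it for everything that follows. The temporary directory is just a stand-in for `unpack`.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as scratch:
    # Like `!cd`: the directory change happens in a child shell and is lost.
    subprocess.run(f'cd "{scratch}"', shell=True, check=True)
    dir_after_shell_cd = os.getcwd()

    # Like `%cd`: the change applies to this Python process itself.
    os.chdir(scratch)
    dir_after_chdir = os.getcwd()

    # Step back out so the temporary directory can be removed cleanly.
    os.chdir(start_dir)

print("unchanged after shell cd:", dir_after_shell_cd == start_dir)
print("changed after os.chdir:", dir_after_chdir != start_dir)
```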
# To **unpack the archive**, we'll run the command in the following cell.
# You need to edit the command so it will extract your file. In other words, the part after the `xzf` has to be changed to match the actual file name of the archive you are working with. To make it easier, you should be able to click after the default code in the cell below and press the `Tab` key on your keyboard, and Jupyter will auto-complete the file name for you after a slight pause.
#
# (Note that I am keeping this untarring step as generic as possible so that it would work as written on any unix-style command line without the exclamation point.)
# In[4]:
get_ipython().system('tar xzf bendit_analysis')
# If you ran the above cell and saw anything like `tar (child): bendit_analysisZZZZZZZZZ.tar.gz: Cannot open: No such file or directory` (where `ZZZZZZZZZ` is used to represent the date time stamp), it simply means you had the file name wrong. You'll need to edit it and run the cell again.
#
# If things worked, you should just see the asterisk to the left of the cell turn to a number. If you changed the file navigation pane on the left-side of this browser window over to the 'unpack' directory, after a moment you should see more files show up in the file pane. Don't worry if you didn't switch. We are going to explore the contents using commands next anyways.
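#
# If you would rather not type the exact archive name, the unpacking step can also be sketched in pure Python with the standard-library `tarfile` and `glob` modules. This is only an alternative under stated assumptions, not what the pipeline itself runs: the helper name `extract_archives` is made up for illustration, and the default pattern simply mirrors the `bendit_analysis*.tar.gz` pattern used in the `mv` command above. Only extract archives you trust.

```python
import glob
import tarfile
from pathlib import Path


def extract_archives(pattern="bendit_analysis*.tar.gz", dest="."):
    """Extract every gzipped tar archive matching `pattern` into `dest`.

    Hypothetical helper for illustration; returns the archive file names handled.
    """
    extracted = []
    for archive in sorted(glob.glob(pattern)):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=dest)
        extracted.append(Path(archive).name)
    return extracted


# In the 'unpack' directory this handles whichever archive is present;
# it prints an empty list when nothing matches the pattern.
print(extract_archives())
```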
# ## Overview of the Unpacked Items
#
# If all is correct and you used the `%cd` command to previously switch to the 'unpack' directory, running the next cell will show the contents of the current working directory so we can begin to explore the contents of the now-unpacked archive.
# In[6]:
ls -lh
# That lists out the contents of the directory. The options added along with the `ls` command make the output more readable. In particular, the `h` in `-lh` means the file sizes are shown in human-readable units, and the `l` in `-lh` says to list the details in long form and not just the names.
#
# You'll see there is much more than the compressed archive that was originally added when we made the directory just a few cells back. These are the results of the bendIt run. The following section will go through what these are and how to use them.
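#
# Since the file extensions are the main clue to what each file is, it can help to group the directory listing by extension. The following is a small standard-library sketch for that; the function name `group_by_extension` is invented here and is not part of the pipeline.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(directory="."):
    """Map each file extension to the sorted names of the files that carry it."""
    groups = defaultdict(list)
    for path in Path(directory).iterdir():
        if path.is_file():
            # Files without a suffix are collected under a readable label.
            groups[path.suffix or "(no extension)"].append(path.name)
    return {ext: sorted(names) for ext, names in sorted(groups.items())}


# Summarize the current working directory, e.g. 'unpack' after extraction.
for extension, names in group_by_extension(".").items():
    print(f"{extension}: {len(names)} file(s)")
```

# Note that `Path.suffix` only reports the final extension, so the `.tar.gz` archive is listed under `.gz`.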
#
# ----
# ## About the Contents
# If you used your own data, your results will be different but the types will be the same. I am going to describe the contents as if you ran the demo sequences; however, mostly you just need to pay attention to the file extensions, or sometimes the start of the name, to tell which file types correspond.
#
# After a brief overview, I am going to add some details about most of the types.
# Typical contents overview:
#
# - Log file
# - Image files for the plots
# - a Jupyter notebook for reviewing all the plots en masse
# - serialized (pickled) bendIt data
# - bendIt data as tabular text
# - Nucleotide composition detail
# - serialized processing information and results on a per sample basis
# - input files
# - raw gnuplots
#
# ### Log file
#
# The Log file will look something like `LOG_baZZZZZZZZZ.txt` with the `` showing the abbreviation for the month and the `ZZZZZZZZ` portion being derived from a time date stamp.
#
# This contains much of what was shown as the demo pipeline ran, with some additional details from the actual bendIt job(s) for each sequence.
#
# At the end is a summary. *When `lightweight_archive` is true, only the summary is in the Log file.*
# ### Image files for the plots
#
# The plots have been saved in two image formats. The first part of the name of each is derived from the sample_set name and the individual sample name. The two extensions distinguish them:
#
# - `.png`
# Examples from the demo input `demo_A.png` and `demo_B.png`.
# This is a raster/bitmap file format made of pixels that you are probably familiar with. While it is great for easy viewing, since a lot of software, including JupyterLab, can handle it, please see the note below suggesting `.svg` for scaling up, or the Python section that discusses making individual plots larger and making new image files in `.png` format from that.
#
# - `.svg`
# Examples from the demo input `demo_A.svg` and `demo_B.svg`.
# This indicates the SVG (scalable vector graphics) file format. SVG is really the best choice for scaling up or adapting further as it offers the most control and no loss of resolution. I suggest using Adobe Illustrator or Inkscape for scaling and customizing. This is what you'll want to use if you don't want to remake the plot and are looking to customize it for publication. Any modern browser can view `.svg` files, and fortunately an SVG viewer is built right into JupyterLab.
#
# When `lightweight_archive` or `lightweight_with_images` is true, these image files are not saved as part of the archive. In that case, you'll want to see sections below to make the image files 'after-the-fact' from the tables of data for each sequence.
# ### Jupyter notebook for reviewing all the plots en masse
#
# The file name for this file will resemble `plots4review_from-baZZZZZZZZ.ipynb`, with the date-time stamp matching the archive file name and log file name.
#
# The image files discussed above are nice but not that easy to view unless you bring them local and use your file browser. Alternatively, you can browse them right in the session by opening this notebook and then opening subsequent views of it. (**ADD MORE DETAILS ON HOW TO OPEN MULTIPLE VIEW AND ADD IMAGE EXAMPLE**)
#
# At the top of this notebook, I emphasized you'll want to be in an active notebook session for working with the output. One of the main reasons is that it offers a standard environment where the archive can be unpacked easily, and Jupyter offers nice viewers for many of the data types. This is the case for the 'Review' notebook. The plots are already part of the notebook, and so nothing has to be run again, but a Jupyter environment is useful for viewing it. Alternatively, nbviewer can be used to view a 'static' form of the notebook if you don't mind placing the notebook file somewhere [the online nbviewer](https://nbviewer.jupyter.org/) can be pointed at it. Note the static form will look much like it does in Jupyter, but you cannot modify or run any cells, or modify the text content further.
#
# Keep in mind for sharing this notebook with anyone not fluent in Jupyter notebook use, you can use `File` > `Export Notebook As..` > `Export Notebook to PDF` to also generate a PDF and then download that file as well.
#
# Note that the sample names on the plots should look like the original input sample names but might not match sample names seen in the 'serialized processing information and results on a per sample basis' file, `seqs_dfs_and_plots_for_each_set.pkl`. This is because certain characters can cause issues with the processing steps and were eliminated from the sample names for processing. In the plot representations, efforts were made to substitute back in names matching a pattern used by a user.
#
# When `lightweight_archive` or `lightweight_with_images` is true, the Jupyter notebook displaying all the plots is not saved as part of the archive. In that case, there are several options for accessing similar content. The one I suggested in the Jupyter notebook running the demonstration pipeline was to save the notebook or export a PDF of the notebook to your local computer. However, using the contents of `seqs_dfs_and_plots_for_each_set.pkl`, this notebook could be made after-the-fact. That is described in the section 'Making the review Jupyter notebook after-the-fact' as part of 'Using Python to review the data or customize the plots' below.
# ### Serialized (pickled) bendIt data
#
# This will resemble files looking like `demo_A.pkl`.
#
# The data plotted from the bendIt analysis is stored in a compressed form that can easily be read back in as a Pandas dataframe for convenient use in the Jupyter environment or further analysis. Accessing these Pandas dataframes in the Jupyter environment will be discussed below as part of the 'Using Python to review the data or customize the plots' section.
# ### bendIt data as tabular text
#
# This will resemble files looking like `demo_A.tsv`.
#
# The data plotted from the bendIt analysis is stored in a tab-delimited tabular text form that can easily be used in standard spreadsheets, such as Excel or Google Sheets. Jupyter allows easy viewing of these as well. You can click on the 'frame' symbol next to the file name in the file navigation panel on the left, and they'll open as full-featured spreadsheet-like views. If you right-click and select `Open with ...` > `editor`, you can see the text form that underlies it.
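#
# For use outside a spreadsheet or Pandas, the tab-separated form can be read with nothing more than Python's standard `csv` module. The sketch below uses an in-memory stand-in rather than a real results file; the column names `position` and `value` are invented for illustration and are not necessarily those in your `.tsv` files.

```python
import csv
import io

# Hypothetical stand-in for the contents of a file such as `demo_A.tsv`;
# the column names are assumptions made only for this example.
example_tsv = "position\tvalue\n1\t4.2\n2\t3.9\n"

# For a real file, swap the StringIO object for open("demo_A.tsv", newline="").
with io.StringIO(example_tsv) as handle:
    rows = list(csv.DictReader(handle, delimiter="\t"))

# Each row is a dict keyed by the header line; all values arrive as strings.
print(len(rows), "rows; first row:", rows[0])
```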
#
# ### Nucleotide composition detail
#
# These files will look like sample set names followed by `_cassettesGC.pkl` and `_cassettesGC.tsv` for the cassette sequences, and sample set names followed by `_mergedGC.pkl` and `_mergedGC.tsv` for the sequences of the combined cassette sequences merged with the defined flanking sequences.
#
# The examples from the demonstration data are:
#
# - `demo_cassettesGC.pkl`
# - `demo_cassettesGC.tsv`
# - `demo_mergedGC.pkl`
# - `demo_mergedGC.tsv`
#
# These provide a breakdown of nucleotide composition and %G+C for every cassette sequence. In the serialized (pickled) data, it is a dataframe with each sample as a row. In the tabular text data, it is a tab-separated file with each sample as a row. The rank of %G+C from lowest to highest is the final column.
#
# Similar data is provided for every merged sequence, where the cassette is flanked by the defined sequences.
#
# Accessing the tabular text is similar to what is described under 'bendIt data as tabular text'.
#
# Accessing the serialized Pandas dataframes in the Jupyter environment will be discussed below as part of the 'Using Python to review the data or customize the plots' section.
#
# The source dataframes are also included in the 'serialized processing information and results on a per sample basis' file, `seqs_dfs_and_plots_for_each_set.pkl`.
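#
# To make the %G+C and rank columns described in this section concrete, here is a minimal sketch of how such values can be computed. It is an illustration under assumptions, not the pipeline's own code: the sequences are invented toy examples, only A, C, G, and T are counted, and how the pipeline treats ties or other characters is not covered here.

```python
from collections import Counter


def gc_percent(sequence):
    """Return %G+C for a nucleotide sequence, ignoring letter case."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    return 100.0 * (counts["G"] + counts["C"]) / total if total else 0.0


# Invented toy sequences, not the demo cassettes.
toy_sequences = {"A": "ATATGC", "B": "GGCCAT", "C": "AAAATT"}

percents = {name: gc_percent(seq) for name, seq in toy_sequences.items()}

# Rank 1 is the lowest %G+C, matching the lowest-to-highest order described above.
ranked_names = sorted(percents, key=percents.get)
ranks = {name: position for position, name in enumerate(ranked_names, start=1)}

for name in toy_sequences:
    print(f"{name}: {percents[name]:.1f}% G+C, rank {ranks[name]}")
```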
# ### Serialized processing information and results on a per sample basis
#
# This file will be named `seqs_dfs_and_plots_for_each_set.pkl`.
#
# This holds almost all the input sequences and output, stored on a per sample set and per sample basis using Python dictionaries. This is mainly meant for advanced use. It actually gets used at the top of the `Notebook for reviewing all the plots en masse` to render all the plots. It will also be used below as an example of accessing the dataframes. Beyond being used to easily render all the plots, it is really just there in case something not collected elsewhere is needed or has to be checked.
#
# In the `seqs_dfs_and_plots_for_each_set.pkl`, there is a value for each sample set. That value is a list of dictionaries, or in two cases a dataframe. The order of the items for each sample set is as follows:
#
# - cassette sequences processed keyed on name
# - sequences of the cassette sequences merged to the defined flanking sequences processed keyed on name
# - breakdown of nucleotide composition and %G+C for every cassette sequence (dataframe with each sample as a row)
# - breakdown of nucleotide composition and %G+C for every merged sequence where cassette flanked by the defined sequences (dataframe with each sample as a row)
# - dataframes produced by bendIt analysis & used to make the plots keyed by sample
# - plots as Python objects produced from each dataframe keyed by sample (**not saved in 'lightweight' settings**)
#
# As the information is stored serialized (pickled), it has to be unpickled first. The following code can be used to unpickle it and bring it into an active Jupyter notebook namespace:
#
# ```python
# import pickle
# with open("seqs_dfs_and_plots_for_each_set.pkl", "rb") as f:
#     seqs_dfs_and_plots_per_sample_set = pickle.load(f)
# ```
#
# You'll note that this code gets used in the notebook for reviewing the plots en masse as it is from this source that the plots were added to the review notebook.
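#
# The nesting described in this section can be hard to picture, so the following sketch builds a mock collection with the same shape and unpacks it by position. Everything inside it is a placeholder: plain strings stand in for the real sequences, dataframes, and plot objects, and the round trip happens in memory rather than through the real `.pkl` file.

```python
import pickle

# Placeholder strings stand in for the real sequences, dataframes, and plots.
mock_collection = {
    "demo": [
        {"A": "cassette seq A", "B": "cassette seq B"},          # 0: cassette sequences
        {"A": "merged seq A", "B": "merged seq B"},              # 1: merged sequences
        "cassette %G+C dataframe",                               # 2: one dataframe
        "merged %G+C dataframe",                                 # 3: one dataframe
        {"A": "bendIt dataframe A", "B": "bendIt dataframe B"},  # 4: bendIt dataframes
        {"A": "plot A", "B": "plot B"},                          # 5: plots
    ]
}

# Round-trip through pickle in memory, mirroring a dump and load of the real file.
restored = pickle.loads(pickle.dumps(mock_collection))

for set_name, items in restored.items():
    # Only the first five positions are unpacked here, since the plots are
    # described above as not being saved in 'lightweight' settings.
    cassettes, merged, cassettes_gc, merged_gc, bendit_dfs = items[:5]
    print(set_name, sorted(cassettes), sorted(bendit_dfs))
```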
#
# ### Input files
#
# The sequence files of the cassette sequences were stored in FASTA format in the archive as well, in order to provide a more intact artifact of all stages of the run.
#
# As they are in FASTA format, they'll most likely end in `.fa` or `.fasta`.
#
# The input files may also exist in sanitized and unsanitized form if the sample names included characters that would have caused issues in the processing steps.
#
# ### Raw gnuplots
#
# These files will look like sample names followed by `_output.png`.
#
# The raw gnuplots made by bendIt for each sample are saved depending on the setting of `include_gnuplots`, which is moot in case `lightweight_archive` is `True`. If `lightweight_archive` is set to true, or if `include_gnuplots` is `False` in the case where `lightweight_archive` is not true, these 'raw' gnuplots won't be included in the archive.
#
# Overall these should resemble the plots generated for each cassette merged into the flanking sequences analyzed and are only meant for verification of that and troubleshooting. Keep in mind that for short sequences, the right side of the gnuplot will not represent the correct form as extra sequences were added to avoid a segmentation fault when bendIt was run.
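#
# The rule for whether the raw gnuplots end up in the archive, as described in this section, boils down to a two-flag check. The function below is only a restatement of that description for clarity; its name is made up and it is not taken from the pipeline's code.

```python
def gnuplots_in_archive(lightweight_archive, include_gnuplots):
    """Return True when the raw gnuplots would be kept, per the rule above."""
    # `lightweight_archive` overrides `include_gnuplots` entirely.
    return (not lightweight_archive) and include_gnuplots


# Print the outcome for every combination of the two settings.
for lightweight in (True, False):
    for include in (True, False):
        kept = gnuplots_in_archive(lightweight, include)
        print(f"lightweight_archive={lightweight}, include_gnuplots={include}: {kept}")
```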
# ----
#
#
# ----
#
# ## Using Python to review the data or customize the plots
#
# A number of options readily exist for using and further processing the unpacked data. This section illustrates some of those. A good portion at the end focuses on accessing the data in the 'serialized processing information and results on a per sample basis' file, `seqs_dfs_and_plots_for_each_set.pkl`. That portion is probably best considered 'advanced' as it requires more understanding of Python syntax to fully understand what is going on. Most of the other examples only require name changes to get to data.
#
# ----
#
# ### Viewing dataframes for the plotted data
#
# While JupyterLab adds big improvements for viewing the CSVs/TSVs derived from (or which can be the source of) such dataframes, viewing and handling the dataframes within Jupyter is going to come up. Jupyter is particularly good at rendering Pandas dataframes nicely.
#
# This section will describe accessing the serialized (pickled) bendIt data in files looking like `demo_A.pkl`. This mainly serves as an introduction to illustrate the process and handling dataframes.
#
# If you are looking to use the bendIt data that is actually plotted elsewhere, you'd probably want what is described above as 'bendIt data as tabular text', which is in the files looking like `demo_A.tsv`, where the sample set and sample names are in the file name. The dataframe form is nice if you are continuing on with using Python to examine the data.
#
# Let's illustrate viewing a dataframe in Jupyter by bringing the serialized form in.
#
# It would be fairly easy to bring in the TSV form to a dataframe as well, but the serialized just needs to be read in at this point.
#
# As the command shows in the next cell, we can just put the file name of the serialized dataframe in a command like `df = pd.read_pickle('demo_A.pkl')` where you'd replace `demo_A.pkl` with your file name of interest.
# In[7]:
import pandas as pd
df = pd.read_pickle('demo_A.pkl')
# To view a representation of the dataframe, we can call it now since we defined it as `df` above:
# In[8]:
df
# We could just look at the top.
# In[17]:
df.head()
# Or look at the end:
# In[18]:
df.tail()
# Or easily view an overview of the data contents.
# In[19]:
df.describe()
# I include a more detailed introduction to dataframes [here](https://nbviewer.jupyter.org/github/fomightez/blast-binder/blob/master/notebooks/BLAST%20on%20Command%20Line%20and%20Integrating%20with%20Python.ipynb#Demonstrating-the-Utility-of-Having-the-BLAST-Results-in-Python) and [here](https://nbviewer.jupyter.org/github/fomightez/ptmbr-accompmatz/blob/master/notebooks/PatMatch%20with%20Python%20basics.ipynb#Demonstrating-the-Utility-of-Having-the-PatMatch-Data-in-Python).
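#
# As mentioned earlier in this section, the TSV form can be brought into a dataframe about as easily as the pickled form. The sketch below shows both routes giving the same dataframe. It assumes Pandas is installed, as it is in the recommended session, and it uses an invented two-column table in place of a real results file, so the column names are not those of the actual bendIt output.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Invented table standing in for a file such as `demo_A.tsv`.
example_tsv = "position\tvalue\n1\t4.2\n2\t3.9\n3\t5.1\n"

# For a real file, pass its name instead: pd.read_csv("demo_A.tsv", sep="\t")
df_from_tsv = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# Save the same data in pickled form and read it back, as done for `demo_A.pkl`.
with tempfile.TemporaryDirectory() as scratch:
    pickle_path = Path(scratch) / "example.pkl"
    df_from_tsv.to_pickle(pickle_path)
    df_from_pickle = pd.read_pickle(pickle_path)

print("same contents either way:", df_from_tsv.equals(df_from_pickle))
```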
# ----
#
# ### Understanding sample names and how serialized information keyed
#
# This section will introduce accessing the contents of the 'serialized processing information and results on a per sample basis' file, `seqs_dfs_and_plots_for_each_set.pkl`, to get a sense of how the information is keyed.
#
# The next cell unpickles the `seqs_dfs_and_plots_for_each_set.pkl` as described under the 'Serialized processing information and results on a per sample basis' section.
# In[ ]:
import pickle
with open("seqs_dfs_and_plots_for_each_set.pkl", "rb") as f:
    seqs_dfs_and_plots_per_sample_set = pickle.load(f)
# The sample set labels are the overarching keys in this collection.
# In[ ]:
seqs_dfs_and_plots_per_sample_set.keys()
# There is only one sample set in the demonstration set as there is only one sequence file. The name of that sample set is `demo` for the demonstration sequence file.
#
# As described under the section 'Serialized processing information and results on a per sample basis', there are several items for each sample set, and for all except two, those items are further keyed on a per sample basis. (The '%G+C breakdown' dataframes don't have further 'keyed' contents per se, but they have each sample as a row.) For example, the first listed item is 'cassette sequences processed keyed on name'. And so to access that item, we'd use the following:
# In[ ]:
seqs_dfs_and_plots_per_sample_set['demo'][0]
# The zero in the brackets at the end specifies the first item in Python since the language is zero-indexed. In other words, the first item in a list in Python is referenced by the index 'zero'; whereas, it is generally common to number the first item as 'one'.
#
# That first item is actually a dictionary of keys and corresponding values. The dictionary keys are the sample names, as can be seen by listing the keys using the following code:
# In[ ]:
seqs_dfs_and_plots_per_sample_set['demo'][0].keys()
# We see the labels for sample sequences 'A' and 'B' that were present in `demo_sample_set.fa`.
#
# If we want to see the corresponding value, which in the case of this dictionary is a nucleotide sequence, we can use a particular key to access it. For example:
# In[ ]:
seqs_dfs_and_plots_per_sample_set['demo'][0]['B']
# Using that code, the output of that cell above shows the cassette sequence corresponding to sample sequence 'B'.
#
# Note when you see the sample names, there is a chance they may not correspond to the specific text you specified in your FASTA file. This is because certain characters, such as `|` along with parentheses and slashes, don't play well with the processing approach used here, and so if the text for the sample names contained those in the FASTA description, they have been sanitized to avoid issues. (Note for a certain pattern of labels, the names are corrected in the displayed plot titles.)
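#
# To give a feel for what such sanitizing can look like, here is a purely hypothetical sketch. The pipeline's actual rules are not shown in this notebook, so the function name, the underscore replacement, and the exact character set below are all assumptions made for illustration.

```python
import re


def sanitize_name(name, replacement="_"):
    """Replace pipes, parentheses, and slashes with `replacement`.

    Hypothetical example only; the real pipeline may sanitize differently.
    """
    return re.sub(r"[|()/\\]", replacement, name)


# A made-up FASTA-style label containing the troublesome characters.
print(sanitize_name("sp|P12345|(test)/v1"))
```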
#
# We'll use similar approaches below to access some of the other items that are stored for each sample set. For example, in the sub-section 'Accessing the nucleotide composition/%G+C information via Python', we'll see `seqs_dfs_and_plots_per_sample_set['demo'][2]` used to access the third item in the serialized data, which is the dataframe with a breakdown of the nucleotide composition and the %G+C of the cassette sequences.
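#
# The positional indexing used in this section can be tried out on a plain Python list. The labels below are just shorthand for the six items listed under 'Serialized processing information and results on a per sample basis'; they are not the real contents.

```python
# Shorthand labels for the six per-sample-set items, in their stored order.
item_labels = [
    "cassette sequences",
    "merged sequences",
    "cassette %G+C dataframe",
    "merged %G+C dataframe",
    "bendIt dataframes",
    "plots",
]

first_item = item_labels[0]   # index 0 is the first position
third_item = item_labels[2]   # index 2 is the third position
last_item = item_labels[-1]   # negative indices count back from the end

print(first_item, third_item, last_item, sep=" | ")
```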
# ### Looking at specific plot and adjusting the plots via Python
#
# If you used the lightweight setting and want to generate image files for all the plots, see 'Generate image files for all plots via Python' below. This section is meant for when you are just interested in the image files for a few plots.
#
# If you are looking to scale or edit just a few of the plots, I suggest you use the `.svg` file with the images as scalable vector graphics.
#
# However, maybe you used the `lightweight_archive` setting previously and want to generate images from the plots now or you want to make the plots larger. This section will illustrate accessing the contents in the 'serialized processing information and results on a per sample basis' file, `seqs_dfs_and_plots_for_each_set.pkl` to bring the plots back into the session here so image files can be saved or the image dimensions altered.
#
# The next cell unpickles the `seqs_dfs_and_plots_for_each_set.pkl` as described under the 'Serialized processing information and results on a per sample basis' section.
# In[5]:
import pickle
with open("seqs_dfs_and_plots_for_each_set.pkl", "rb") as f:
    seqs_dfs_and_plots_per_sample_set = pickle.load(f)
# The sample set is the first key in this collection. That is `demo` for the demonstration sequence file.
#
# See the sub-section entitled 'Understanding sample names and how serialized information keyed' above for more information on that.
#
# As described under the section 'Serialized processing information and results on a per sample basis', the plots are the last item in the serialized collection. As the last item, it can be accessed using a special shortcut that indicates the last item, `-1`. And so to access the plot collection for the demo set the code is:
# In[6]:
seqs_dfs_and_plots_per_sample_set['demo'][-1]
# The cryptic code, `