bendIt running in Jupyter
The standalone version of bendIt is installed and ready to be run in this launched Jupyter environment. (If you are viewing this notebook in static form, you should 'launch' it in active form from here.)
If you want the basic usage block and an example of using it on the command line, go to this notebook.
This notebook illustrates a realistic workflow in which a number of sequences in a multi-sequence FASTA file are analyzed with bendIt. This should show the advantage of using the standalone version of bend.it over the bend.it Server for processing more than a few sequences, and touch upon some of the benefits of having bendIt working in the Jupyter environment.
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
This demonstration scenario will analyze the properties of a series of sequence cassettes flanked by defined sequences. The curvature and bendability predictions for each resulting sequence will be plotted.
Click on and drag a file listing sequences in FASTA format from your computer into the file browser window to the left of this text.
When the file is correctly dragged into the pane, a dashed, gray outline will appear and you can release your mouse button.
TO RUN THE DEMO WITH A PROVIDED FASTA FILE: DON'T DRAG ANYTHING IN AND JUST GO AHEAD AND RUN THE CELLS BELOW. IT WILL USE THE FASTA FILE FROM THE bendit test folder ALREADY PRESENT HERE IF YOU DON'T UPLOAD ONE.
Change the file extension to .fa or .fasta (even "faa", "fas", or "fsa" will work), if it isn't already. To do that, right-click on the file name in the file navigation panel to the left, and select Rename.
Also, remove any spaces in the file name; an underscore makes a nice replacement.
You can also drag in more FASTA files and each one will be processed and treated as a separate sample set.
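The naming advice above can be checked before uploading. What follows is only a minimal sketch and not part of the bendIt pipeline: the helper name `suggest_fasta_name` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions the text above says are recognized.
# Matching them case-insensitively is an assumption of this sketch.
FASTA_EXTENSIONS = {".fa", ".fasta", ".faa", ".fas", ".fsa"}


def suggest_fasta_name(file_name):
    """Suggest a name with spaces replaced by underscores and a FASTA extension."""
    path = Path(file_name.replace(" ", "_"))
    if path.suffix.lower() not in FASTA_EXTENSIONS:
        # Swap (or add) the extension so the file is recognized as FASTA.
        path = path.with_suffix(".fa")
    return path.name


print(suggest_fasta_name("my cassettes.txt"))
```

A name that already follows the advice, such as `set1_run.fasta`, comes back unchanged.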
Run the following code cells to process the sequence(s) to make the plot(s). Change any settings you need to as described.
There are three ways to run a cell if you are not familiar with the JupyterLab interface.
You can run the cell by clicking on it and pressing the run button, shaped like a triangle pointing to the right, on the utility bar above this notebook.
Click on the cell to select it, and then under the Run menu above, choose Run Selected Cells.
Click on the cell to select it, and type Shift-Enter, that is, hold down the shift key while pressing the enter key.
bendIt SETTINGS.
There are several options that can be set for running bendIt. They come with inherent defaults that can be seen by running !bendIt --help as a cell in this notebook.
Several of these are set to alternative settings using the following assignments.
Edit the text in the cell below to better reflect your choice of window for analysis, if you prefer.
window_size = 3
In addition to predicting curvature, bendIt will also report G+C content, complexity, or a prediction of bendability. Here, we set this to bendability. To change it, edit the text between the quotes to G+Ccontent or complexity.
report_with_curvature = "bendability"
For the analysis, a series of sequences defined in the FASTA file will be analyzed in the context of defined flanking sequences. The cell below assigns the sequences that flank the sequence cassette that will be substituted successively and analyzed. If, instead of this swappable cassette scenario, you already have a multi-sequence FASTA file containing the sequences you wish to analyze, you can set the bracket sequences to nothing by changing the settings for the bracket sequences in the cell below to the following:
#if you don't want defined flanking sequences added, use this code:
up_bracket = ""
down_bracket = ""
up_bracket = "gtaaaacgacggccagcatggaggtacaa"
down_bracket = "gggaggtacttccatggtcatagctgtt"
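As an illustration of what the flanking settings mean, here is a minimal sketch of placing one cassette between the two bracket sequences. The helper `add_brackets` and the cassette `ATGCATGC` are made up for this example. It is assumed that the flanks are joined directly to the cassette with no linker, and the actual script may build the sequence differently.

```python
def add_brackets(cassette, up_bracket="", down_bracket=""):
    """Return the cassette flanked by the upstream and downstream bracket sequences."""
    return up_bracket + cassette + down_bracket


up_bracket = "gtaaaacgacggccagcatggaggtacaa"
down_bracket = "gggaggtacttccatggtcatagctgtt"
example_cassette = "ATGCATGC"  # hypothetical cassette, for illustration only

full_sequence = add_brackets(example_cassette, up_bracket, down_bracket)
print(full_sequence)

# With empty brackets the cassette is analyzed on its own.
print(add_brackets(example_cassette))
```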
The script that will run here will use the file name of the uploaded multi-sequence FASTA file to determine a 'sample set name' to refer to the entire set of sequences and corresponding results. The text in the file name prior to the occurrence of the character_to_mark_set_name_end that is defined below (the default is an underscore) will be used as the name designation of the sample set. If you want to change the delimiter, edit character_to_mark_set_name_end in the pertinent cell appearing in this section below.
If you'd like to override that process altogether and designate a specific name yourself, then edit the following cell so that sample_set_name_extract_auto is instead assigned False. Then add a line below it where you assign sample_set_name the name you want to specify, like the following, where you'd replace Name_here with the text to actually use:
sample_set_name_extract_auto = False
sample_set_name = "Name_here"
(Because input data provided by users wishing to use the standalone bendIt version had deviated slightly from best practices of data handling by mixing the overarching label for the sample set in with the entries in the multi-FASTA file, the script will check whether the label for the sample set has been placed as the first line above the listing of sequence entries and, in that case, use it as the sample set name. That is an alternative way to provide the sample set name.)
sample_set_name_extract_auto = True
character_to_mark_set_name_end = "_"
By default, the first 'word' in the description line of each cassette sequence will be used as the individual sample name. If you'd rather specify a different delimiter for the individual sample names, then change what is in the quotes on the following line to alter the character_to_mark_individual_sample_name_end setting.
character_to_mark_individual_sample_name_end = " "
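The two naming rules described in this section can be pictured with a short sketch. The function names are hypothetical, and dropping the file extension before splitting is an assumption. The script itself may handle edge cases differently, such as a set label placed on the first line of the file.

```python
from pathlib import Path


def sample_set_name_from_file(file_name, delimiter="_"):
    """Text of the file name (extension dropped) before the first delimiter."""
    return Path(file_name).stem.split(delimiter, 1)[0]


def sample_name_from_description(description_line, delimiter=" "):
    """First 'word' of a FASTA description line, without the leading '>'."""
    return description_line.lstrip(">").split(delimiter, 1)[0]


# Hypothetical file name and description line, for illustration only.
print(sample_set_name_from_file("setA_cassettes.fa"))
print(sample_name_from_description(">seq1 first cassette"))
```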
To accommodate a user's request to make the resulting plots look similar to what is seen in Excel with the 'Smooth Lines' graph setting, the plots are 'smoothed'. If you don't want that, set smooth_plot_curves to False on the line below.
smooth_plot_curves = True
To accommodate a user's request to allow a complex sample naming scheme in the description line, there is a sanitization step early on to avoid issues in processing files with /, |, and parentheses in bendIt, and then later the script tries to substitute the offending characters back into the plot title, relying on the pattern. Every effort is made to restrict the substitution to the presumed user pattern; however, set show_date_with_slashes_in_plot_title to False on the line below if you are seeing dash characters show up as forward slashes in your plot title, if your underscores are showing as |, or if + characters are being converted to parentheses. Likewise, if you are using the offending characters as part of the text that becomes the sample names, you can edit the script's conditional code block that begins if show_date_with_slashes_in_plot_title to reverse the sanitizing step for the plot title by following the built-in example as a guide.
show_date_with_slashes_in_plot_title = True
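To see why the reverse substitution can misfire, consider a minimal sketch of a sanitizing step. The character mapping here is inferred from the symptoms described above (slashes become dashes, | becomes an underscore, parentheses become +). The name `sanitize` is hypothetical and this is not the script's actual code.

```python
# Assumed mapping, inferred from the symptoms described in the text above.
SANITIZE_TABLE = str.maketrans({"/": "-", "|": "_", "(": "+", ")": "+"})


def sanitize(name):
    """Replace characters that cause trouble in file names with safer ones."""
    return name.translate(SANITIZE_TABLE)


# Hypothetical sample name containing a date with slashes.
print(sanitize("4/14/2020|runA(rep1)"))

# Two different original names collapse to the same sanitized text, so the
# original cannot be recovered without relying on an expected pattern.
print(sanitize("a/b") == sanitize("a-b"))
```

Because the mapping is not one-to-one, any step that restores the original characters has to guess from the pattern, which is why the setting exists to turn that restoration off.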
The pipeline here generates plots of the data using Python. This is meant to make plots with modern features and aesthetics from each analyzed sequence without the need for further post-processing. Those familiar with using the bendit server may know that you can choose to have the server site output 'raw' gnuplot-generated plots of the data. The standalone version running here generates those 'raw gnuplots' by default. This setting below keeps the 'raw gnuplot' output as part of the collected output. However, once you establish the pipeline results in the same plots as the bendit server, you may wish to not collect these plots as part of the archived output in order to keep file sizes smaller. You can then change the following cell to read:
include_gnuplots = False
It is purposefully set to True below by default to encourage you to verify that your data indeed gives the same results before and after any post-processing in this pipeline, and for direct comparison to the plots produced by the bendit server.
include_gnuplots = True
If you are processing a lot of sequences, you'll find that these default settings generate a large output archive. For example, processing thirty 75 nt sequences normally results in an archive of 5 MB, and processing a hundred 75 nt sequences normally results in an archive of 17 MB. There are two 'lightweight' options to help ease issues as you scale up the number of sequences analyzed: the lightweight_archive setting and the lightweight_with_images setting. You can set lightweight_archive to True in order to make the output archive much smaller. As a gauge of the space that can be saved, the archive size of the examples drops to 2 MB and 6.9 MB, respectively.
lightweight_archive = False
It is set to False by default. If you find you are generating archives that are hard to deal with, you'll want this set to True. In addition to leaving out the 'raw' gnuplots (regardless of the include_gnuplots setting), it doesn't save a notebook with each plot for convenient review and doesn't save every plot as an image. Both of those typical by-products can still be made from the contents of the archive. The idea is that with a lot of sequences you may wish to save this notebook and the streamlined archive, and then generate plot image files for only the bendIt results that are really of interest. With the lightweight_archive setting, the LOG file will only contain the end summary.
In addition to the lightweight_archive setting, there is a related lightweight_with_images setting. This setting acts like the lightweight_archive setting, with the addition of making the plot images as the individual sequences are processed. The plot images are still not included in the archive.
lightweight_with_images = False
If lightweight_with_images is set to True, it takes priority over whatever lightweight_archive is set to, so that you only have to worry about setting one or none of the 'lightweight' options.
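How the three output settings interact can be summarized in a small sketch. The function `resolve_output_options` and the keys of the returned dictionary are hypothetical. They restate the behavior described in the text and are not the script's internals. The entry for the log under lightweight_with_images assumes it behaves as it does under lightweight_archive.

```python
def resolve_output_options(include_gnuplots=True,
                           lightweight_archive=False,
                           lightweight_with_images=False):
    """Summarize what a run would produce for a given combination of settings."""
    # Either 'lightweight' option streamlines the archive.
    lightweight = lightweight_archive or lightweight_with_images
    return {
        # 'Raw' gnuplots are left out of a lightweight archive regardless
        # of the include_gnuplots setting.
        "gnuplots_in_archive": include_gnuplots and not lightweight,
        "review_notebook_in_archive": not lightweight,
        "plot_images_in_archive": not lightweight,
        # lightweight_with_images still makes images during the run,
        # even though they are not stored in the archive.
        "plot_images_made_during_run": (not lightweight) or lightweight_with_images,
        "full_log": not lightweight,  # otherwise only the end summary
    }


print(resolve_output_options(lightweight_with_images=True))
```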
Set cleaning_step = False in the cell below to skip the 'clean up' step at the end. The 'clean up' step is a final step, after the archive file is made, that deletes all the individual files that were made in the process of the run and are now cluttering the working directory. This setting is intended for use in conjunction with either 'lightweight' setting, because if you've run a lot of sequences, you may be willing to forego the convenience of easily finding the archive among the long list of files in favor of getting more quickly to the moment you can download the archive file.
cleaning_step = True
Now, with the sequences uploaded and the settings assigned, you are ready to begin the analysis by bendIt.
bendIt.
To start analyzing the sequences, run the next cell. This will take some time to run; however, feedback will be provided at several steps. When completed, results will be shown, and below you'll be given options to collect the produced data.
(For those interested in making a custom workflow that uses bendIt, you'll want to examine the script to try to adapt it to your needs.)
%run -i bendIt_analysis.py
ls -lh
CLEANING TAKES TIME. WAIT UNTIL CONTENTS OF DIRECTORY ARE LISTED ABOVE TO MOVE ON.
A listing of the contents of the directory is placed in the cell above because cleaning up takes a while, and otherwise it can seem as though you are ready to go on to Step #4 when you actually are not.
If everything looks like it ran well above, and the listing of the files in the directory is showing in the cell just above, then you should be almost set. Although many files are produced in the course of running the previous step, in the end there should be just one resulting archive file. Its name will look something like the following, with the <MONTH> text in the time and date stamp replaced by an abbreviation reflecting the current month:
bendit_analysis<MONTH>1420201914.tar.gz
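If the directory listing is long, the archive can also be picked out by its name pattern. This is a minimal sketch, assuming the archive always starts with `bendit_analysis` and ends with `.tar.gz` as in the example above. The helper name `pick_archives` is hypothetical.

```python
import fnmatch
import os

# Pattern based on the example archive name shown above.
ARCHIVE_PATTERN = "bendit_analysis*.tar.gz"


def pick_archives(file_names):
    """Return the file names that look like bendIt analysis archives."""
    return sorted(fnmatch.filter(file_names, ARCHIVE_PATTERN))


# List any matching archives in the current working directory.
print(pick_archives(os.listdir(".")))
```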
You should not need to save this particular notebook. EVERYTHING NEEDED SHOULD BE IN THE ARCHIVE.
A notebook ready to review the plots you see generated above has been saved as part of the packaged archive along with a log file of the run. Additionally, separate images of the plots are saved in the packaged archive and both Python-based and text forms of the data produced in the analysis.
Save the archive to your local computer.
Save the archive to your own computer by right-clicking on it in the file navigation panel on the left, and selecting Download. After a pause, a typical file download window should show up, allowing you to save the file to your own computer.
There is a big caveat covering the statements above about the archive containing all that you need. If you set either the lightweight_archive or lightweight_with_images setting to True, then a separate notebook for reviewing the plots won't be saved as part of the archive, and no image files will be saved either. As the separate notebook meant for reviewing the plots won't be stored in the archive, you may wish to save and download this notebook to your local computer. All the plots should be visible under Step #3 above, and you could use this notebook for review and for deciding which plots to convert to image files later. Note that if you are sharing the output of the plots with someone who isn't familiar with Jupyter notebooks, you may wish to use File > Export Notebook As.. > Export Notebook to PDF to also generate a PDF, and then download that file as well. More concisely: if you used either 'lightweight' setting, also save and download this notebook to accompany the archive from the run, and optionally do the same for a PDF form of this notebook.
Please continue on to this notebook if you would like some guidance accessing the files and data inside this archive. The accompanying notebook is entitled 'Accessing_and_reviewing_the_bendit_results.ipynb', and is available by clicking here. It outlines the various files produced, how to access them, how to view the raw data, and how the data can be viewed with Python. Further, it illustrates how the produced plots can be adjusted for use elsewhere using Python. If you used either 'lightweight' setting, you'll want to see the second half of that guide in order to access content not included in the streamlined archive.
Before you think about running more, you'll want to bring any generated results from this remote session to your own computer. This session will go stale after 10 minutes without any activity, so this is a very important step if you don't want to run things again.
Not quite what you needed? Or need more?
If things worked well otherwise, you should be all ready to run again or run some different sample sets.
If there was an error, delete any additional FASTA files made, verify your sequences, and try running again. To easily remove FASTA files, you can use the following; however, it will also remove any you have uploaded:
!rm -rf *.fa
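Because the command above also removes uploaded files, a more selective clean-up may be preferable. The following is only a sketch. The helper `remove_fasta_files` and its `keep` and `dry_run` parameters are hypothetical, and like the command above it only considers files ending in `.fa`.

```python
from pathlib import Path


def remove_fasta_files(directory=".", keep=(), dry_run=True):
    """Remove *.fa files in the directory except those named in keep.

    With dry_run=True nothing is deleted; the names that would be
    removed are only reported.
    """
    removed = []
    for path in sorted(Path(directory).glob("*.fa")):
        if path.name in keep:
            continue  # leave your own uploaded files in place
        if not dry_run:
            path.unlink()
        removed.append(path.name)
    return removed


# First check what would be removed, keeping a hypothetical uploaded file.
print(remove_fasta_files(keep=("my_uploaded_set.fa",)))
```

Once the reported list looks right, calling the function again with `dry_run=False` performs the deletion.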
Otherwise, see the notebook guiding accessing and using the results made here by clicking here.