#!/usr/bin/env python # coding: utf-8 # # Demonstrating `bendIt` running in Jupyter # # The standalone version of `bendIt` is installed and ready to be run in this launched Jupyter environment. (If you are viewing this notebook in static form, you should 'launch' it in active form from [here](https://github.com/fomightez/bendit-binder). # # If you want the basic usage block and an example of using it on the command line, go to [this notebook](basic_bendit_commandline.ipynb). # # This notebook will illustrate a realistic workflow where a number of sequences in a multi-sequence fasta file are analyzed with `bendIt`. This should serve to show the advantage of using the standalone version of bend.it over [the bend.it Server](http://pongor.itk.ppke.hu/dna/bend_it.html#/bendit_form) for processing more than a few sequences, and touch upon some of the benefits of having `bendIt` working in the Jupyter environment. # # ------ # #
#

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.

# #

# Some tips: #

#

#
# # ---- # # ### STEP 1 : UPLOAD SEQUENCES. # # This demonstration scenario will analyze the properties of a series of sequence cassettes flanked by defined sequences. The curvature and bendability predictions for each resulting sequence will be plotted. # Click on and drag a file listing sequences in FASTA format from your computer into the file browser window to the left of this text. # When the file is correctly dragged into the pane, a dashed, gray outline will appear and you can release your mouse button. # # TO **RUN THE DEMO WITH A PROVIDED FASTA FILE: DON'T DRAG ANYTHING IN AND JUST GO AHEAD AND RUN THE CELLS BELOW. IT WILL USE THE FASTA FILE FROM THE bendit test folder ALREADY PRESENT HERE IF YOU DON'T UPLOAD ONE.** # # Change the file extension to `.fa` or `.fasta` (or even "faa", "fas", "fsa" work), if it isn't already. To do that right-click on the file name in the file navigation panel to the left, and select `Rename`. # Also, **remove any spaces in the file name**; an underscore makes a nice replacement. # # You can also drag in more FASTA files and each one will be processed and treated as a separate sample set. # # Run the following code cells to process the sequence(s) to make the plot(s). Change any settings you need to as described. # There are three ways to run a cell if you are not familiar with the JupyterLab interface. # # - You can run the cell by clicking on it and pressing the `run` button, shaped like a triangle heading towards the right, that is on the utility bar above this notebook. # # - Click on the cell to run to select it, and then under `Run` menu above, choose `Run Selected Cells` # # - Click on the cell to run to select it, and type `Shift-Enter`. Which is holding down the shift key wille pressing the enter key. # # ### STEP 2 : CHOSE `bendIt` SETTINGS. # There are several options that can be set for running `bendIt`. They come with inherent defaults that can be seen by running `!bendIt --help` as a cell in this notebook. # # Several of these are set to alternative settings using the following assignments. # # # Edit the text in the cell below to better reflect your choice of window for analysis, if you prefer. # In[ ]: window_size = 3 # In addition to predicting curvature, `bendIt` will also report G+C content, complexity, or a prediction of bendability. Here, we set this to `bendabilty`. To change, edit text between the quotes to `G+Ccontent` or `complexity`. # In[ ]: report_with_curvature = "bendability" # For the analysis a series of sequences defined in the FASTA file will be analyzed in the context of defined flanking sequences. The cell below assigns the sequences that flank the sequence cassette that will be substituted succesively and anlayzed. If instead of this swappable cassette scenario, you already have a multi-sequence fasta file containing the sequences you wish to analyze, you can change the bracket sequences to nothing by changing the settings for the bracket sequenes to the following in the cell below: # # ```python # #if you don't want defined flanking sequences added, use this code: # up_bracket = "" # down_bracket = "" # ``` # In[ ]: up_bracket = "gtaaaacgacggccagcatggaggtacaa" down_bracket = "gggaggtacttccatggtcatagctgtt" # ### STEP 2 : ADJUST SAMPLE SET HANDLING SETTINGS. # The script that will run here will use the file name of the uploaded multi-sequence FASTA file to determine a 'sample set name' to refer to the entire set of sequences and corresponding results. The text in the file name prior to occurence of the `character_to_mark_name_end` that is defined below, default is an underscrore, will be used as the name designation of the sample set. If you want to change the delimiter, edit `character_to_mark_name_end` in the pertient cell appearing in this section below. # If you'd like to override that process altogether and designate a specific name yourself, then edit the following cell so that `sample_set_name_extract_auto` is instead assigned to `False`. And then add a line below it where you assign `sample_set_name` to the name you want to specify, like the following where you'd replace `Name_here` with text to actually use: # # ```python # sample_set_name_extract_auto = False # sample_set_name = "Name_here" # ``` # # (Because input data provided by users wishing to use standalone bendIt version had slightly deviated from best practices of data handling by mixing the overarching label for the sample set in with the entries in the multi-FASTA file, the script will check if the label for the sample set has been placed as the first line above the listing of sequence entries and use that as the sample set name, in that case. And, so that is an alternative way to provide the sample set name.) # In[ ]: sample_set_name_extract_auto = True # In[ ]: character_to_mark_set_name_end = "_" # By default, the first 'word' in the description line of each cassette sequence will be used as the individual sample name. If you'd rather specify a different delimiter for the individual sample names, then change on the following line what is in the quotes to alter the `character_to_mark_individual_sample_name_end` setting. # In[ ]: character_to_mark_individual_sample_name_end = " " # In order to accommodate a user's request to make the resulting plots look similar to what is seen with Excel using the setting to graph 'Smooth Lines', the plots are 'smoothed'. If you don't want that, set `smooth_plot_curves` to `False` on the line below. # In[ ]: smooth_plot_curves = True # In order, to accommodate a user's request to allow a complex sample naming scheme in the description line, there is a sanitization step early on in order to avoid issues in processing files with `/`,`|`, and parantheses in bendIt, and then later the script tries to substitute back in the offending characters in the plot title, relying on the pattern. Every effort is made to make it restrict to match the presumed user pattern; however, set `show_date_with_slashes_in_plot_title` to `False` on the line below if you are seeing dash characters to show up as forward slashes in your plot title or if your underscores are showing as `|` or if `+` are being converted to parantheses. Likewise, if you are using the offending characters as part of the text that becomes the sample names, you can edit the script conditional code block that begins `if show_date_with_slashes_in_plot_title` to reverse the sanitizing step for the plot title by following the built-in example as a guide. # In[ ]: show_date_with_slashes_in_plot_title = True # The pipeline here generates plots of the data using Python. This is meant to make plots with modern features and aesthetics from each analyzed sequence without the need for further post-processing. Those familiar with using the bendit server may know that you can choose to have the server site output 'raw' gnuplot-generated plots of the data. The standalone version running here generates those 'raw gnuplots' by default. This setting below keeps the 'raw gnuplot' output as part of the collected output. However, once you establish the pipeline results in the same plots as the bendit server, you may wish to not collect these plots as part of the archived output in order to keep file sizes smaller. You can then change the following cell to read: # # ```python # include_gnuplots = False # ``` # # It is purposefully set to `True` below by default to encourage you to verify your data indeed gives the same results before and after any post-processing in this pipleine, and for direct comparison to the bendit server produced plots. # In[ ]: include_gnuplots = True # If you are processing a lot of sequences, you'll find with these default settings you generate a large output archive. For example, normally processing thirty 75 nt sequences results in archive of 5 Mb. Processing a hundred 75 nt sequences results in an archive of 17 Mb normally. There are two 'lightweight' options to help ease issues as you scale up squences analyzed: `lightweight_archive` setting and `lightweight_with_images` setting . You can set the `lightweight_archive` to `True` in other to make the output archive much smaller. As a gauge of the degree of space that can be saved, the archive size of the examples drops to 2 Mb and 6.9 Mb, respectively. # In[ ]: lightweight_archive = False # It is set to `False` by default. If you are **finding you are generating archives that are hard to deal with you'll want this set to `True`**. In addition, to leaving out the 'raw' gnuplots (regardless of the `include_gnuplots` setting), it doesn't save a notebook with each plot for convenient review and doesn't save every plot as images. Both of those typical by-products can still be made from the contents of the archive. The idea being that with a lot of sequences you may wish to save this notebook and the streamlined archive and then generate plot image files for only the bendIt results that are really of interest. The LOG file will only contain the end summary with the `lightweight_archive` setting. # # In addition, to the `lightweight_archive` setting there is a related `lightweight_with_images` setting. This setting acts as the `lightweight_archive` setting, with the addition of making the plot images as the individual sequences are processed. The plot **images are still not included in the archive**. # In[ ]: lightweight_with_images = False # If `lightweight_with_image` is set to true, it takes priority over whatever `lightweight_archive` is set to, so that you only have to worry about setting one or none of the 'lighweight' options. # # Set `cleaning_step = False` in the cell below to skip the 'clean up' step at the end. The 'clean up' step is a final step after the archive file is made that deletes all the individual files that had been made in the process of the run and are now cluttering the working directory. This setting is intended for use in conjunction with either 'lightweight' setting because if you've run a lot of sequences, you may be willing to forego the convenience of being able to find the archive among the long list of files in favor of more quickly getting to the moment you can download the archive file. # In[ ]: cleaning_step = True # Now with the sequences uploaded and the settings assigned, you are ready to begin the analysis by `bendIt`. # # ### STEP 3. ANALYZE THE SEQUENCES WITH `bendIt`. # # To start analyzing the sequences, run the next cell. This will take some time to run; however, feedback will be provided at several steps. When completed, results will be shown and below you'll be given options to collect the produced data. # # (For those interested in making a custom workflow that uses bendIt, you'll want to examine the script to try to adapt it to your needs.) # In[ ]: get_ipython().run_line_magic('run', '-i bendIt_analysis.py') # In[ ]: ls -lh # **CLEANING TAKES TIME. WAIT UNTIL CONTENTS OF DIRECTORY ARE LISTED ABOVE TO MOVE ON.** # I put a cell listing of the contents of the directory in the cell above because cleaning up takes a while and otherwise seems like ready to go to Step #4 when actually not ready. # ### STEP 4. REVIEW THE RESULTS AND COLLECT THE ARCHIVE OF THE RUN. # # If everything looks like it ran well above, and the listing of the files in the directory in the cell just above is showing. Then you should be almost set. Although many files are produced in the course of running the previous step, in the end there should just be one archive file resulting. The name if it will look something like the following with the `` text in the time and date stamp replaced by an abbreviation reflecting the current month: # # ``` # bendit_analysis1420201914.tar.gz # ``` # # **You should not need to save this particular notebook. EVERYTHING NEEDED SHOULD BE IN THE ARCHIVE.** # # A notebook ready to review the plots you see generated above has been saved as part of the packaged archive along with a log file of the run. Additionally, separate images of the plots are saved in the packaged archive and both Python-based and text forms of the data produced in the analysis. # # **Save the archive to your local computer.** # # Save the archive to your own computer by right-clicking on it in the file navigation panel on the right, and selecting `Download`. After a pause, a typical file download window should show up allowing you to save the file to your own computer. # # There is **a big caveat** covering the statements above about the archive containing all that you need. If you set either the `lightweight_archive` or `lightweight_with_images` setting to `True`, then a separate notebook for reviewing the plots won't be saved as part of the archive, and no image files will be saved as well. As the separate notebook meant for reviewing the plots won't be stored in the archive, you may wish to save and download this notebook to your local computer. All the plots should be visible under Step #3 above, and you could use this notebook for review & deciding which plots to convert to image files later. I'll point out that if you are sharing the output of the plots with someone who isn't familiar with Jupyter notebooks, you may wish to use `File` > `Export Notebook As..` > `Export Notebook to PDF` to also generate a PDF and then download that file as well. *More concisely, I suggest if you used either 'lightweight' settings, you also save and download this notebook to accompany the archive from the run, and optionally do the same for a PDF form of this notebook.* # # Please continue on to [this notebook](Accessing_and_reviewing_the_bendit_results.ipynb) if you like some guidance accessing the files and data inside this archive. The accompanying notebook is entitled 'Accessing_and_reviewing_the_bendit_results.ipynb', and is available by clicking [here](Accessing_and_reviewing_the_bendit_results.ipynb). It outlines the various files produced, accessing the files, viewing the raw data, and how the data can be viewed with Python. Further, it illustrates how the produced plots can be adjusted for use elsewhere using Python. If you used either 'lightweight' settings, you'll want to see the second half of that guide in order to access content not included in the streamlined archive. # # ----- # # ## More to analyze? # Before you thinking about running more, you'll want to bring any generated results from this remote session to your own computer. **This session will go stale without any activity in 10 minutes**, and so this is a **very important step if you don't want to run things again**. # # ### STEP 5. RUN THROUGH AGAIN? # # Not quite what you needed? Or meed more? # # If things worked well otherwise, you should be all ready to run again or run some different sample sets. # # If there was error, delete any addition FASTA files made, verify your sequences, and try running again. To easily remove FASTA files, you can use the following; however, it will also remove any you have uploaded # # ```python # !rm -rf *.fa # ``` # # Otherwise, see the notebook guiding accessing and using the results made here by clicking [here](Accessing_and_reviewing_the_bendit_results.ipynb). # # # -----