In this exercise we will explore the RNAcentral APIs to find some information about the most differentially expressed RNAs in an experiment.
The data for this exercise comes from this experiment: https://www.ebi.ac.uk/ena/data/view/PRJNA601750 a single-end RNA-seq dataset based on Drosophila melanogaster S2 cells sequenced under normal conditions and amino-acid starvation.
The data has been analysed using support protocol 3 detailed in our 2020 Current Protocols in Bioinformatics paper, which you can read here: https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpbi.104
The output is a table with URS_taxid (a unique RNA identifier) and some data about the differential expression of the RNA in this experiment. For the purposes of this webinar, we have prepared just the output, which we will analyse here. NOte - it has been slightly reduced in size to allow us to get things done in a reasonable amount of time.
The first step is to download the analysis output, which you can get from our FTP site
## Get all data we will need from RNAcentral FTP
!wget -O deseq2_results.csv https://ftp.ebi.ac.uk/pub/databases/RNAcentral/outreach/20231120-webinar/deseq2_results.csv
Next, we will import some useful python libraries
pandas
: The go-to dataframe library in python.requests
: To be able to make requests to the RNAcentral APIratelimit
: The RNAcentral API has a limit of 20 requests per second, we will try to respect thattqdm
: Gives us some nice progressbars and other feedback on progressif 'google.colab' in str(get_ipython()):
import pip
pip.main(['install', 'ratelimit'])
import pandas as pd
pd.set_option('display.width', None)
import requests
from ratelimit import limits
from tqdm.autonotebook import tqdm
tqdm.pandas()
de_data = pd.read_csv("deseq2_results.csv")
Now we have the data, we can filter out only those RNAs with significant differential expression, by looking at pvalue < 0.01
. The standard threshold for significance is 0.05
, but to cut down what we need to evaluate, we will set the bound tighter. We will visualise the dataframe again to get a feel for what's going on.
de_data = de_data[de_data["pvalue"] < 0.01]
with pd.option_context('display.max_colwidth', 20):
display(de_data)
You may recognise this as the output of the DESeq2 program.
So we have 36 RNAs to look at. Lets see what type of RNAs they are by querying the RNAcentral API. We will write a small python function to process the output of the API and append columns to our DESeq2 output.
Note that we try to respect the API rate limits, and we get a brand new dataframe with this data in it.
@limits(calls=20, period=1)
def get_rna_data(rnacentral_id):
url = f"https://rnacentral.org/api/v1/rna/{rnacentral_id}"
response = requests.get(url)
return response.json()
## Query the API to get what RNAcentral knows about these RNAs
r = de_data["urs_taxid"].progress_map(get_rna_data)
## Convert the results to a DataFrame
rnacentral_data = pd.DataFrame(list(r))
That should take less than a minute hopefully.
Let's visualise the data here in the notebook. Note - you should be able to scroll sideways to look at more columns
with pd.option_context('display.max_colwidth', 20):
display(rnacentral_data)
We get a lot more data than just the type of the RNA back, including:
From this, we can see that many of the most differentially expressed ncRNAs are lncRNAs. Lets quantify that with a little aggregation:
rnacentral_data.groupby("rna_type")['rna_type'].count()
So from this we can see most of the differentially expressed RNAs are indeed lncRNA, but there are a couple of interesting things - for example one mature miRNA is defferentially expressed. Let's have a look at that one:
miRNA = rnacentral_data.loc[rnacentral_data["rna_type"] == "miRNA"]
with pd.option_context('display.max_colwidth', None):
display(miRNA)
This gives us the genes we would need to search for if we wanted to find more information, e.g. dme-mir-4968
. Let's see what else we can get from RNAcentral's APIs.
We can use the LitScan API to search for relevant publications about this RNA:
@limits(calls=20, period=1)
def get_publications(genes):
format_gene_query = ['job_id:"{}"'.format(gene) for gene in genes]
url = f"https://www.ebi.ac.uk/ebisearch/ws/rest/rnacentral-litscan?query=entry_type:Publication AND ({' OR '.join(format_gene_query)})&fields=title,pmcid&format=json"
response = requests.get(url)
r = response.json()
titles = [hit['fields']['title'][0] for hit in r['entries']]
pmcids = [hit['fields']['pmcid'][0] for hit in r['entries']]
links = [f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/" for pmcid in pmcids]
publication_data = pd.DataFrame({"titles": titles, "pmcids": pmcids, "links": links})
return publication_data
## We can get away with this because we only have one RNA in our selection -
# think how it would need to change if we wanted to run across a whole dataset
publication_data = get_publications(set(list(miRNA['genes'])[0]))
with pd.option_context('display.max_colwidth', None):
display(publication_data)
So this RNA has been found in 4 publications, the titles of which you can see here, and you should be able to go and get the full text at these links.
Hopefully this gives you an idea how to get started using the RNAcentral APIs for your own analyses. If you want to play a little more with this data, you could try getting all the papers for other RNA types identified in this analysis, or filtering to only look at those RNAs with no papers about them, or try some of the other techniques we've talked about in this webinar!