These electronic exercises (with solutions) accompany the book chapter "A Gene Ontology Tutorial in Python" by Alex Warwick Vesztrocy and Christophe Dessimoz, to appear in *The Gene Ontology Handbook*, C Dessimoz and N Skunca Eds, Springer Humana.

Version: 1.0.2 (Feb 2019): *Updated QuickGO API calls and usage of GOATOOLS to version 0.8.12*

First, we need to load the GOATools library. This enables us to parse the Gene Ontology (GO) OBO file. For more information on GOATools, see their documentation.

In [1]:

```
# Import the OBO parser from GOATools
from goatools import obo_parser
```

In order to download the GO OBO file, we also require the `wget` and `os` libraries.

In [2]:

```
import wget
import os
```

We can download the GO OBO file into a `'./data'` folder using the following. We are going to download the `go-basic.obo` version of the ontology, which is guaranteed to be acyclic, which means that annotations can be propagated up the graph.
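
As an aside, "propagating annotations up the graph" means that annotating a term implicitly annotates all of its ancestors, and acyclicity guarantees this walk terminates. A minimal, self-contained sketch on a hypothetical miniature hierarchy (these names are illustrative, not real GO terms):

```python
# Hypothetical child -> parents map for a miniature acyclic ontology
parents = {
    'leaf': ['mid_a', 'mid_b'],
    'mid_a': ['root'],
    'mid_b': ['root'],
    'root': [],
}

def propagate_up(term):
    """Return the term plus all of its ancestors (terminates because the graph is acyclic)."""
    result = {term}
    for parent in parents[term]:
        result |= propagate_up(parent)
    return result

print(sorted(propagate_up('leaf')))  # ['leaf', 'mid_a', 'mid_b', 'root']
```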

In [3]:

```
go_obo_url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
data_folder = os.getcwd() + '/data'

# Check if we have the ./data directory already
if not os.path.isfile(data_folder):
    # Emulate mkdir -p (no error if folder exists)
    try:
        os.mkdir(data_folder)
    except OSError as e:
        if e.errno != 17:
            raise e
else:
    raise Exception('Data path (' + data_folder + ') exists as a file. '
                    'Please rename, remove or change the desired location of the data path.')

# Check if the file exists already
if not os.path.isfile(data_folder + '/go-basic.obo'):
    go_obo = wget.download(go_obo_url, data_folder + '/go-basic.obo')
else:
    go_obo = data_folder + '/go-basic.obo'
```

The path to the GO OBO file is now stored in the variable `go_obo`.

In [4]:

```
print(go_obo)
```

/Users/alex/go-handbook-live/data/go-basic.obo

Now we can create a dictionary of the GO terms, using the `obo_parser` from GOATools.

In [5]:

```
go = obo_parser.GODag(go_obo)
```

/Users/alex/go-handbook-live/data/go-basic.obo: fmt(1.2) rel(2019-01-27) 47,381 GO Terms

In [6]:

```
go_id = 'GO:0048527'
go_term = go[go_id]
print(go_term)
```

GO:0048527 level-05 depth-06 lateral root development [biological_process]

In [7]:

```
print('GO term name: {}'.format(go_term.name))
print('GO term namespace: {}'.format(go_term.namespace))
```

GO term name: lateral root development
GO term namespace: biological_process

What are the immediate parent(s) of the term `GO:0048527`?

In [8]:

```
for term in go_term.parents:
    print(term)
```

GO:0048528 level-04 depth-05 post-embryonic root development [biological_process]

What are the immediate children of the term `GO:0048527`?

In [9]:

```
for term in go_term.children:
    print(term)
```

Recursively find all the parent and child terms of the term `GO:0048527`. *Hint*: use your solutions to the previous two questions, with a recursive loop.

In [10]:

```
def transitive_closure(go_term, go):
    go_term_set = set()
    find_parents(go_term, go, go_term_set)
    find_children(go_term, go, go_term_set)
    return go_term_set

def find_parents(term1, go, go_term_set=None, ret=False):
    # Use a fresh set when none is supplied (mutable defaults are shared between calls)
    if go_term_set is None:
        go_term_set = set()
    for term2 in term1.parents:
        go_term_set.update({term2})
        # Recurse on term to find all parents
        find_parents(term2, go, go_term_set)
    if ret:
        return go_term_set

def find_children(term1, go, go_term_set=None, ret=False):
    if go_term_set is None:
        go_term_set = set()
    for term2 in term1.children:
        go_term_set.update({term2})
        # Recurse on term to find all children
        find_children(term2, go, go_term_set)
    if ret:
        return go_term_set
```
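
A general Python caveat worth noting when accumulating results through a default argument, as these helpers do: a mutable default value is created once, at function definition, and then shared between calls. A minimal, self-contained demonstration with hypothetical helper names:

```python
def append_item(item, acc=[]):
    # Pitfall: the default list is created once and shared across calls
    acc.append(item)
    return acc

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the default list persisted!

def append_item_safe(item, acc=None):
    # The usual fix: create a fresh container inside the function
    if acc is None:
        acc = []
    acc.append(item)
    return acc

print(append_item_safe(1))  # [1]
print(append_item_safe(2))  # [2]
```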

Now we can create the transitive closure as a set by calling the following.

In [11]:

```
go_term_set = transitive_closure(go_term, go)
```

We can now print this out using the "pretty print" feature that is inherited from GOATools.

In [12]:

```
for term in go_term_set:
    print(term)
```

An alternative method is by using an inbuilt function of GOATools.

In [13]:

```
rec = go[go_id]
parents = rec.get_all_parents()
children = rec.get_all_children()
```

In [14]:

```
for term in parents.union(children):
print(go[term])
```

How many GO terms have the word “growth” in their name?

To do this, we can loop over every single GO term and check its name.

In [15]:

```
growth_count = 0
for go_term in go.values():
    if 'growth' in go_term.name:
        growth_count += 1

print('Number of GO terms with "growth" in their name: {}'.format(growth_count))
```

Number of GO terms with "growth" in their name: 626

What is the deepest common ancestor term of `GO:0048527` and `GO:0097178`?

First, we can write a function that finds the common GO terms.

In [16]:

```
def common_parent_go_ids(terms, go):
    '''
    This function finds the common ancestors in the GO
    tree of the list of terms in the input.
    '''
    # Find candidates from first
    rec = go[terms[0]]
    candidates = rec.get_all_parents()
    candidates.update({terms[0]})

    # Find intersection with second to nth term
    for term in terms[1:]:
        rec = go[term]
        parents = rec.get_all_parents()
        parents.update({term})

        # Find the intersection with the candidates, and update.
        candidates.intersection_update(parents)
    return candidates
```

Then we can define the deepest common ancestor of two terms, as follows:

In [17]:

```
def deepest_common_ancestor(terms, go):
    '''
    This function gets the nearest common ancestor
    using the above function.
    Only returns single most specific - assumes unique exists.
    '''
    # Take the element at maximum depth.
    return max(common_parent_go_ids(terms, go), key=lambda t: go[t].depth)
```
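
The `max(..., key=...)` step simply picks the deepest, i.e. most specific, of the shared ancestors. A self-contained toy illustration with hypothetical term names and depths (not real GO data):

```python
# Hypothetical depths for a handful of shared ancestors
depths = {'GO:root': 0, 'GO:development': 2, 'GO:root_development': 4}
candidates = {'GO:root', 'GO:development', 'GO:root_development'}

# The deepest candidate is the most specific common ancestor
dca = max(candidates, key=lambda t: depths[t])
print(dca)  # GO:root_development
```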

Then we can find the deepest common ancestor of the two terms in question.

In [18]:

```
go_id1 = 'GO:0097178'
go_id_id1_dca = deepest_common_ancestor([go_id, go_id1], go)
```

In [19]:

```
print('The nearest common ancestor of\n\t{} ({})\nand\n\t{} ({})\nis\n\t{} ({})'
      .format(go_id, go[go_id].name,
              go_id1, go[go_id1].name,
              go_id_id1_dca, go[go_id_id1_dca].name))
```

We first have to create the query to return the record for the GO term `GO:0097192`.

In [20]:

```
go_id2 = 'GO:0097192'
rec = go[go_id2]
```

We then need to decide where to save the image file.

In [21]:

```
lineage_png = go_id2 + '-lineage.png'
```

In [22]:

```
go.draw_lineage([rec], lineage_img=lineage_png)
```

lineage info for terms ['GO:0097192'] written to GO:0097192-lineage.png

*Note*, the GOATools `draw_lineage` function will output a PNG to `./GO_lineage.png`, or whatever is defined by the argument `lineage_img`. This can be viewed with your system viewer, or included here by running something similar to the following code. It may be necessary to double click on the image and scroll, in order to see the detail.

In [23]:

```
from IPython.display import Image
Image(lineage_png)
```

Out[23]:

Using this figure, what is the most specific term that is in the parent terms of both `GO:0097191` and `GO:0038034`? This is also referred to as the lowest common ancestor.

This can be read off the graph: it is the term `GO:0007165`, *signal transduction*.

Using the `get_term()` function, listed below, answer the following questions.

*Note*: the `get_oboxml()` function listed in the chapter, in Source Code 2.1, will no longer work. This is due to an API overhaul of the EMBL-EBI's QuickGO browser.

For the interested reader, it is also possible to use the `bioservices` library, in order to retrieve information from QuickGO (as well as many other web services).

In [24]:

```
from future.standard_library import install_aliases
install_aliases()
from urllib.request import urlopen
import json

def get_term(go_id):
    """
    This function retrieves the definition of a given Gene Ontology term,
    using EMBL-EBI's QuickGO browser.
    Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
    """
    quickgo_url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/" + go_id
    ret = urlopen(quickgo_url)

    # Check the response
    if ret.getcode() == 200:
        term = json.loads(ret.read())
        return term['results'][0]
    else:
        raise ValueError("Couldn't receive information from QuickGO. Check GO ID and try again.")
```

Find the name and description of the GO term `GO:0048527`. *Hint*: print out the dictionary returned by the function and study its structure.

First, let's get the term information.

In [25]:

```
term = get_term(go_id)
```

It might be useful to study the structure of this dictionary, in order to answer this question.

In [26]:

```
term
```

Out[26]:

{'id': 'GO:0048527', 'isObsolete': False, 'name': 'lateral root development', 'definition': {'text': 'The process whose specific outcome is the progression of the lateral root over time, from its formation to the mature structure. A lateral root is one formed from pericycle cells located on the xylem radius of the root, as opposed to the initiation of the main root from the embryo proper.'}, 'children': [{'id': 'GO:1901332', 'relation': 'negatively_regulates'}, {'id': 'GO:1901333', 'relation': 'positively_regulates'}, {'id': 'GO:0010102', 'relation': 'part_of'}, {'id': 'GO:2000023', 'relation': 'regulates'}, {'id': 'GO:1902089', 'relation': 'part_of'}], 'aspect': 'biological_process', 'usage': 'Unrestricted'}

Now we can extract the relevant data.

In [27]:

```
print('Description of {} is:\n\t{}'.format(go_id, term['definition']['text']))
```

Look at the difference in the OBO-XML output for the GO terms `GO:0048527` and `GO:0097178`, then generate a table of the synonymous relationships of the term `GO:0097178`.

Firstly, generate the OBO-XML dictionary for this new term.

In [28]:

```
term1 = get_term(go_id1)
```

We can then extract the synonyms as follows.

In [29]:

```
term1
```

Out[29]:

{'id': 'GO:0097178', 'isObsolete': False, 'name': 'ruffle assembly', 'definition': {'text': 'The aggregation, arrangement and bonding together of a set of components to form a ruffle, a projection at the leading edge of a crawling cell; the protrusions are supported by a microfilament meshwork. The formation of ruffles (also called membrane ruffling) is thought to be controlled by a group of enzymes known as Rho GTPases, specifically RhoA, Rac1 and cdc42.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '12556481'}, {'dbCode': 'URL', 'dbId': 'http:en.wikipedia.org/wiki/Membrane_ruffling'}]}, 'synonyms': [{'name': 'membrane ruffle formation', 'type': 'exact'}, {'name': 'membrane ruffling', 'type': 'exact'}], 'children': [{'id': 'GO:1900029', 'relation': 'positively_regulates'}, {'id': 'GO:1900028', 'relation': 'negatively_regulates'}, {'id': 'GO:1900027', 'relation': 'regulates'}], 'aspect': 'biological_process', 'usage': 'Unrestricted'}

In [30]:

```
term1['synonyms']
```

Out[30]:

[{'name': 'membrane ruffle formation', 'type': 'exact'}, {'name': 'membrane ruffling', 'type': 'exact'}]

In [31]:

```
synonyms = term1['synonyms']
print(synonyms)
```

We can present these in a table, using the `AsciiTable` module in the terminaltables package (which you may need to install).

In [32]:

```
from terminaltables import AsciiTable

# Initialise table data and set header
table_data = [['Type', 'Synonym']]
for synonym in synonyms:
    table_data.append([synonym['type'], synonym['name']])
print(AsciiTable(table_data).table)
```

In this section we will look at how to manipulate the Gene Association File (GAF) standard, using a parser from the BioPython package.

In [33]:

```
import Bio.UniProt.GOA as GOA
```

First, we need to download the GO annotation (GOA) file for *Arabidopsis thaliana* from the EBI FTP server.

In [34]:

```
import os
from ftplib import FTP
arab_uri = '/pub/databases/GO/goa/ARABIDOPSIS/goa_arabidopsis.gaf.gz'
arab_fn = arab_uri.split('/')[-1]

# Check if the file exists already
arab_gaf = os.path.join(data_folder, arab_fn)
if not os.path.isfile(arab_gaf):
    # Login to FTP server
    ebi_ftp = FTP('ftp.ebi.ac.uk')
    ebi_ftp.login()  # Logs in anonymously

    # Download
    with open(arab_gaf, 'wb') as arab_fp:
        ebi_ftp.retrbinary('RETR {}'.format(arab_uri), arab_fp.write)

    # Logout from FTP server
    ebi_ftp.quit()
```

We can then load the GAF file into a dictionary, iterating over it with the parser from BioPython (`Bio.UniProt.GOA.gafiterator`).

In [35]:

```
import gzip

# File is a gunzip file, so we need to open it in this way
with gzip.open(arab_gaf, 'rt') as arab_gaf_fp:
    arab_funcs = {}  # Initialise the dictionary of functions

    # Iterate on each function using Bio.UniProt.GOA library.
    for entry in GOA.gafiterator(arab_gaf_fp):
        uniprot_id = entry.pop('DB_Object_ID')
        arab_funcs[uniprot_id] = entry
```
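
Note that this dictionary keeps only the last entry seen for each UniProt ID, so a protein with several annotations is collapsed to a single one. If every annotation is needed, one option is to collect a list per identifier; a minimal sketch with hypothetical entries of the same shape:

```python
from collections import defaultdict

# Hypothetical parsed GAF entries (only the fields relevant here)
entries = [
    {'DB_Object_ID': 'P12345', 'GO_ID': 'GO:0000001'},
    {'DB_Object_ID': 'P12345', 'GO_ID': 'GO:0000002'},
    {'DB_Object_ID': 'Q67890', 'GO_ID': 'GO:0000003'},
]

all_annots = defaultdict(list)
for entry in entries:
    all_annots[entry.pop('DB_Object_ID')].append(entry)

print(len(all_annots['P12345']))  # 2 -- both annotations are kept
```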

In [36]:

```
print(arab_funcs[list(arab_funcs.keys())[0]])
```

Find the total number of annotations for Arabidopsis thaliana with `NOT` qualifiers. What is this as a percentage of the total number of annotations for this species?

This is possible by looping through each of the values and checking whether `NOT` is listed as one of the qualifiers.

Even though here it doesn't make a difference whether we check `if 'NOT' in func['Qualifier']` or `if 'NOT' in func['Qualifier'][0]`, the first is preferred. This is because there can be multiple qualifiers for a given annotation.
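
A minimal illustration of why the membership test over the whole list is safer (the qualifier values here are hypothetical):

```python
# A GAF Qualifier field is a list and may hold several values
qualifiers = ['colocalizes_with', 'NOT']

print('NOT' in qualifiers)     # True  -- membership test over the whole list
print('NOT' in qualifiers[0])  # False -- substring test on the first entry only
```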

In [37]:

```
not_count = 0
total_count = len(arab_funcs)
for func in arab_funcs.values():
    if 'NOT' in func['Qualifier']:
        not_count += 1

print('Total count of NOT qualifiers: {} ({}%)'.format(not_count, round((float(not_count) / float(total_count)) * 100, 2)))
print('Total number of annotations: {}'.format(total_count))
```

Total count of NOT qualifiers: 802 (3.17%)
Total number of annotations: 25289

How many genes (of *Arabidopsis thaliana*) have the annotation `GO:0048527`?

This is done by checking, for each annotation, whether its `GO_ID` entry is equal to the required term's ID.

Further, a check on the taxonomic ID has been included here. This isn't strictly necessary, but would be required if the annotation database contained multiple species.

In [38]:

```
arab_tax_id = 3702  # This isn't necessary here, but would be with the full data set.
annot_count = 0
counted_gene = []
for uniprot_id in arab_funcs:
    if 'taxon:' + str(arab_tax_id) in arab_funcs[uniprot_id]['Taxon_ID']:
        if arab_funcs[uniprot_id]['GO_ID'] == go_id:
            counted_gene.append(uniprot_id)
            annot_count += 1
del counted_gene
```

We can now find the number of genes:

In [39]:

```
print(annot_count)
```

17

Generate a list of annotated proteins which have the word *"growth"* in their name. This can be done by searching the `DB_Object_Name` field for the keyword.

In [40]:

```
keyword = 'growth'
growth_dict = {x: arab_funcs[x]
               for x in arab_funcs
               if keyword in arab_funcs[x]['DB_Object_Name']}
```

In [41]:

```
print('UniProt IDs of annotations with "growth" in their name:')
for annot in growth_dict:
    print("\t - " + annot)
print("Total: {}".format(len(growth_dict)))
```

There are 21 evidence codes used in the Gene Ontology project. As discussed in Chapter 3, many of these are inferred, either by curators or automatically. Find the counts of each evidence code in the *Arabidopsis thaliana* annotation file.

In [42]:

```
evidence_count = {}
for annotation in arab_funcs:
    evidence = arab_funcs[annotation]['Evidence']
    if evidence not in evidence_count:
        evidence_count[evidence] = 1
    else:
        evidence_count[evidence] += 1
```
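
The same tally can be written more compactly with `collections.Counter`; a sketch using a small hypothetical annotation dictionary with the same shape as `arab_funcs`:

```python
from collections import Counter

# Hypothetical annotations mimicking the structure of arab_funcs
mock_funcs = {
    'P1': {'Evidence': 'IEA'},
    'P2': {'Evidence': 'IDA'},
    'P3': {'Evidence': 'IEA'},
}

evidence_count_alt = Counter(entry['Evidence'] for entry in mock_funcs.values())
print(evidence_count_alt['IEA'])  # 2
```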

The counts are then:

In [43]:

```
table_data = [['Evidence Code', 'Count']]
for code in sorted(evidence_count.keys()):
    table_data.append([code, str(evidence_count[code])])
print(AsciiTable(table_data).table)
```

(All others are zero.)

To help visualise the counts of each evidence code found in the previous question, produce a pie chart showing the proportion of annotations with each evidence code for the *Arabidopsis thaliana* annotations dataset.

In order to draw the pie chart, we need to convert the counts into percentages.

In [44]:

```
evidence_percent = {}
for code in evidence_count:
    evidence_percent[code] = ((float(evidence_count[code]) /
                               float(total_count))
                              * 100)
```

Now we can have a look at the counts and percentages for each evidence code.

In [45]:

```
table_data = [['Evidence Code', 'Count', 'Percentage (%)']]
for code in sorted(evidence_count.keys()):
    table_data.append([code, str(evidence_count[code]), str(round(evidence_percent[code], 2))])
print(AsciiTable(table_data).table)
```

In [46]:

```
# Declare matplotlib as inline and import pyplot
%matplotlib inline
from matplotlib import pyplot

# Setup the figure
fig = pyplot.figure(1, figsize=(8, 8), dpi=96)
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Extract the labels / percentages
labels = evidence_percent.keys()
fracs = evidence_percent.values()

# Make IEA obvious by "exploding" it
explode = [int('IEA' in x) * 0.1 for x in labels]

# Plot the pie chart
patches, texts = ax.pie(list(fracs), explode=explode, startangle=90, labeldistance=1.1)

# Add a legend
ax.legend(patches, labels, bbox_to_anchor=(1.2, 0.75), fontsize=12)
pyplot.show()
```

The `GOEnrichmentStudy()` function from the GOATools library will be used to perform GO enrichment analysis.

In [47]:

```
from goatools.go_enrichment import GOEnrichmentStudy
```

For this analysis, we need the study set of proteins with the *"growth"* keyword from exercise 3.1.c, and the GO structure from exercise 2.1. The population is the functions observed in the *Arabidopsis thaliana* GOA file.

In [48]:

```
pop = arab_funcs.keys()
```

In [49]:

```
assoc = {}
for x in arab_funcs:
    if x not in assoc:
        assoc[x] = set()
    assoc[x].add(str(arab_funcs[x]['GO_ID']))
```
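
An equivalent, slightly more idiomatic construction uses `collections.defaultdict`; a sketch with hypothetical entries shaped like `arab_funcs`:

```python
from collections import defaultdict

# Hypothetical annotations shaped like arab_funcs
mock_funcs = {
    'P1': {'GO_ID': 'GO:0048527'},
    'P2': {'GO_ID': 'GO:0097178'},
}

assoc_alt = defaultdict(set)
for uniprot_id, entry in mock_funcs.items():
    assoc_alt[uniprot_id].add(str(entry['GO_ID']))

print(assoc_alt['P1'])  # {'GO:0048527'}
```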

Now, the study set here is those genes with the *"growth"* keyword, found previously.

In [50]:

```
study = growth_dict.keys()
```

Which GO term is most significantly enriched or depleted? Does this make sense?

Possible methods for the GOEnrichmentStudy are:

In [51]:

```
methods = ["bonferroni", "sidak", "holm", "fdr"]
```

It is possible to run on all methods, or just a subset.

In [52]:

```
g = GOEnrichmentStudy(pop, assoc, go,
                      propagate_counts=True,
                      alpha=0.05,
                      methods=methods)
g_res = g.run_study(study)
```

fisher module not installed. Falling back on scipy.stats.fisher_exact

Propagating term counts to parents ..

100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
3 GO terms found significant (< 0.05=alpha) after multitest correction: local bonferroni
3 GO terms found significant (< 0.05=alpha) after multitest correction: local sidak
3 GO terms found significant (< 0.05=alpha) after multitest correction: local holm

Generate p-value distribution for FDR based on resampling (this might take a while)
Sample 0 / 500: p-value 0.0010278710533060788

fisher module not installed. Falling back on scipy.stats.fisher_exact

Sample 10 / 500: p-value 0.0015414407798976498
Sample 20 / 500: p-value 0.0005140574953738786
...
Sample 490 / 500: p-value 0.0010278710533060788

5 GO terms found significant (< 0.05=alpha) after multitest correction: local fdr

In [53]:

```
g.print_results(g_res, min_ratio=None, pval=0.01)
```

The most significantly enriched term is `GO:0030154` (cell differentiation). It makes sense that this would be associated with proteins having the keyword *"growth"* in their name.

How many terms are enriched, when using the Bonferroni corrected method and a p-value $\le$ 0.01?

Perform the GO Enrichment study using the Bonferroni corrected method.

In [54]:

```
g_bonferroni = GOEnrichmentStudy(pop, assoc, go,
                                 propagate_counts=True,
                                 alpha=0.01,
                                 methods=['bonferroni'])
g_bonferroni_res = g_bonferroni.run_study(study)
```

fisher module not installed. Falling back on scipy.stats.fisher_exact

Propagating term counts to parents ..

100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
3 GO terms found significant (< 0.01=alpha) after multitest correction: local bonferroni

In [55]:

```
s_bonferroni = []
for x in g_bonferroni_res:
    if x.p_bonferroni <= 0.01:
        s_bonferroni.append((x.goterm.id, x.p_bonferroni))
```

In [56]:

```
print(s_bonferroni)
```

How many terms are enriched with false discovery rate (a.k.a. q-value) $\le$ 0.01?

In [57]:

```
g_fdr = GOEnrichmentStudy(pop, assoc, go,
                          propagate_counts=True,
                          alpha=0.05,
                          methods=['fdr'])
g_fdr_res = g_fdr.run_study(study)
```

fisher module not installed. Falling back on scipy.stats.fisher_exact

Propagating term counts to parents ..

100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
fisher module not installed. Falling back on scipy.stats.fisher_exact

Generate p-value distribution for FDR based on resampling (this might take a while)
Sample 0 / 500: p-value 0.011253078458717746
Sample 10 / 500: p-value 0.0005140574953738786
...
Sample 490 / 500: p-value 0.0005140574953738786

5 GO terms found significant (< 0.05=alpha) after multitest correction: local fdr

In [58]:

```
s_fdr = []
for x in g_fdr_res:
    if x.p_fdr <= 0.01:
        s_fdr.append((x.goterm.id, x.p_fdr))
```

In [59]:

```
print(s_fdr)
```

[('GO:0030154', 0.0), ('GO:0048869', 0.0), ('GO:0032502', 0.0), ('GO:2000012', 0.006)]

In this section we look at how to compute semantic similarity between GO terms.

In [60]:

```
from collections import Counter

class TermCounts():
    '''
    TermCounts counts the number of times each GO term
    (and its ancestors) is observed in a set of annotations.
    '''
    def __init__(self, go, annots):
        '''
        Initialise the counters and fill them from the annotations.
        '''
        # Backup
        self._go = go

        # Initialise the counters
        self._counts = Counter()
        self._aspect_counts = Counter()

        # Fill the counters...
        self._count_terms(go, annots)

    def _count_terms(self, go, annots):
        '''
        Fills in the counts and overall aspect counts.
        '''
        for x in annots:
            # Extract term information
            go_id = annots[x]['GO_ID']
            namespace = go[go_id].namespace
            self._counts[go_id] += 1

            # Propagate the count up to every parent term
            rec = go[go_id]
            parents = rec.get_all_parents()
            for p in parents:
                self._counts[p] += 1

            self._aspect_counts[namespace] += 1

    def get_count(self, go_id):
        '''
        Returns the count of that GO term observed in the annotations.
        '''
        return self._counts[go_id]

    def get_total_count(self, aspect):
        '''
        Gets the total count that's been precomputed.
        '''
        return self._aspect_counts[aspect]

    def get_term_freq(self, go_id):
        '''
        Returns the frequency at which a particular GO term has
        been observed in the annotations.
        '''
        try:
            namespace = self._go[go_id].namespace
            freq = float(self.get_count(go_id)) / float(self.get_total_count(namespace))
        except ZeroDivisionError:
            freq = 0
        return freq
```

`GO:0048364` (root development) and `GO:0044707` (single-multicellular organism process) are two GO terms taken from Figure 1. Calculate the semantic similarity between them based on the inverse of the semantic distance (number of branches separating them).

In [61]:

```
def min_branch_length(go_id1, go_id2, go):
    '''
    Finds the minimum branch length between two terms in the GO DAG.
    '''
    # First get the deepest common ancestor
    dca = deepest_common_ancestor([go_id1, go_id2], go)

    # Then get the distance from the DCA to each term
    dca_depth = go[dca].depth
    d1 = go[go_id1].depth - dca_depth
    d2 = go[go_id2].depth - dca_depth

    # Return the total distance - i.e., to the deepest common ancestor and back.
    return d1 + d2
```

In [62]:

```
go_id3 = 'GO:0048364'
go_id4 = 'GO:0044707'
```

Now we can calculate the semantic distance and semantic similarity, as follows:

In [63]:

```
def semantic_distance(go_id1, go_id2, go):
    '''
    Finds the semantic distance (minimum number of connecting branches)
    between two GO terms.
    '''
    return min_branch_length(go_id1, go_id2, go)

def semantic_similarity(go_id1, go_id2, go):
    '''
    Finds the semantic similarity (inverse of the semantic distance)
    between two GO terms.
    '''
    return 1.0 / float(semantic_distance(go_id1, go_id2, go))
```

In [64]:

```
sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))
```

The semantic similarity between terms GO:0048364 and GO:0044707 is 0.2.

Calculate the information content (IC) of the GO term `GO:0048364` (root development), based on the frequency of observation in *Arabidopsis thaliana*.

First we need to define what the information content is.

In [65]:

```
import math

def ic(go_id, termcounts):
    '''
    Calculates the information content of a GO term.
    '''
    # Get the observed frequency of the GO term
    freq = termcounts.get_term_freq(go_id)

    # Calculate the information content (i.e., -log("freq of GO term"))
    return -1.0 * math.log(freq)
```
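
As a quick sanity check of the formula, independent of the annotation data: a term observed in 10% of a namespace's annotations has an IC of -ln(0.1) ≈ 2.30, while a term observed in 0.1% of them has an IC of ≈ 6.91, so rarer terms carry more information.

```python
import math

# IC = -ln(frequency): rarer terms are more informative
print(round(-math.log(0.10), 2))   # 2.3
print(round(-math.log(0.001), 2))  # 6.91
```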

Then we can calculate the information content of the single term, `GO:0048364`.

In [66]:

```
# First get the counts of each GO term.
termcounts = TermCounts(go, arab_funcs)
# Calculate the information content
infocontent = ic(go_id, termcounts)
```

In [67]:

```
print(infocontent)
```

6.703332455042499

Calculate the Resnik similarity measure between the same two terms as in 5.1.a.

In [68]:

```
def resnik_sim(go_id1, go_id2, go, termcounts):
    '''
    Computes Resnik's similarity measure.
    '''
    msca = deepest_common_ancestor([go_id1, go_id2], go)
    return ic(msca, termcounts)
```

Then we can calculate this as follows:

In [69]:

```
sim_r = resnik_sim(go_id3, go_id4, go, termcounts)
```

In [70]:

```
print('Resnik similarity score: {}'.format(sim_r))
```

Resnik similarity score: -0.0
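
The score of -0.0 is expected here: the deepest common ancestor of these two terms is presumably a very general term that, after propagation, is counted for essentially every annotation in its namespace, so its frequency is 1 and its information content is -ln(1) = 0 (with the leading minus sign producing IEEE 754's negative zero). A self-contained check of that last step:

```python
import math

# A frequency of 1.0 yields zero information content
print(-1.0 * math.log(1.0))  # -0.0
```

Note that -0.0 compares equal to 0.0 in Python, so the score is simply zero.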