Different ontologies exhibit different degrees of latticeyness, that is, how often a term has more than one parent. Highly latticed ontologies will have a combinatorial explosion of paths to a root node.
This notebook analyzes path counts for the Human Phenotype Ontology (HPO).
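To make the combinatorial explosion concrete, here is a minimal sketch on toy data that has nothing to do with the HPO itself. It builds a chain of "diamonds" (a node with two children that share a single child) and counts root-to-leaf paths by brute force. The helper names `stacked_diamonds` and `enumerate_paths` are hypothetical and exist only for this illustration.

```python
# Toy "stacked diamonds" DAG: each level offers two parallel routes,
# so the number of root-to-leaf paths doubles with every level.
def stacked_diamonds(levels):
    """Return a node -> children dict for `levels` chained diamonds (toy data)."""
    children = {}
    for i in range(levels):
        top, bottom = f"n{i}", f"n{i + 1}"
        left, right = f"l{i}", f"r{i}"
        children.setdefault(top, []).extend([left, right])
        children[left] = [bottom]
        children[right] = [bottom]
    children.setdefault(f"n{levels}", [])  # the final leaf has no children
    return children


def enumerate_paths(children, source, target, prefix=()):
    """Yield every path from source to target by depth-first search."""
    prefix = prefix + (source,)
    if source == target:
        yield list(prefix)
        return
    for child in children[source]:
        yield from enumerate_paths(children, child, target, prefix)


for levels in (1, 2, 5, 10):
    dag = stacked_diamonds(levels)
    n_paths = sum(1 for _ in enumerate_paths(dag, "n0", f"n{levels}"))
    print(levels, "levels:", len(dag), "nodes,", n_paths, "paths")
```

The node count grows linearly (3 per level plus one) while the path count doubles per level, so 10 levels with 31 nodes already give 1024 paths.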
## We use a Factory object in the ontobio library
from ontobio import OntologyFactory
## Get the HPO using default method (currently OntoBee SPARQL)
## This may take 5-10s the first time you run it; afterwards it is cached
ofa = OntologyFactory()
ont = ofa.create('hp')
## The OWL version of HPO (used here) has many interesting relationship types;
## for now we just care about is-a (subClassOf between named classes)
ont = ont.subontology(relations='subClassOf')
## Get the root of the abnormality subset
[root] = ont.search('Phenotypic abnormality')
root
'HP:0000118'
## Arbitrary term
[t] = ont.search('Clinodactyly of the 3rd finger')
t
'HP:0040024'
## We use the standard python networkx library for pathfinding here
## This is easily extracted from an ontology object
import networkx as nx
G = ont.get_graph()
G
<networkx.classes.multidigraph.MultiDiGraph at 0x10d7292b0>
## number of paths
## (for the mapping of networkx to an ontology, source is root, and descendant is target)
len(list(nx.all_simple_paths(G, root, t)))
17
## all_simple_paths yields the paths lazily; each path is a list of node IDs
## Examine the first 2
list(nx.all_simple_paths(G, root, t))[0:2]
[['HP:0000118', 'HP:0000924', 'HP:0040068', 'HP:0002813', 'HP:0011297', 'HP:0030084', 'HP:0040019', 'HP:0040024'], ['HP:0000118', 'HP:0000924', 'HP:0040068', 'HP:0002813', 'HP:0011297', 'HP:0001167', 'HP:0004097', 'HP:0009317', 'HP:0040024']]
Pandas are cute.
We use a pandas DataFrame, which we will construct from a table of terms plus their path statistics.
def get_pathstats(nodes):
    """
    For each node, return a table row (dict) with path stats
    """
    items = []
    for n in nodes:
        # Enumerate every root-to-node path (expensive for highly latticed terms)
        paths = list(nx.all_simple_paths(G, root, n))
        longest = len(max(paths, key=len))
        items.append({'id': n,
                      'label': ont.label(n),
                      'pathcount': len(paths),
                      'longest': longest})
    return items
## Test it out
sample = list(ont.descendants(root))[0:20]
items = get_pathstats(sample)
items[0:3]
[{'id': 'HP:0005237', 'label': 'Degenerative liver disease', 'longest': 5, 'pathcount': 1}, {'id': 'HP:0002251', 'label': 'Aganglionic megacolon', 'longest': 8, 'pathcount': 3}, {'id': 'HP:0005102', 'label': 'Cochlear degeneration', 'longest': 6, 'pathcount': 1}]
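`get_pathstats` enumerates every path, which is where the time goes for highly latticed terms. If only the path count and the longest path length are needed, both can be derived without enumeration, because in a DAG the number of root-to-node paths is the sum of the path counts of the node's parents. The following is a sketch of that alternative technique on a hypothetical toy hierarchy, not the method used in this notebook. The `PARENTS` dict and `path_stats` are illustrative names; for the real graph one would presumably obtain parents from `G.predecessors(n)`, assuming edges run from parent to child as the root-to-term paths above suggest.

```python
from functools import lru_cache

# Hypothetical toy hierarchy: child -> list of direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c", "b"],
}


@lru_cache(maxsize=None)
def path_stats(node, root="root"):
    """Return (number of root-to-node paths, node count of the longest path)."""
    if node == root:
        return 1, 1
    count, longest = 0, 0
    for parent in PARENTS[node]:
        p_count, p_longest = path_stats(parent, root)
        if p_count:  # ignore parents that cannot reach the root
            count += p_count
            longest = max(longest, p_longest + 1)
    return count, longest


for node in PARENTS:
    print(node, path_stats(node))
```

Each node is evaluated once thanks to the cache, so the cost is proportional to the number of edges rather than the number of paths. The recursion assumes the hierarchy is acyclic and not deeper than Python's recursion limit.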
## Look at same table in pandas
import pandas as pd
df = pd.DataFrame(items)
df
| | id | label | longest | pathcount |
---|---|---|---|---|
0 | HP:0005237 | Degenerative liver disease | 5 | 1 |
1 | HP:0002251 | Aganglionic megacolon | 8 | 3 |
2 | HP:0005102 | Cochlear degeneration | 6 | 1 |
3 | HP:0006466 | Ankle contracture | 9 | 6 |
4 | HP:0004292 | Undermodelled hand bones | 6 | 1 |
5 | HP:0004839 | Pyropoikilocytosis | 7 | 1 |
6 | HP:0008970 | Scapulohumeral muscular dystrophy | 5 | 1 |
7 | HP:0008573 | Low-frequency sensorineural hearing impairment | 6 | 2 |
8 | HP:0005435 | Impaired T cell function | 8 | 3 |
9 | HP:0009218 | Fragmentation of the epiphysis of the middle p... | 13 | 96 |
10 | HP:0005021 | Bilateral elbow dislocations | 8 | 3 |
11 | HP:0010964 | Abnormality of long-chain fatty-acid metabolism | 5 | 1 |
12 | HP:0008019 | Superior lens subluxation | 9 | 1 |
13 | HP:0030883 | Femoroacetabular Impingement | 8 | 4 |
14 | HP:0005303 | Aortic arch calcification | 9 | 5 |
15 | HP:0000741 | Apathy | 7 | 1 |
16 | HP:0040208 | Elevated CSF biopterin level | 7 | 2 |
17 | HP:0030031 | Small toe | 10 | 13 |
18 | HP:0025348 | Abnormality of the corneal limbus | 7 | 1 |
19 | HP:0100720 | Hypoplasia of the ear cartilage | 5 | 1 |
## Basic aggregate stats (over our small sample, which may not be representative)
df['pathcount'].mean()
7.3499999999999996
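The mean is pulled up by a single term with 96 paths, so the median is a useful companion statistic. A minimal sketch using only the standard library, with the `pathcount` column copied by hand from the 20-row sample table above:

```python
from statistics import mean, median

# pathcount column copied from the 20-row sample table
pathcounts = [1, 3, 1, 6, 1, 1, 1, 2, 3, 96, 3, 1, 1, 4, 5, 1, 2, 13, 1, 1]

print("mean:  ", mean(pathcounts))
print("median:", median(pathcounts))
print("max:   ", max(pathcounts))
```

Half of the sampled terms have exactly one path to the root, giving a median of 1.5 against a mean of 7.35. On the DataFrame itself, `df['pathcount'].median()` or `df['pathcount'].describe()` should give the same kind of summary.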
Let's draw a simple bar chart showing the distribution of path counts for our sample.
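## Note: from plotly 4 onward this module lives in the separate chart_studio package
## (import chart_studio.plotly as py)
import plotly.plotly as py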
import plotly.graph_objs as go
data = [
go.Bar(
x=df['label'], # x axis: term labels
y=df['pathcount']
)
]
# IPython notebook
py.iplot(data, filename='pandas-bar-chart')
# use this in non-notebook context
# url = py.plot(data, filename='pandas-bar-chart')
Warning: this can take over an hour. If running interactively, be patient!
Help wanted: is there a way to make Jupyter show a progress bar for cases like this?
sample = list(ont.descendants(root))
items = get_pathstats(sample)
items[0:3]
[{'id': 'HP:0005237', 'label': 'Degenerative liver disease', 'longest': 5, 'pathcount': 1}, {'id': 'HP:0002251', 'label': 'Aganglionic megacolon', 'longest': 8, 'pathcount': 3}, {'id': 'HP:0005102', 'label': 'Cochlear degeneration', 'longest': 6, 'pathcount': 1}]
len(items)
12066
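On the progress-bar question: one common option is the third-party `tqdm` package, which wraps any iterable and reports progress as it is consumed. The sketch below is an assumption about how that could look here, not something this notebook was run with. `slow_stat` is a hypothetical stand-in for the per-node path computation, and the fallback simply disables progress reporting when `tqdm` is not installed.

```python
import time

try:
    # Third-party; tqdm.auto picks a notebook widget when one is available
    from tqdm.auto import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback: no progress display, just pass the iterable through."""
        return iterable


def slow_stat(n):
    """Hypothetical stand-in for an expensive per-node computation."""
    time.sleep(0.001)
    return {"id": n, "value": n * n}


nodes = list(range(200))
rows = [slow_stat(n) for n in tqdm(nodes, desc="path stats")]
print(len(rows))
```

Applied to `get_pathstats`, the only change would be to iterate with `for n in tqdm(nodes):` instead of `for n in nodes:`.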
df = pd.DataFrame(items)
df['pathcount'].mean()
6.6176031824962704
df['pathcount'].max()
200
data = [
go.Bar(
x=df['label'], # x axis: term labels
y=df['pathcount']
)
]
# IPython notebook
py.iplot(data, filename='pandas-bar-chart-all')
data = [
go.Scatter(
x=df['longest'], # x axis: length of the longest path
y=df['pathcount'],
mode = 'markers'
)
]
# IPython notebook
py.iplot(data, filename='pandas-longest-vs-numpaths')
max_num_paths = df['pathcount'].max()
nodes_with_max = [x['id'] for x in items if x['pathcount'] == max_num_paths]
nodes_with_max
['HP:0100379', 'HP:0010432', 'HP:0010102', 'HP:0100378']
[ont.label(n) for n in nodes_with_max]
['Aplasia of the distal phalanx of the 4th toe', 'Absent distal phalanx of the 2nd toe', 'Aplasia of the distal phalanx of the hallux', 'Absent distal phalanx of the 3rd toe']
len(nodes_with_max)
4
## Pick an arbitrary term from list
t = nodes_with_max[0]
ancs = ont.ancestors(t, reflexive=True)
ancs = [a for a in ancs if a.startswith('HP:')]
len(ancs)
36
## Make a sub-ontology with just term and ancestors
subont = ont.subontology(ancs)
sample_path = list(nx.all_simple_paths(G, root, t))[0]
sample_path
['HP:0000118', 'HP:0000924', 'HP:0040068', 'HP:0040069', 'HP:0006493', 'HP:0006494', 'HP:0001991', 'HP:0010760', 'HP:0010185', 'HP:0100370', 'HP:0100379']
## Render the sub-ontology,
## highlighting a sample path
from ontobio.io.ontol_renderers import GraphRenderer
w = GraphRenderer.create('png')
w.outfile = 'output/multipath.png'
w.write(subont, query_ids=sample_path)