Why predicates are important¶

This notebook is intended as an explanatory guide to the importance of edge types (predicates) in ontologies.

Citric acid and its ion forms¶

For this guide, we are going to look at citric acid and its conjugate forms such as citrate(3-). These chemical entities are very similar and in fact are readily interchangeable in cells.

Biologists and biochemists may talk of "citric acid" and "citrate" interchangeably.

We can see the corresponding CHEBI entries:

Citric acid¶

CHEBI:30769

Citrate(3-)¶

CHEBI:16947

Accessing CHEBI through OAK¶

There are different ways to access CHEBI; here we will use the sqlite adapter. See also part 7 of the tutorial.

We will use the selector sqlite:obo:chebi to access CHEBI.

We will be using the command line interface via Jupyter for this tutorial, but the equivalent operations can be done via Python.

First we will set up a Jupyter alias.

In [1]:

%alias chebi runoak -i sqlite:obo:chebi

If we wanted to do the equivalent on the command line, we would do:

alias chebi="runoak -i sqlite:obo:chebi"

Basic lookup¶

Next we will do some basic lookup. The first time you run this, it may take some time, as the sqlite file is downloaded. Subsequent operations will be faster.

In [3]:

chebi info "citric acid"

CHEBI:30769 ! citric acid

Term metadata¶

To check we have the right term, let's look at all of the CHEBI metadata, including mappings and chemical formulae:

In [4]:

chebi term-metadata "citric acid"

IAO:0000115: A tricarboxylic acid that is propane-1,2,3-tricarboxylic acid bearing
  a hydroxy substituent at position 2. It is an important metabolite in the pathway
  of all aerobic organisms.
id: CHEBI:30769
obo:chebi/charge: '0'
obo:chebi/formula: C6H8O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-N
obo:chebi/mass: '192.123'
obo:chebi/monoisotopicmass: '192.02700'
obo:chebi/smiles: OC(=O)CC(O)(CC(O)=O)C(O)=O
oio:hasAlternativeId:
- CHEBI:23322
- CHEBI:3727
- CHEBI:41523
oio:hasDbXref:
- BPDB:1359
- Beilstein:782061
- CAS:77-92-9
- DrugBank:DB04272
- Drug_Central:666
- Gmelin:4240
- HMDB:HMDB0000094
- KEGG:C00158
- KEGG:D00037
- KNApSAcK:C00007619
- MetaCyc:CIT
- PDBeChem:CIT
- PMID:11762832
- PMID:11782123
- PMID:11857437
- PMID:14537820
- PMID:15311880
- PMID:15934243
- PMID:16232627
- PMID:17190852
- PMID:17357118
- PMID:17604395
- PMID:18298573
- PMID:18960216
- PMID:19288211
- PMID:22115968
- PMID:22192423
- PMID:22264346
- PMID:22373571
- PMID:22509852
- Reaxys:782061
- Wikipedia:Citric_Acid
oio:hasExactSynonym:
- 2-hydroxypropane-1,2,3-tricarboxylic acid
- CITRIC ACID
- Citric acid
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-Hydroxy-1,2,3-propanetricarboxylic acid
- 2-Hydroxytricarballylic acid
- 3-Carboxy-3-hydroxypentane-1,5-dioic acid
- Citronensaeure
- E330
- H3cit
oio:id: CHEBI:30769
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citric acid

---

We can do the same thing for the same chemical in a different protonation state, citrate(3-):

In [6]:

chebi term-metadata "citrate(3-)"

IAO:0000115: A tricarboxylic acid trianion, obtained by deprotonation of the three
  carboxy groups of citric acid.
id: CHEBI:16947
obo:chebi/charge: '-3'
obo:chebi/formula: C6H5O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)/p-3
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-K
obo:chebi/mass: '189.09970'
obo:chebi/monoisotopicmass: '189.00517'
obo:chebi/smiles: OC(CC([O-])=O)(CC([O-])=O)C([O-])=O
oio:hasAlternativeId:
- CHEBI:13999
- CHEBI:23321
- CHEBI:42563
oio:hasDbXref:
- Beilstein:1884707
- CAS:126-44-3
- Gmelin:4239
- KEGG:C00158
- PDBeChem:FLC
- Reaxys:1884707
oio:hasExactSynonym: 2-hydroxypropane-1,2,3-tricarboxylate
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-hydroxy-1,2,3-propanetricarboxylate
- 2-hydroxy-1,2,3-propanetricarboxylate(3-)
- 2-hydroxy-1,2,3-propanetricarboxylic acid, ion(3-)
- 2-hydroxytricarballylate
- CITRATE ANION
- cit
- cit(3-)
- citrate
oio:id: CHEBI:16947
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citrate(3-)

---

Computing similarity¶

There are various ways to measure chemical similarity.

Here we are using an ontology library, not a chemical library like RDKit, so we measure the similarity of two terms with respect to their shared parentage in the ontology.

We will use the similarity command that measures semantic similarity.

Like many OAK operations, it is parameterized by a predicates option. For example:

--predicates rdfs:subClassOf

This can be shortened to:

-p i

This instructs OAK to use only the is-a relationship when computing parentage.

Note that many libraries don't provide any option here, and only allow is-a relationships.

The similarity command takes two term lists separated by @. Here we just want a simple pairwise comparison, so we specify one term on either side:

In [8]:

chebi similarity -p i "citrate(3-)" @ "citric acid"

ancestor_id: CHEBI:37577
ancestor_information_content: 1.792616548579986
ancestor_label: heteroatomic molecular entity
jaccard_similarity: 0.25
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 0.6694431545284457
subject_id: CHEBI:16947
subject_label: citrate(3-)

---

What is this telling us?

The Jaccard similarity is 0.25, which is very low
The most recent common ancestor (MRCA) is the very general and abstract-sounding heteroatomic molecular entity, which has a low information content (IC) of about 1.8
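
As a rough guide to reading these numbers, information content is commonly defined as the negative log of the fraction of entities that a term subsumes. The sketch below assumes that definition with a base-2 logarithm. The function name and the counts are illustrative only, and the exact frequencies OAK uses for CHEBI are not shown here.

```python
import math


def information_content(term_count: int, total_count: int) -> float:
    """IC = -log2(p), where p is the fraction of entities subsumed by the term."""
    if not 0 < term_count <= total_count:
        raise ValueError("term_count must be between 1 and total_count")
    return -math.log2(term_count / total_count)


# Hypothetical counts: a very general term covers much of the ontology,
# while a specific term covers only a handful of entries.
general = information_content(50_000, 100_000)
specific = information_content(10, 100_000)
print(f"general: {general:.2f}, specific: {specific:.2f}")
```

Under this definition a term near the root scores close to zero, which is why a common ancestor with an IC below 2 says very little about the two terms it subsumes.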

Why is the score so low?

Remember at the start of this guide we looked at the chemical structures, which are almost identical. And biologically these are interchangeable. Why is the similarity so low?

Investigating ontological oddities is one of the strengths of OAK. We can take a number of different approaches, but the easiest is to start by just visualizing the terms and their ancestors:

In [9]:

chebi viz -p i "citrate(3-)" "citric acid" -o output/citrate.png

As can be seen, the is-a graphs of these two terms are almost completely separated. You have to go all the way up to heteroatomic molecular entity to find the common ancestor, just like the similarity output told us.

So what's going on?¶

So what's going on here? Is there something missing from CHEBI?

In fact, CHEBI is like this by design, and we see the same pattern/template repeated for all acids.

But all is not lost: CHEBI has other relationships we can use here.

Which leads us to one of the main lessons when using ontologies:

Always make use of the full range of edge types¶

CHEBI has many other edge types we can use here.

Currently OAK doesn't have a quick way of summarizing edge statistics, but we can do this easily with a SQL query on the sqlite database we downloaded earlier, querying the Edge table:

In [3]:

!echo "SELECT predicate, count(*) FROM edge GROUP BY predicate;"  | sqlite3 $HOME/.data/oaklib/chebi.db

BFO:0000051|3947
RO:0000087|42533
obo:chebi#has_functional_parent|18459
obo:chebi#has_parent_hydride|1752
obo:chebi#is_conjugate_acid_of|8340
obo:chebi#is_conjugate_base_of|8340
obo:chebi#is_enantiomer_of|2700
obo:chebi#is_substituent_group_from|1279
obo:chebi#is_tautomer_of|1846
rdfs:subClassOf|235113

CHEBI mostly uses its own relationship types, plus a few from RO and BFO. We can query what these are:

In [6]:

chebi info BFO:0000051 RO:0000087

BFO:0000051 ! has part
RO:0000087 ! has role
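
The edge tally shown earlier can also be sketched from Python with the standard sqlite3 module. To stay self-contained, the example below builds a tiny in-memory table instead of opening the downloaded file. The three-column layout of the edge table and the rows themselves are assumptions made for illustration, and `predicate_counts` is a hypothetical helper, not an OAK function.

```python
import sqlite3

# Toy stand-in for the downloaded database: an `edge` table with
# subject/predicate/object columns (assumed layout, hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("X:1", "rdfs:subClassOf", "X:2"),
        ("X:2", "rdfs:subClassOf", "X:3"),
        ("X:1", "BFO:0000051", "X:4"),
    ],
)


def predicate_counts(connection: sqlite3.Connection) -> dict:
    """Return the number of edges per predicate."""
    cursor = connection.execute(
        "SELECT predicate, count(*) FROM edge GROUP BY predicate"
    )
    return dict(cursor.fetchall())


counts = predicate_counts(conn)
print(counts)
```

To run the same function against the real data, one would pass a connection opened on the chebi.db path used in the shell command above.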

Next let's try again, using the viz command, but this time adding a different predicate.

We can specify a comma-separated list of predicates with the --predicates option on most commands:

In [9]:

chebi viz -p "i,obo:chebi#is_conjugate_acid_of" "citrate(3-)" "citric acid" -o output/citrate-conj-acid-of.png

This time the terms are much closer together. However, they are not "next" to each other, which brings us to another lesson:

Number of hops is often meaningless with ontologies¶

A common metric with graph operations is counting number of hops. However, for knowledge graphs, this metric can be misleading or meaningless. It may be tempting to do something like "weighting" predicates but this is always ad-hoc.

With ontologies, predicates have meaning and we want this to be taken into account.

Calculating similarity using all predicates¶

In OAK the default is usually to use all predicates.

Thus we can simply ask for the overall similarity between citric acid and the 3- form, omitting the predicates option:

In [11]:

chebi similarity "citrate(3-)" @ "citric acid"

ancestor_id: CHEBI:133748
ancestor_information_content: 12.881018338799942
ancestor_label: citrate anion
jaccard_similarity: 0.6526315789473685
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 2.899406721538221
subject_id: CHEBI:16947
subject_label: citrate(3-)

---

This is much better than before, reflecting the true biochemical similarity between these.

The Jaccard similarity is 0.65, which is much higher but still not great
The MRCA is the more meaningful citrate anion, which has a higher IC of 12.88

So how did OAK calculate this?

Here OAK made use of all edge types in the CHEBI relation graph. This is in contrast to many methods that only make use of is-a relationships. Restricting to is-a might be a reasonable baked-in assumption if you are only doing similarity on HPO (but even then it can be limiting).

For other ontologies, we need to make use of other predicates.

At this stage you may be thinking: "Ah! All ontologies are DAGs, so OAK is using the CHEBI DAG here!"

This brings us to our next point:

Ontologies are not DAGs¶

This is a common misconception. Ontologies are not DAGs, no matter what you may have previously heard.

And in general, avoid baking in assumptions generalized from a few cases when it comes to ontologies.

Often ontologies will be released in a form that is guaranteed to be a DAG because so many tools mistakenly assume an ontology is a DAG. But if you are using one of these simplified forms of an ontology, you are missing useful information.

Let's take a look at CHEBI again. This time we will use two different relationship types (predicates), and exclude is-a:

In [13]:

chebi viz CHEBI:16947 CHEBI:30769 -p "obo:chebi#is_conjugate_acid_of,obo:chebi#is_conjugate_base_of" -o output/citrate-not-a-dag.png

This is definitely not a DAG.

In fact, there is no reason to assume that the structure of an ontology or a knowledge graph will be a DAG. Many relationships in real life are inherently cyclic, and this definitely holds for chemistry: we have cyclic structures and cyclic relationships, and chemicals cycle through different protonation states.
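
The presence of a cycle is easy to confirm programmatically. The following is a minimal sketch over a hand-written, simplified edge list that uses labels rather than CHEBI IDs and assumes the protonation states are chained by reciprocal conjugate acid/base edges. The `has_cycle` helper is illustrative and not part of OAK.

```python
from collections import defaultdict

# Hand-written, simplified edges (labels instead of CHEBI IDs).
edges = [
    ("citric acid", "is_conjugate_acid_of", "citrate(1-)"),
    ("citrate(1-)", "is_conjugate_base_of", "citric acid"),
    ("citrate(1-)", "is_conjugate_acid_of", "citrate(2-)"),
    ("citrate(2-)", "is_conjugate_base_of", "citrate(1-)"),
    ("citrate(2-)", "is_conjugate_acid_of", "citrate(3-)"),
    ("citrate(3-)", "is_conjugate_base_of", "citrate(2-)"),
]


def has_cycle(edge_list):
    """Return True if the directed graph contains a cycle (depth-first search)."""
    graph = defaultdict(list)
    for subject, _predicate, obj in edge_list:
        graph[subject].append(obj)

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # we came back to a node on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


print(has_cycle(edges))
```

Any pair of reciprocal predicates produces a two-node cycle, so the check succeeds as soon as one acid/base pair is present.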

So how do we handle these?¶

At this point you might be thinking it makes no sense to use measures like semantic similarity over cyclic graphs. Or that it may be necessary to include ad-hoc measures like maximum distance. But this isn't the case.

Because ontology graphs are existential graphs over concepts, where the existence of the subject depends on the existence of the object, you can still use algorithms designed around the concepts of "ancestors" and "descendants". The overall structure of the relation graph will still (in general) follow a pattern of converging on more general concepts.

There are two broad approaches:

naive graph walking, with cycle checks
use the relation graph

The first is trivial to implement: write the traversal as you normally would, but drop the assumption of acyclicity by keeping track of the nodes already visited.
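
A minimal sketch of the first approach is shown below, assuming the graph is available as a list of (subject, predicate, object) triples. The `ancestors` function and the toy edges are hypothetical and are not OAK's implementation. The set of already-reached nodes is what keeps the walk from looping forever on a cycle.

```python
from collections import deque


def ancestors(start, edges, predicates=None):
    """Collect every node reachable from `start` over the chosen predicates.

    Nodes are recorded when first reached, so cycles cannot cause an
    infinite loop. `predicates=None` means "follow every predicate".
    """
    reachable = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, predicate, obj in edges:
            if subject != node:
                continue
            if predicates is not None and predicate not in predicates:
                continue
            if obj not in reachable:
                reachable.add(obj)
                queue.append(obj)
    return reachable


# Toy graph with a reciprocal (cyclic) pair of edges.
toy_edges = [
    ("acid", "is_conjugate_acid_of", "base"),
    ("base", "is_conjugate_base_of", "acid"),
    ("acid", "rdfs:subClassOf", "carboxylic acid"),
    ("base", "rdfs:subClassOf", "carboxylate anion"),
]

print(ancestors("acid", toy_edges, predicates={"rdfs:subClassOf"}))
print(ancestors("acid", toy_edges))
```

Note that with all predicates enabled the start node reaches itself through the cycle, so it appears in its own ancestor set.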

The OAK sqlite adapter uses the second approach, making use of Relation Graph.

Relation Graph¶

Relation graph is a tool for calculating the closure of ontology relationships. Unlike naive graph walking, it takes into account the semantics of the ontology and of predicates as intended by the producers of these ontologies.

More formally, RG materializes the entailment of all SubClassOf axioms, including those axioms that have existential restrictions on the right hand side.
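
To make the idea of materialized entailments concrete, here is a toy sketch of one inference pattern: if A is a subclass of B, B is related by R to some C, and C is a subclass of D, then A is related by R to some D. This is an illustration only and not how relation-graph works internally (it relies on an OWL reasoner). The class names, axioms, and function names are hypothetical, and the sketch ignores property chains and transitive properties.

```python
def subclass_closure(subclass_pairs, classes):
    """Map each class to the set of its superclasses (including itself)."""
    supers = {c: {c} for c in classes}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_pairs:
            for c in classes:
                if child in supers[c] and not supers[parent] <= supers[c]:
                    supers[c] |= supers[parent]
                    changed = True
    return supers


def entailed_edges(subclass_pairs, existential):
    """Materialize subclass edges and 'R some' edges implied by the axioms."""
    classes = {c for pair in subclass_pairs for c in pair}
    classes |= {c for s, _r, o in existential for c in (s, o)}
    supers = subclass_closure(subclass_pairs, classes)
    entailed = set()
    for a in classes:
        for b in supers[a]:
            entailed.add((a, "rdfs:subClassOf", b))
            for s, r, o in existential:
                if s == b:
                    entailed.update((a, r, d) for d in supers[o])
    return entailed


# Hypothetical, simplified axioms.
SUBCLASS = {("astrocyte", "glial cell"), ("nervous system", "anatomical system")}
EXISTENTIAL = {("glial cell", "part_of", "nervous system")}

result = entailed_edges(SUBCLASS, EXISTENTIAL)
for edge in sorted(result):
    print(edge)
```

The output includes edges that were never asserted directly, such as the toy astrocyte being part of some anatomical system, as well as reflexive subclass edges.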

Relation graph can be obtained and installed from its GitHub repo.

In [15]:

!relation-graph --help

relation-graph
Usage: relation-graph [options]
  --usage  <bool>
        Print usage and exit
  --help | -h  <bool>
        Print help message and exit
  --ontology-file  <filename>
        Input OWL ontology
  --output-file  <filename>
        File to stream output triples to.
  --mode  <RDF|OWL>
        Configure style of triples to be output. RDF mode is the default; each existential relation is collapsed to a single direct triple.
  --property  <IRI>
        Property to restrict output relations to. Provide option multiple times for multiple properties.
  --properties-file  <filename>
        File containing line-separated property IRIs to restrict output relations to.
  --output-subclasses  <bool>
        Include entailed rdfs:subClassOf or owl:equivalentClass relations in output (default false)
  --reflexive-subclasses  <bool>
        When outputting rdfs:subClassOf, include relations to self for every class (default true)
  --equivalence-as-subclass  <bool>
        When outputting equivalent classes, output reciprocal rdfs:subClassOf triples instead of owl:equivalentClass triples (default true)
  --output-classes  <bool>
        Output any triples where classes are subjects (default true)
  --output-individuals  <bool>
        Output triples where individuals are subjects, with classes as objects (default false)
  --disable-owl-nothing  <bool>
        Disable inference of unsatisfiable classes by the whelk reasoner (default false)
  --verbose  <bool>
        Set log level to INFO

SemSQL Builds have relation-graph pre-computed¶

If you access an ontology via the sqlite method, it will make use of an ontology loaded in using the SemSQL schema, which has relation-graph precomputed.

We can take a look first at the Cell Ontology (CL):

In [17]:

%alias cl runoak -i sqlite:obo:cl

The OAK relationships command will query all relationships (by default "outgoing") for an entity.

If you pass in --include-entailed it will also include entailed relationships (those inferred by a reasoner), here coming from RG:

In [24]:

cl relationships --include-entailed astrocyte > output/astrocyte-rg.tsv

An RG can be large, so we will explore it with pandas:

In [26]:

import pandas as pd
df = pd.read_csv("output/astrocyte-rg.tsv", sep="\t")
df

Out[26]:

	subject	predicate	object	subject_label	predicate_label	object_label
0	CL:0000127	BFO:0000050	BFO:0000002	astrocyte	part of	continuant
1	CL:0000127	BFO:0000050	BFO:0000004	astrocyte	part of	independent continuant
2	CL:0000127	BFO:0000050	BFO:0000040	astrocyte	part of	material entity
3	CL:0000127	BFO:0000050	CARO:0000000	astrocyte	part of	anatomical entity
4	CL:0000127	BFO:0000050	CARO:0000006	astrocyte	part of	material anatomical entity
...	...	...	...	...	...	...
299	CL:0000127	rdfs:subClassOf	CL:0000127	astrocyte	None	astrocyte
300	CL:0000127	rdfs:subClassOf	CL:0000255	astrocyte	None	eukaryotic cell
301	CL:0000127	rdfs:subClassOf	CL:0000548	astrocyte	None	animal cell
302	CL:0000127	rdfs:subClassOf	CL:0002319	astrocyte	None	neural cell
303	CL:0000127	rdfs:subClassOf	CL:0002371	astrocyte	None	somatic cell

304 rows × 6 columns

These are all guaranteed correct (according to the semantics of classes and object properties in the ontology).

They are not guaranteed to be useful. There are many trivial edges, e.g.

every astrocyte is part of some material entity
every astrocyte is a subtype of a continuant

But these less useful ones will "fall out in the wash" when we use them in methods like semantic similarity.
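
One quick way to get a feel for such a table is to count rows per predicate. The sketch below uses only the ten rows visible in the output above, copied by hand, rather than the full 304-row file, so the counts it produces describe this sample only.

```python
from collections import Counter

# The ten rows visible in the truncated DataFrame output (subject, predicate, object).
rows = [
    ("CL:0000127", "BFO:0000050", "BFO:0000002"),
    ("CL:0000127", "BFO:0000050", "BFO:0000004"),
    ("CL:0000127", "BFO:0000050", "BFO:0000040"),
    ("CL:0000127", "BFO:0000050", "CARO:0000000"),
    ("CL:0000127", "BFO:0000050", "CARO:0000006"),
    ("CL:0000127", "rdfs:subClassOf", "CL:0000127"),
    ("CL:0000127", "rdfs:subClassOf", "CL:0000255"),
    ("CL:0000127", "rdfs:subClassOf", "CL:0000548"),
    ("CL:0000127", "rdfs:subClassOf", "CL:0002319"),
    ("CL:0000127", "rdfs:subClassOf", "CL:0002371"),
]

# Tally how many sampled rows use each predicate.
per_predicate = Counter(predicate for _subject, predicate, _object in rows)
print(per_predicate.most_common())
```

On the full DataFrame loaded earlier, the pandas equivalent would be `df["predicate"].value_counts()`.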

We can query the EntailedEdge table more directly:

In [30]:

!echo "SELECT * FROM entailed_edge WHERE predicate='BFO:0000050' AND object LIKE 'UBERON:%' LIMIT 5"  | sqlite3 $HOME/.data/oaklib/cl.db

UBERON:0001612|BFO:0000050|UBERON:0000477
UBERON:0001612|BFO:0000050|UBERON:0010000
UBERON:0001612|BFO:0000050|UBERON:0000055
UBERON:0001612|BFO:0000050|UBERON:0003509
UBERON:0001612|BFO:0000050|UBERON:0001637

OAK uses Relation Graph in semantic similarity¶

The OAK SQL backend will use RG when calculating semantic similarity.

there is no need to worry whether the structure is a DAG or a tree
no need to implement "hops" or ad-hoc mechanisms

All semantic similarity measures then become trivial operations on RG, parameterizable by a semantic predicate.

Jaccard is simply the number of ancestors the two terms share in the RG divided by the number of ancestors in the union of their ancestor sets
MRCA, IC, etc. work as expected
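
As an illustration of how little machinery is needed once ancestor sets are available, the sketch below computes Jaccard similarity and picks the most informative common ancestor. The ancestor sets and IC values are made up, and both function names are hypothetical. In practice the sets would come from the entailed edges for the chosen predicates.

```python
def jaccard(ancestors_a: set, ancestors_b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = ancestors_a | ancestors_b
    return len(ancestors_a & ancestors_b) / len(union) if union else 0.0


def most_informative_common_ancestor(ancestors_a: set, ancestors_b: set, ic: dict):
    """Return the shared ancestor with the highest information content, if any."""
    common = ancestors_a & ancestors_b
    return max(common, key=lambda term: ic[term]) if common else None


# Made-up reflexive ancestor sets and IC values for two toy terms x and y.
ancestors_x = {"x", "p", "q", "root"}
ancestors_y = {"y", "q", "root"}
ic = {"root": 0.0, "q": 3.5, "p": 5.0, "x": 9.0, "y": 9.0}

print(jaccard(ancestors_x, ancestors_y))
print(most_informative_common_ancestor(ancestors_x, ancestors_y, ic))
```

Nothing here depends on the graph being a tree or a DAG, since the functions only ever see sets.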

Relation Graph is only as good as its inputs¶

Garbage in, garbage out:

if ontologies include false axioms, RG will give false results
if ontologies are incomplete, RG will give incomplete answers

For most ontologies we care about in a project like Monarch, we have decent QC in place, and in general most methods should be resilient to the errors and gaps that remain.
