This notebook is intended as an explanatory guide to the importance of edge types (predicates) in ontologies.
For this guide, we are going to look at citric acid and its conjugate forms such as citrate(3-). These chemical entities are very similar and in fact are readily interchangeable in cells.
Biologists and biochemists may talk of "citric acid" and "citrate" interchangeably.
We can see the corresponding CHEBI entries:
There are different ways to access CHEBI; we will use the sqlite adapter. See also part 7 of the tutorial.
We will use the selector sqlite:obo:chebi to access CHEBI.
We will be using the command line interface via Jupyter for this tutorial, but the equivalent operations can be done via Python.
First we will set up a Jupyter alias.
%alias chebi runoak -i sqlite:obo:chebi
If we wanted to do the equivalent on the command line, we would do:
alias chebi="runoak -i sqlite:obo:chebi"
Next we will do some basic lookup. The first time you run this, it may take some time, as the sqlite file is downloaded. Subsequent operations will be faster.
chebi info "citric acid"
CHEBI:30769 ! citric acid
To check we have the right term, let's look at all of the CHEBI metadata, including mappings and chemical formulae:
chebi term-metadata "citric acid"
IAO:0000115: A tricarboxylic acid that is propane-1,2,3-tricarboxylic acid bearing a hydroxy substituent at position 2. It is an important metabolite in the pathway of all aerobic organisms.
id: CHEBI:30769
obo:chebi/charge: '0'
obo:chebi/formula: C6H8O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-N
obo:chebi/mass: '192.123'
obo:chebi/monoisotopicmass: '192.02700'
obo:chebi/smiles: OC(=O)CC(O)(CC(O)=O)C(O)=O
oio:hasAlternativeId:
- CHEBI:23322
- CHEBI:3727
- CHEBI:41523
oio:hasDbXref:
- BPDB:1359
- Beilstein:782061
- CAS:77-92-9
- DrugBank:DB04272
- Drug_Central:666
- Gmelin:4240
- HMDB:HMDB0000094
- KEGG:C00158
- KEGG:D00037
- KNApSAcK:C00007619
- MetaCyc:CIT
- PDBeChem:CIT
- PMID:11762832
- PMID:11782123
- PMID:11857437
- PMID:14537820
- PMID:15311880
- PMID:15934243
- PMID:16232627
- PMID:17190852
- PMID:17357118
- PMID:17604395
- PMID:18298573
- PMID:18960216
- PMID:19288211
- PMID:22115968
- PMID:22192423
- PMID:22264346
- PMID:22373571
- PMID:22509852
- Reaxys:782061
- Wikipedia:Citric_Acid
oio:hasExactSynonym:
- 2-hydroxypropane-1,2,3-tricarboxylic acid
- CITRIC ACID
- Citric acid
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-Hydroxy-1,2,3-propanetricarboxylic acid
- 2-Hydroxytricarballylic acid
- 3-Carboxy-3-hydroxypentane-1,5-dioic acid
- Citronensaeure
- E330
- H3cit
oio:id: CHEBI:30769
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citric acid
---
We can do the same thing for the same chemical in a different protonation state, citrate(3-):
chebi term-metadata "citrate(3-)"
IAO:0000115: A tricarboxylic acid trianion, obtained by deprotonation of the three carboxy groups of citric acid.
id: CHEBI:16947
obo:chebi/charge: '-3'
obo:chebi/formula: C6H5O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)/p-3
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-K
obo:chebi/mass: '189.09970'
obo:chebi/monoisotopicmass: '189.00517'
obo:chebi/smiles: OC(CC([O-])=O)(CC([O-])=O)C([O-])=O
oio:hasAlternativeId:
- CHEBI:13999
- CHEBI:23321
- CHEBI:42563
oio:hasDbXref:
- Beilstein:1884707
- CAS:126-44-3
- Gmelin:4239
- KEGG:C00158
- PDBeChem:FLC
- Reaxys:1884707
oio:hasExactSynonym: 2-hydroxypropane-1,2,3-tricarboxylate
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-hydroxy-1,2,3-propanetricarboxylate
- 2-hydroxy-1,2,3-propanetricarboxylate(3-)
- 2-hydroxy-1,2,3-propanetricarboxylic acid, ion(3-)
- 2-hydroxytricarballylate
- CITRATE ANION
- cit
- cit(3-)
- citrate
oio:id: CHEBI:16947
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citrate(3-)
---
There are various ways to measure chemical similarity.
Here we are using an ontology library, not a chemical library like RDKit, so we can measure similarity with respect to their shared parentage.
We will use the similarity command that measures semantic similarity.
Like many OAK operations, it is parameterized by a predicates option. For example:
--predicates rdfs:subClassOf
This can be shortened to:
-p i
This instructs OAK to use only the is-a relationship when computing parentage.
Note that many libraries don't provide any option here, and only allow is-a relationships.
The similarity command takes two term lists separated by @. Here we just want to do a simple pairwise comparison, so we specify one term on either side:
chebi similarity -p i "citrate(3-)" @ "citric acid"
ancestor_id: CHEBI:37577
ancestor_information_content: 1.792616548579986
ancestor_label: heteroatomic molecular entity
jaccard_similarity: 0.25
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 0.6694431545284457
subject_id: CHEBI:16947
subject_label: citrate(3-)
---
What is this telling us? The common ancestor reported is heteroatomic molecular entity, which has a low information content of 1.8.
Why is the score so low?
Remember at the start of this guide we looked at the chemical structures, which are almost identical. And biologically these are interchangeable. Why is the similarity so low?
Investigating ontological oddities is one of the strengths of OAK. We can take a number of different approaches, but the easiest is to start by just visualizing the terms and their ancestors:
chebi viz -p i "citrate(3-)" "citric acid" -o output/citrate.png
As can be seen, the is-a graphs of these two terms are almost completely separated. You have to go all the way up to heteroatomic molecular entity to find the common ancestor, just like the similarity output told us.
So what's going on here? Is there something missing from CHEBI?
In fact, CHEBI is like this by design, and we see the same pattern/template repeated for all acids.
But all is not lost: CHEBI has other relationships we can use here.
Which leads us to one of the main lessons when using ontologies:
CHEBI has many other edge types we can use here.
Currently OAK doesn't have a quick way of summarizing edge statistics, but we can do this easily with a SQL query on the sqlite database we downloaded earlier, querying the Edge table:
!echo "SELECT predicate, count(*) FROM edge GROUP BY predicate;" | sqlite3 $HOME/.data/oaklib/chebi.db
BFO:0000051|3947
RO:0000087|42533
obo:chebi#has_functional_parent|18459
obo:chebi#has_parent_hydride|1752
obo:chebi#is_conjugate_acid_of|8340
obo:chebi#is_conjugate_base_of|8340
obo:chebi#is_enantiomer_of|2700
obo:chebi#is_substituent_group_from|1279
obo:chebi#is_tautomer_of|1846
rdfs:subClassOf|235113
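The same edge summary can be sketched from Python with the standard-library sqlite3 module. The block below is only an illustration: it builds a tiny in-memory table with the subject/predicate/object shape seen in this guide, and the EX: identifiers and example predicates are made up. To run it against the real data, you would connect to the downloaded chebi.db file instead of ":memory:" and skip the table creation.

```python
import sqlite3

# Toy stand-in for the downloaded database. The EX: identifiers are
# hypothetical and exist only to give the query something to count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("EX:1", "rdfs:subClassOf", "EX:10"),
        ("EX:2", "rdfs:subClassOf", "EX:20"),
        ("EX:10", "rdfs:subClassOf", "EX:99"),
        ("EX:1", "obo:chebi#is_conjugate_acid_of", "EX:2"),
        ("EX:2", "obo:chebi#is_conjugate_base_of", "EX:1"),
    ],
)


def predicate_counts(connection):
    """Return {predicate: number of edges}, mirroring the SQL query above."""
    rows = connection.execute(
        "SELECT predicate, count(*) FROM edge GROUP BY predicate ORDER BY predicate"
    ).fetchall()
    return dict(rows)


counts = predicate_counts(conn)
for predicate, n in counts.items():
    print(f"{predicate}|{n}")
```

Returning a dict keeps the result easy to sort or filter before printing, which is convenient when an ontology has many predicates.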
CHEBI mostly uses its own relationship types, plus a few from RO. We can query what these are:
chebi info BFO:0000051 RO:0000087
BFO:0000051 ! has part
RO:0000087 ! has role
Next let's try again, using the viz command, but this time adding a different predicate.
We can specify a list of predicates separated by , with the --predicates option on most commands:
chebi viz -p "i,obo:chebi#is_conjugate_acid_of" "citrate(3-)" "citric acid" -o output/citrate-conj-acid-of.png
This time the terms are much closer together. However, they are not "next" to each other, which brings us to another lesson:
A common metric with graph operations is counting the number of hops. However, for knowledge graphs, this metric can be misleading or meaningless. It may be tempting to do something like "weighting" predicates, but this is always ad hoc.
With ontologies, predicates have meaning and we want this to be taken into account.
In OAK the default is usually to use all predicates.
Thus we can simply ask for the overall similarity between citric acid and the 3- form:
chebi similarity "citrate(3-)" @ "citric acid"
ancestor_id: CHEBI:133748
ancestor_information_content: 12.881018338799942
ancestor_label: citrate anion
jaccard_similarity: 0.6526315789473685
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 2.899406721538221
subject_id: CHEBI:16947
subject_label: citrate(3-)
---
This is much better than before, reflecting the true biochemical similarity between these. The common ancestor is now citrate anion, which has a higher IC of 12.88.
So how did OAK calculate this?
Here OAK made use of all edge types in the CHEBI relation graph. This is in contrast to many methods that only make use of is-a relationships. That might be a reasonable baked-in assumption if you are only doing similarity on HPO (but even then it can be limiting).
For other ontologies, we need to make use of other predicates.
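To make the difference between the two runs concrete, the short check below recomputes the reported phenodigm_score values from the other fields. It rests on an observation from the numbers in this guide rather than on OAK's source code: in both outputs the score equals the square root of jaccard_similarity multiplied by ancestor_information_content.

```python
import math

# Values copied from the two `similarity` outputs shown in this guide.
runs = {
    "is-a only (-p i)": {
        "jaccard": 0.25,
        "ic": 1.792616548579986,
        "reported": 0.6694431545284457,
    },
    "all predicates": {
        "jaccard": 0.6526315789473685,
        "ic": 12.881018338799942,
        "reported": 2.899406721538221,
    },
}


def phenodigm(jaccard, ic):
    """Geometric mean of Jaccard similarity and ancestor information content."""
    return math.sqrt(jaccard * ic)


for name, run in runs.items():
    score = phenodigm(run["jaccard"], run["ic"])
    print(f"{name}: computed={score:.6f} reported={run['reported']:.6f}")
```

Both inputs to the score improve when all predicates are used: the ancestor sets overlap more (higher Jaccard) and the shared ancestor is far more specific (higher IC).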
At this stage you may be thinking: "Ah! All ontologies are DAGs, so OAK is using the CHEBI DAG here!"
This brings us to our next point:
This is a common misconception. Ontologies are not DAGs, no matter what you may have previously heard.
And in general, avoid baking in assumptions generalized from a few cases when it comes to ontologies.
Often ontologies will be released in a form that is guaranteed to be a DAG because so many tools mistakenly assume an ontology is a DAG. But if you are using one of these dumbed down forms of an ontology you are missing useful information.
Let's take a look at CHEBI again. This time we will use two different relationship types (predicates), and exclude is-a:
chebi viz CHEBI:16947 CHEBI:30769 -p "obo:chebi#is_conjugate_acid_of,obo:chebi#is_conjugate_base_of" -o output/citrate-not-a-dag.png
This is definitely not a DAG.
In fact there is no reason to assume that the structure of an ontology or a knowledge graph will be a DAG. A lot of relationships in real life are inherently cyclic, and this definitely holds for chemistry: we have cyclic structures and cyclic relationships, and chemicals cycle through these different protonation states.
At this point you might be thinking it makes no sense to use measures like semantic similarity over cyclic graphs, or that it may be necessary to include ad hoc measures like maximum distance. But this isn't the case.
Because ontology graphs are existential graphs over concepts, where the existence of the subject depends on the existence of the object, you can still use algorithms designed with concepts of "ancestors" and "descendants". The overall structure of the relation graph will still (in general) follow a pattern of narrowing down to more general concepts.
There are two broad approaches:
The first is trivial to implement: traverse the graph as you normally would, but remove the assumption of acyclicity.
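A minimal sketch of that first approach follows, assuming edges are held as (subject, predicate, object) tuples. The node and predicate names are hypothetical and only mimic the acid/base pattern discussed in this guide; this is not how OAK itself is implemented. A visited set is all that is needed to make the walk terminate on a cyclic graph.

```python
from collections import deque


def ancestors(edges, start, predicates=None):
    """Collect every node reachable from `start`, following only the
    allowed predicates (None means all). Safe on cyclic graphs."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subj, pred, obj in edges:
            if subj != node:
                continue
            if predicates is not None and pred not in predicates:
                continue
            if obj not in seen:  # the visited check replaces the DAG assumption
                seen.add(obj)
                queue.append(obj)
    return seen


# Hypothetical toy graph: two is-a branches joined only at the root,
# plus a conjugate acid/base cycle between the two leaf terms.
edges = [
    ("acid", "is_a", "acid_class"),
    ("base", "is_a", "anion_class"),
    ("acid_class", "is_a", "entity"),
    ("anion_class", "is_a", "entity"),
    ("acid", "conjugate_acid_of", "base"),
    ("base", "conjugate_base_of", "acid"),
]

isa_only = ancestors(edges, "acid", {"is_a"})
all_preds = ancestors(edges, "acid")
print(sorted(isa_only))
print(sorted(all_preds))
```

Note that with the cycle included, the start node ends up in its own ancestor set, which is the expected result of walking a cyclic graph rather than a bug.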
The OAK sqlite adapter uses the second approach, making use of relation graph.
Relation graph is a tool for calculating the closure of ontology relationships. Unlike naive graph walking, it takes into account the semantics of the ontology and of predicates as intended by the producers of these ontologies.
More formally, RG materializes the entailment of all SubClassOf axioms, including those axioms that have existential restrictions on the right hand side.
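To give a feel for what materializing these entailments means, here is a deliberately small sketch. It is not the relation-graph implementation (which uses an OWL reasoner); it applies only two rules to a toy ontology with hypothetical class names: SubClassOf is transitive, and an existential restriction propagates down the subclass hierarchy on the subject side and up it on the filler side. Property chains, transitive properties, and other OWL features are ignored here.

```python
def entail(subclass_of, existentials):
    """Saturate a tiny ontology.

    subclass_of:  set of (sub, sup) pairs, meaning sub SubClassOf sup
    existentials: set of (sub, prop, filler) triples, meaning
                  sub SubClassOf (prop some filler)
    Returns the saturated (subclass_of, existentials) sets.
    """
    sub = set(subclass_of)
    ex = set(existentials)
    changed = True
    while changed:
        # Rule 1: A ⊑ B and B ⊑ C entail A ⊑ C
        new_sub = {(a, d) for (a, b) in sub for (c, d) in sub if b == c} - sub
        # Rule 2: A ⊑ B and B ⊑ p some C entail A ⊑ p some C
        new_ex = {(a, p, d) for (a, b) in sub for (c, p, d) in ex if b == c}
        # Rule 3: A ⊑ p some C and C ⊑ D entail A ⊑ p some D
        new_ex |= {(a, p, d) for (a, p, b) in ex for (c, d) in sub if b == c}
        new_ex -= ex
        changed = bool(new_sub or new_ex)
        sub |= new_sub
        ex |= new_ex
    return sub, ex


# Hypothetical axioms: A ⊑ B, B ⊑ part_of some C, C ⊑ D
sub, ex = entail({("A", "B"), ("C", "D")}, {("B", "part_of", "C")})
for triple in sorted(ex):
    print(triple)
```

In RDF mode, each entailed existential such as A part_of D would appear as a single direct triple, which is what makes the result easy to store and query in a table.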
Relation graph can be obtained and installed from its github repo.
!relation-graph --help
relation-graph
Usage: relation-graph [options]
  --usage <bool>  Print usage and exit
  --help | -h <bool>  Print help message and exit
  --ontology-file <filename>  Input OWL ontology
  --output-file <filename>  File to stream output triples to.
  --mode <RDF|OWL>  Configure style of triples to be output. RDF mode is the default; each existential relation is collapsed to a single direct triple.
  --property <IRI>  Property to restrict output relations to. Provide option multiple times for multiple properties.
  --properties-file <filename>  File containing line-separated property IRIs to restrict output relations to.
  --output-subclasses <bool>  Include entailed rdfs:subClassOf or owl:equivalentClass relations in output (default false)
  --reflexive-subclasses <bool>  When outputting rdfs:subClassOf, include relations to self for every class (default true)
  --equivalence-as-subclass <bool>  When outputting equivalent classes, output reciprocal rdfs:subClassOf triples instead of owl:equivalentClass triples (default true)
  --output-classes <bool>  Output any triples where classes are subjects (default true)
  --output-individuals <bool>  Output triples where individuals are subjects, with classes as objects (default false)
  --disable-owl-nothing <bool>  Disable inference of unsatisfiable classes by the whelk reasoner (default false)
  --verbose <bool>  Set log level to INFO
If you access an ontology via the sqlite method, it will make use of an ontology loaded in using the SemSQL schema, which has relation-graph precomputed.
We can take a look first at CL, the Cell Ontology:
%alias cl runoak -i sqlite:obo:cl
The OAK relationships command will query all relationships (by default "outgoing") for an entity.
If you pass in --include-entailed it will include entailed (inferred by reasoner) relationships, here coming from RG:
cl relationships --include-entailed astrocyte > output/astrocyte-rg.tsv
The size of a RG can be large so we will explore it with pandas:
import pandas as pd
df = pd.read_csv("output/astrocyte-rg.tsv", sep="\t")
df
 | subject | predicate | object | subject_label | predicate_label | object_label |
---|---|---|---|---|---|---|
0 | CL:0000127 | BFO:0000050 | BFO:0000002 | astrocyte | part of | continuant |
1 | CL:0000127 | BFO:0000050 | BFO:0000004 | astrocyte | part of | independent continuant |
2 | CL:0000127 | BFO:0000050 | BFO:0000040 | astrocyte | part of | material entity |
3 | CL:0000127 | BFO:0000050 | CARO:0000000 | astrocyte | part of | anatomical entity |
4 | CL:0000127 | BFO:0000050 | CARO:0000006 | astrocyte | part of | material anatomical entity |
... | ... | ... | ... | ... | ... | ... |
299 | CL:0000127 | rdfs:subClassOf | CL:0000127 | astrocyte | None | astrocyte |
300 | CL:0000127 | rdfs:subClassOf | CL:0000255 | astrocyte | None | eukaryotic cell |
301 | CL:0000127 | rdfs:subClassOf | CL:0000548 | astrocyte | None | animal cell |
302 | CL:0000127 | rdfs:subClassOf | CL:0002319 | astrocyte | None | neural cell |
303 | CL:0000127 | rdfs:subClassOf | CL:0002371 | astrocyte | None | somatic cell |
304 rows × 6 columns
These are all guaranteed correct (according to the semantics of classes and object properties in the ontology).
They are not guaranteed useful. There are many trivial edges, e.g.
But these less useful ones will "fall out in the wash" when we use them in methods like semantic similarity.
We can query the EntailedEdge table more directly:
!echo "SELECT * FROM entailed_edge WHERE predicate='BFO:0000050' AND object LIKE 'UBERON:%' LIMIT 5" | sqlite3 $HOME/.data/oaklib/cl.db
UBERON:0001612|BFO:0000050|UBERON:0000477
UBERON:0001612|BFO:0000050|UBERON:0010000
UBERON:0001612|BFO:0000050|UBERON:0000055
UBERON:0001612|BFO:0000050|UBERON:0003509
UBERON:0001612|BFO:0000050|UBERON:0001637
The OAK SQL backend will use RG when calculating semantic similarity.
All semantic similarity measures then become trivial operations on RG, parameterizable by a semantic predicate.
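As a rough illustration of why this becomes trivial, the sketch below computes Jaccard similarity as plain set operations over a precomputed table of entailed edges, with the predicate list as a parameter. The edges and names are hypothetical toy data, not real CHEBI content, and the function is an assumption about the general technique rather than OAK's actual code.

```python
def ancestor_set(entailed_edges, term, predicates):
    """Objects of all entailed edges from `term` over the chosen predicates."""
    return {o for (s, p, o) in entailed_edges if s == term and p in predicates}


def jaccard(entailed_edges, a, b, predicates):
    """Size of the shared ancestor set divided by size of the combined set."""
    anc_a = ancestor_set(entailed_edges, a, predicates)
    anc_b = ancestor_set(entailed_edges, b, predicates)
    union = anc_a | anc_b
    return len(anc_a & anc_b) / len(union) if union else 0.0


# Hypothetical precomputed closure, including reflexive is_a edges.
entailed = {
    ("acid", "is_a", "acid"),
    ("acid", "is_a", "acid_class"),
    ("acid", "is_a", "entity"),
    ("base", "is_a", "base"),
    ("base", "is_a", "anion_class"),
    ("base", "is_a", "entity"),
    ("acid", "conjugate_acid_of", "base"),
    ("acid", "conjugate_acid_of", "anion_class"),
}

isa_score = jaccard(entailed, "acid", "base", {"is_a"})
broad_score = jaccard(entailed, "acid", "base", {"is_a", "conjugate_acid_of"})
print(isa_score, broad_score)
```

Because the closure is already materialized, no graph walking happens at query time; changing the predicate set only changes which rows are selected.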
Garbage in, Garbage out
Usually for most ontologies we care about in a project like Monarch, we have decent QC in place, and in general most methods should be resilient to this.