This notebook contains proofs of certain lemmas needed to ascertain that the search algorithm works correctly. They are referenced from the article that explains the algorithm:
Our search algorithm consists of several parts, the main ones of which are
We will make these terms precise now, and prove essential properties, leading to insight in the correctness of our algorithm, and the relative merits (and demerits) of it.
Let us define all these textile, woolly, fluffy concepts into hard formal language.
We have a dataset which is a graph consisting of text nodes and text edges, in short, nodes and edges. Nodes are textual objects, edges are relations between them.
Searching is: looking for patterns of nodes and edges. The pattern is a template, describing nodes to be instantiated, constrained by relations that correspond to the edges in the graph.
A query is a graph of query nodes and query edges. We refer to the query nodes as $$q_1, ... , q_n$$ and to the query edges as $$e_1, ... , e_k$$ where every $$e_j = (R_j, q_f, q_t)$$ for some $f$ and $t$ in $1, ..., n$, where $R_j$ is a relationship between text nodes. That means, if $s$ and $u$ are text nodes, $sR_ju$ is either true or false.
A query node $q_i$ consists of a node type of text nodes and a features dict. The features dict specifies any number of data features, with an allowed value or set of allowed values for each of them.
The set of all text nodes of a node type is called the fleece of that node type.
The fleece of a query node, is the fleece of the node type that is associated with that query node.
If we filter the fleece of a node by some process, we call the result set a yarn of that query node.
The thick yarn of a query node consists of the fleece of the query node, filtered by the features dict of the query node.
If we have an edge $e_j = (R_j, q_f, q_t)$ and if we have yarns $y_f, y_t$ for $q_f$, $q_t$, then we turn the edge $e_j$ to spin the yarns $y_f$, $y_t$ further.
The result of this spinning are two new yarns, $z_f, z_t$, obtained as follows:
we iterate through the text nodes $s$ in yarn $y_i$:
Otherwise we do not add nodes to $z_i$ nor to $z_j$.
So spinning means thinning out the yarns of two query nodes that are connected to a query edge, in such a way, that afterwards the relationship that is associated with the edge, can be realized for every text node in both yarns. In other words: after spinning, every text node in a yarn involved, has at least one counterpart text node in the other yarn such that both are in relationship which each other. That is, the relationship belonging to the edge.
So, every spinning action is caused by a turn of the edge, which acts as the spinning wheel.
A proper yarn of a query node is either its principal yarn or any yarn that results from turning edges between proper yarns. So if we start with the principal yarn of a query node, and start spinning, we produce proper yarns for that query node.
A result of a query is a list of text nodes $$s_1, ... , s_n$$ satisfying the following conditions:
If this is the case, we call text node $s_i$ a stitch for query node $q_i$.
A result can be seen as a stitching together of the principal yarns of the query nodes, in such a way, that all constraints specified by the query edges are satisfied. The stitches are those nodes in the text, one for each yarn, that together constitute a result.
If a yarn of a query node has become so thin, that it fails to contain stitches of of the result of the query, we call the yarn overspun.
If a yarn of a query node contains nodes that are not stitches of any result of the query, we call the yarn underspun.
We can now use this fabric language to formulate our query algorithm and prove essential properties of it.
Here we describe the starting condition for our algorithm, what happens at each iteration, and the conditions at termination.
Here is the preparation for running a sequence of iterations. In those iterations we turn the edges to spin yarns.
First of all, for every query node, we collect its fleece, and spin its principal yarn.
we compute edge constraints and filter the result sets of text nodes.
we collect all text nodes of the node type associated with the query node
(the fleece), and we apply the criteria defined by the features of the query node (from fleece to principal yarn).
We give every edge a the status: not up-to-date. The intention is to give an edge the status up-to-date if it has just been turned, and no turnings of other edges have since interfered with the yarns involved, so the yarns still reflect the effects of turning the edge.
Initially, no edge has been turned, so every edge starts out as not up-to-date.
When query edges are computed, yarns are spun, some edges become up-to-date, others get not up-to-date.
Here is when we stop turning the edges. We stop when continuing does not make sense anymore, and that happens if one of the following conditions occur. Whether we have correct and complete results if we stop, is something that remains to be seen. We'll prove it later.
If a yarn becomes empty, we can stop: there are no results at all. The combined turning of the edges has spun the yard into nothing: the thread has broken. Every yarn must be part of the result, so if a yarn breaks, the result is gone.
The constraints expressed by the query edges are such, that one of the query nodes
cannot be instantiated by a text node. Since every query result must instantiate all query nodes by a text node, we do not have a result.
If all edges are up-to-date, it does not make sense to turn edges anymore. So we stop.
Turning an up-to-date edge means recomputing the constraints posed by that edge
on node sets on which it has already been computed. There will be no effect.
The main questions are:
This fundamental question can be split into lesser ones.
The validity of the results delivered should not dependent on the strategy of selecting edges for computation. But it is easy to come up with a stupid strategy that will not terminate or not arrive at desired results. For example, if we always select the first edge, it is clear that this is in general a very unreliable strategy.
So we suppose that our strategy of edge selection for computation satisfies a minimal sanity requirement. We can formulate that exactly:
A strategy is thorough if it always selects a non-up-to-date edge if there is one.
This is really a minimal criterion, since selecting an up-to-date edge is always a waste of time. And if there are no up-to-date edges, it is time to stop anyway.
The algorithm as defined above does not deliver a list of results, results being stitches. It only delivers the yarns to be stitched.
The algorithm yields a list of text node sets
instantiations of all query nodes by text nodes in such a way that all individual query edges are satisfied. Which combinations of these text nodes constitute results, remains to be established.
Instead, we assert that after the process, under some conditions, every yarn is fully spun, so none of them is underspun and none of them is overspun. We also prove that in general none of the yards are overspun, but that we cannot rule out underspun yarns in general.
The claim is that every node set has only nodes that occur in the real results, and that
all nodes that occur in the real results, do occur in the proper node sets (the latter only under some conditions).
Under every thorough strategy, the iteration always terminates.
Proof:
If an edge is turned, no yarn becomes longer. It is possible, however, that no yarn shrinks.
For every turn of an edge there are two cases to consider:
a. No yarn becomes shorter:
Then the amount of up-to-date edges increases by one.
This is so, because the recently turned edge becomes up-to-date, and no other edges
become non-up-to-date.
b. One or more yarns become shorter.
Now suppose that that the sequence of edge turnings never stops.
Mark every step as a
if case a. applies and as b
if case b. applies.
Our infinite turning is then an infinite sequence of a
and b
.
In that sequence there cannot be infinitely many b
, because every b
corresponds to
a shortening of the total amount of yarn, and the total length of yarn is finite.
So there must be infinitely many a
.
That means that from some point onward, we see only a
and never a b
anymore.
But every a
decreases the number of non-up-to-date edges, so a sequence of a
is
bound to stop.
Ergo: the sequence cannot be infinite and the computation terminates.
Turning of edges does not cause overspinning: no yarn gets overspun.
Proof:
Suppose none of the yarns is overspun yet, and we turn an edge $e = (R_j, q_f, q_t)$ on yarns $y_f, y_t$ resulting in yarns $z_f, z_t$.
We assume $y_f$ and $y_t$ are not overspun.
Case a: Suppose $z_f$ is overspun.
(i) Then there must be a result text node $s$ of query node $q_f$ in $y_f$ that does not reappear in $z_f$.
Because $s$ is in the result, there must be a text node $u$ in the result of query node $q_t$, such that $sR_ju$.
That means that $u$ belongs to every yarn of $q_t$ that is not overspun. By hypothesis, $y_t$ is not overspun, so $u$ is member of $y_t$.
By following the definition of turning an edge, we see that the existence of this $u$ will cause $s$ to be added to $z_f$ which conflicts with (i).
So case a. does not happen.
Case b: Suppose $z_t$ is overspun.
(ii) Then there must be a result text node $u$ of query node $q_t$ in $y_t$ that does not reappear in $z_t$.
Because $u$ is in the result, there must be a text node $s$ in the result of query node $q_f$, such that $sR_ju$.
That means that $s$ belongs to every yarn of $q_f$ that is not overspun. By hypothesis, $y_f$ is not overspun, so $s$ is member of $y_f$.
By following the definition of turning an edge, we see that the existence of this $s$ will cause $u$ to be added to $z_t$ which conflicts with (ii).
No more cases: $z_f$ and $z_t$ are not overspun. QED
If, after some turning, a yarn becomes empty, the query as a whole has no results.
Proof:
If at some point a yarn, say $y_i$ becomes empty, we know by lemma 2, that it is still not overspun. That means that all nodes that occur in a result and that correspond with query node $q_i$, are contained in this empty yarn $y_i$.
Hence there are no results, QED.
A connected component of a graph is a subgraph satisfying two conditions:
Every graph can be divided into a number of disjunct connected components.
Proof:
See the literature on graph theory.
The results of a query is essentially the set consisting of the cartesian product of the query results of its connected component queries.
Proof:
We prove first:
If a query can be split into two subparts between which there are no edges, then the results of the whole query can be seen as the cartesian product of the results of the two subparts.
Proof:
Say the query is a graph $G$ consisting of $G_1 = (Q_1, E_1)$ and $G_2 = (Q_2, E_2)$, with no edges between $Q_1$ and $Q_2$ and with result sets $R_1$ and $R_2$.
Say there are $m_1$ nodes in $Q_1$ and $m_2$ in $Q_2$.
Then $R_1$ is a set of $m_1$ tuples of text nodes, and $R_2$ is a set of $m_2$ tuples of text nodes.
Every $r_1$ in $R_1$ is a $m_1$-tuple of text nodes satisfying $G_1$.
Every $r_2$ in $R_2$ is a $m_2$-tuple of text nodes satisfying $G_2$.
Then, for every combination of such an $r_1$ and $r_2$, the concatenation of $r_1$ and $r_2$ a $m_1+m_2$-tuple of text nodes.
The first $m_1$ text nodes instantiate the query nodes of $G_1$ and satisfy all edge constraints of $G_1$.
They do not have to satisfy the query nodes in $G_2$ in order to be valid for $G$ as a whole, because they do not correspond to them.
They are not influenced by the constraints of the edges in $G_2$, because these edges do not reach the nodes of $G_1$.
Analogous for the last $m_2$ nodes in such a combination.
Hence the combination is a result of the whole graph.
Conversely, every result of the whole graph is a tuple that can be decomposed in an $r_1$ which is a result of $G_1$ and an $r_2$ which is a result of $G_2$.
QED (lemma 4a)
Now we can prove by induction on the number of connected components of a query that its results can be seen as the cartesian product of the results of its connected components.
Case a:
The query consists of just one connected component: the lemma is trivially true.
Case b: The query consists of $n+2$ connected components, $n > 0$.
We assume by way of induction hypothesis that the lemma holds for all cases up to $n+1$.
Say $G = (G_1 + ... + G_{n+2}, E_1, ... , E_{n+2})$ with result set $R$, where each $(G_i, E_i)$ is a connected component with result set $R_i$.
Say $H = (G_1 + ... + G_{n+1}, E_1, ... , E_{n+1})$ then we know by induction hypothesis that the results of H are $R_1 \times ... \times R_{n+1}$.
Now $H$ and $G_{n+2}$ satisfy the conditions of lemma 4a, to the results of
$G = H + (G_{n+2}, E_{n+2})$
is $(R_1 \times ... \times R_{n+1}) \times R_{n+2}$
is $R_1 \times ... \times R_{n+1} \times R_{n+2}$
QED (lemma 4)
A cycle in a query is a list of query nodes connected by edges, where the first node is equal to the last node. Here the edges are taken in the directional sense.
Here is a graph with a cycle: $$G = (\{q_1, q_2\}, \{(R, q_1, q_2), (S, q_2, q_1)\})$$ But this is not a cycle: $$G = (\{q_1, q_2\}, \{(R, q_1, q_2), (S, q_1, q_2)\})$$
For queries consisting of just one connected component:
if the query does not contain cycles:
if, after some spinning, all edges are up-to-date, then all yarns are fully spun.
Proof: Because the yarns are never overspun by lemma 2, we must prove that when all edges are up-to-date, none of the yarns are underspun.
Suppose we are in the situation that all edges are up-to-date. Suppose that in this situation there is an underspun yarn $y_i$, associated with query node $q_i$. That means that $y_i$ contains a text node $s$ that does not belong to any final result.
Let us have a closer look at how $q_i$ lies hooked up in the query graph. Because the query is a connected component, we have only these possibilities:
Case a: $q_i$ does not belong to an edge.
In this case, $q_i$ must the only node, because if there were other nodes, there would have been bridges from $q_i$ to those other nodes, and hence $q_i$ would be involved in edges.
So, the query is just one node without edges, hence $y_i$ is the primary yarn, and hence all nodes in it are results. So this $s$ cannot exist.
Case b: $q_i$ has an out-going edge or an incoming edge.
For all $q_i$-outgoing edges $(R_j, q_i, q_t)$ it holds that there is a $u$ in the yarn $y_t$ of $q_t$ such that $sR_ju$. This is because all edges are up-to-date.
Like wise, for all $q_i$-incoming edges $(R_k, q_f, q_i)$ it holds that there is a $u$ in the yarn $y_f$ of $q_f$ such that $uR_ks$.
If all $u$s found in this way belonged to the result, then $s$ would also belong to the result, because all its edge constraints would be satisfied.
So, if, as we assumed, $s$ is not in the result, then at least one of those $u$s does not belong to the result.
Then we can repeat the same argument for this $u$, and find a $v$ that does not belong to the result.
Since there are finitely many nodes, we find a cycle $s_1, ... , s_k$ of text nodes in yarns $y_1, ... , y_k$ that do not belong to the result.
This contradicts the assumption. QED.
In order to show that the condition of no cycles in Lemma 5 cannot be missed, we show a small example of a graph with a cycle that will not become fully spun by the algorithm.
$$G = (\{q_1, q_2\}, \{(R, a, c), (R, b, d), S(c, b), S(d, a)\})$$with primary yarns $y_1, y_2$ for $q_1, q_2$ as follows:
$$y_1 = \{a, b\}$$$$y_2 = \{c, d\}$$If we spin the edges $R$ and $S$, nothing happens, because the elements $a, b \in y_1$ are happily in relationship $R$ with elements $c, d$ in $y_2$, and the elements $c, d \in y_2$ are happily in relationship $S$ with elements $a, b$ in $y_2$.
Yet, none of them are part of any result, because there are no results, because the equation
$$xRySx$$does not have any solutions.
The trouble with a cycle is, that it creates a long distance dependency in the final results that cannot be "felt" by spinning.
You can also have long distance dependencies by confluence, but these will be eventually solved by just spinning.
For example: if we have to solve both of the following
$$uRvSwTxUk$$$$aGbHcKmUk$$then we could solve both chains separately, and if both have a solution, we now that the combined solution is a complete solution. Hence, if there is no solution, one of the two will have no solution.
In case of the cycle, we are essentially trying to solve
$$xRySz$$$$x = z$$where $x = z$ is the long distance dependency.
Spinning is just part of the solution. It thins out the space in which solutions exist, and does a better job for graphs without cycles than for graphs with cycles.
But after spinning, we have to find the valid stitches between the yarns.
That is a separate search process. The challenge is to wade through all possibilities in an efficient way.
For graphs without cycles we know that if we start with any text node in any yarn, it will be part of a result. And if we start stitching that node to a node in the next yarn, we are sure that the pair is also part of a result. And so on. So in this case we can stitch arbitrarily (note that stitching implies that we hop from one yarn to the other by following the appropriate relationships between text nodes), and always generate results.
In this case, it does not matter much in what order we stitch, because we need to make all stitchings anyway to generate results.
If the graph does have cycles, then we will discover that certain partial stitchings will lead to no valid results. Because of that strategy becomes important, because if we are not smart, we could spend a lot of time in rejecting millions of stitchings, before arriving at the first valid stitching.
Here is an example, and let us become a bit more concrete.
Suppose our graph has nodes: $Sentence$, $Word1$, $Word2$, all without feature restrictions. So the primary yarns are all sentences, all words, all words, respectively. Suppose we have 100,000 sentences, all 10 words, so 1 million words in total.
Suppose our graph has edges:
word1
is in sentence
word2
is in sentence
word1
comes before word2
A solution is a tuple of text nodes $s$, $w_1$, $w_2$, such that $w_1$ is a word in sentence $s$, $w_2$ is a word in the same sentence $s$, and $w_1$ comes before $w_2$.
Let's start stitching by first picking word1
, then word2
, then sentence
.
We start to pick the first word $w_1$ in the yarn of word1
, which is all words>
Then we pick a word $w_2$ in the yarn of word2
, such that $w_2$ comes after $w_1$.
Note that we have 999,999 possibilities for $w_2$, of which only the first 10 are part of a result. All others are not in the same sentence.
So if we try out all these $w_2$s, we get 10 results fairly quickly, and then 999,990 spurious tries, before we try a new node in the yarn of $Word1$ and get new results.
By the time we have collected all results, we have visited a million times on average half a million words, or $1,000,000 \times 500,000 = 500,000,000,000$ words.
We could have done it in an other way: after picking $w_1$ from the yarn of $Word1$, we pick a sentence $s_1$ from the yarn of Sentence (only one possibility), and from there we pick a $w_2$ from the yarn of $Word2$ such that $w_2$ is in $s_1$ (only 10 possibilities). Of those 10 possibilities, one will be rejected, (where $w_2 = w_1$).
After this, we move to the next $w_1$ in the yarn of $Word1$, which is the second word in the same sentence, pick the same $s_1$ in the sentence yarn, pick the same words in the $Word2$ yarn, reject 2 of them, and deliver 8 results. And so on. When we have gone through the whole of $s_1$, we have tried 100 words, and rejected 50 of them. And so it goes for all sentences. In the end we find all 100,000 * 50 = 5,000,000 solutions by visiting 10,000,000 words.
This is a 50,000 fold improvement!
So how can we build a strategy from this observation?
The challenge is now to run the edges in an optimal sequence.
The basic intuition is this.
We are going to rank query nodes by how strong they have filtered their node type so far.
Then the query fraction $q(n)$ is the a proportion between the number of text nodes in the current result: and the total number of text nodes in the node type:
$$q(n) = {{|r_n|}\over {|o_n|}}$$Then we rank the edges by combined query fraction of the nodes:
$$ r(f,t) = q(f)^2 + q(t)^2$$By squaring both query fractions, we strongly give precedence to edges involving few results.
Consider the relationships by which we hop from one yarn to another, while stitching. Some of them are functional, in the sense that coming from node, they leave only one possibility for the next node in the stitching.
Take for example the relation: sentence of. Coming from a word, this relationship leaves us but one choice: the one sentence of which that word is a part.
Other relationships are virtually the opposite of functional: they leave very many options.
Take for example the relation: comes after between words. For any word you have on average the choice of half the total amount of words.
Some relationships that are not functional are still functional in the opposite direction.
Take for example: contains between sentences and words. A sentence contains multiple words, so contains is not functional. But going into the other direction, is the precisely the relation: sentence of, which is functional.
Between functional and non-functional relations there is a spectrum of functionalness. Let us call this notion the spread of a relation.
The spread of a relation $R$ is the average number of $t$ that satisfies the equation $fRt$ for each $f$.
In other words, for every from node, compute to how many to nodes $R$ brings you. Take the average, and that is your spread.
When stitching, we want to follow the query graph in such a way that we hop from yarn to yarn by edges with relations with lesser spread first.
This could be a way: