Information Retrieval Lab: Advanced Retrieval

In this exercise, we will implement a more advanced retrieval model. In contrast to the last exercise, where we explored vector space retrieval using TFIDF-vectors, in this notebook, we will take a look at a generative (language) model for retrieval.

Lecture Recall

First, recall some important concepts from the lecture:

Document representations

The representation $\pmb{d}$ of a document $d$ is a probability distribution over the index terms $T$, where the probability of the $i$-th term in $T$ indicates how likely it occurs in documents generated by the topics’ language model that also generated $\pmb{d}$.

Query representations

A query $q$ is represented as a sequence $\pmb{q}$ of index terms.

Relevance Function $\rho$

The relevance function formulated as a query likelihood model: $\rho(d,q) =P(\pmb{d}|\pmb{q})$

Setup

To implement a scoring model, we first import existing code from the previous exercises. We will make use of the Shakespeare document collection, the preprocessing component and the index. We supply a reference solution you can import, but feel free to continue using your own code!

Exercise: using the code from previous exercises, construct an index on top of the Shakespeare document collection.

In [1]:
from modules.shakespeare import corpus
from modules.preprocessing import PreprocessorSpacy as Preprocessor
from modules.indexing import Index

preprocessor = Preprocessor()
corpus_preprocessed = list(zip(
    [doc[0] for doc in corpus],
    map(preprocessor.preprocess, [doc[1] for doc in corpus])
))

index = Index(corpus_preprocessed)

Deriving the relevance function $\rho$

Let $d$ denote a language model for document, and $q$ the sequence of query terms from query. Then the query likelihood model is derived as follows, utilizing Bayes' rule:

$$ \rho = P(\pmb{d} | \pmb{q}) = \frac{P(\pmb{q}|\pmb{d})\cdot P(\pmb{d})}{P(\pmb{q})} $$

Here, $P(\pmb{q})$ can be omitted without affecting the relative ranking of all documents (i.e. omitting it influences all scores equally), and $P(\pmb{d})$ is assumed uniform, cancelling its influence.

The query representation $\pmb{q}$ can be written as sequence of $t_1, ..., t_{|q|}$ of index terms, which (under the assumption that terms are independent of each other) allows us to rewrite the equation as follows:

$$ \rho = P(q | \pmb{d}) = P(t_1, ..., t_{|q|} | \pmb{d}) = \prod_{i=1}^{|q|} P(t_i | \pmb{d})$$

This means that we can divide our implementation into four steps:

  1. Write a function to calculate the probability $P(t | d)$ for a single term
  2. Apply this function to all query terms
  3. Multiply the results to get the relevance score of a document
  4. Repeate 1-3 for all documents to rank them w.r.t the query

We can use maximum likelihood estimation to calculate the term probability:

$$ P(t | \pmb{d}) = \frac{\text{tf}(t,d)}{|d|}$$

However, this creates a problem: if a term $t$ is contained in the query, but does not occur in a document (i.e. $\text{tf}(t,d) = 0$), $P(t|\pmb{d}) = 0$. Since the individual term probabilities are multiplied, this results in zero relevance for all documents that only partially contain the query!

Solution: we also calculate the probability of the term based on the language model of the corpus, $D$, and combine both probabilities.

Document Probability

As first part to our relevance function, we implement the probability of a term based on a document language model $P(t|\pmb{d})$.

$$ P(t | \pmb{d}) = \frac{\text{tf}(t,d)}{|d|}$$

Exercise: implement a function to calculate the term probability give a document.

In [2]:
def document_probability(index, term, doc_id):
    """
    Calculates the conditional probability of a term give a document.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    """
    frequency = index.get_term_frequency(term, doc_id)
    doc_sum = 0
    for t in index.get_index_terms():
        doc_sum += index.get_term_frequency(t, doc_id)
    if doc_sum == 0:
        return 0
    else:   
        return frequency/doc_sum
    

Collection Probability

As first part to our relevance function, we implement the probability of a term based on a document language model $P(t|\pmb{D})$.

$$ P(t | \pmb{D}) = \frac{\sum_{d \in D}\text{tf}(t,d)}{\sum_{d \in D}|d|}$$

Exercise: implement a function to calculate the term probability give a document collection.

In [3]:
def collection_probability(index, term):
    """
    Calculates the conditional probability of a term give a document collection.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    """
    frequencies = []
    doc_sums = []
    
    for doc_id in index.get_document_ids():
        frequencies.append(index.get_term_frequency(term, doc_id))
        doc_sum = 0
        for t in index.get_index_terms():
            doc_sum += index.get_term_frequency(t, doc_id)
        doc_sums.append(doc_sum)
        
    return sum(frequencies)/sum(doc_sums)

Combining Probabilities

Jelinek-Mercer-Smoothing

Now that we can calculate both the document probability and the collection probability of a term $t$, we need a way of combining them into a single value, indicating the overall probability of the term. To start out, we implement Jelinek-Mercer smoothing, which is a linear interpolation between both probabilities:

$$P(t|\pmb{d}) = (1- \omega) \cdot P(t | \pmb{d}) + \omega\cdot P(t|\pmb{D})$$

where $\omega$ is a paramter that controls the relative influence of both terms on the result.

Note: we call the weigthing term $\omega$ in this exercise. In the lecture slides, this is denoted as $\lambda$, but since lambda is a reserved keyword in the Python language, we opt to rename it to avoid confusion.

Exercise: implement the overall term probability with Jelinek-Mercer smoothing.

In [4]:
def jm_term_probability(index, term, doc_id, omega):
    """
    Calculates the conditional probability of a term give a document using Jelinek-Mercer smoothing.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    :param omega: weigthing factor for the linear interpolation
    """
    p1 = document_probability(index, term, doc_id)
    p2 = collection_probability(index, term)
    return (1-omega) * p1 + omega * p2

Dirichlet-Smoothing

Instead of interpolating linearly between both probabilities, we can take a more fine-grained approach and base the interpolation factor on the length of the document: the longer it is, the more confidence is placed on the document probability, assuming more data = better prediction.

This substitutes the term $\omega$ with the following term:

$$\omega = \frac{\alpha}{|d| + \alpha}$$

where $\alpha$ is a parameter that indicates the boundary at which more weight is put towards the document probability or the collection probability.

Note: this method is called Dirichlet smoothing, since from a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter $\alpha$ as a prior.

Exercise: implement a weight function to calculate $\omega$ for a given document and $\alpha$ value.

In [5]:
def weight(index, doc_id, alpha):
    """
    Calculates the dirichlet smoothing weighting factor for a given document and alpha value
    :param index: index to get frequency data from
    :param doc_id: document to calculate the weight factor for
    :param alpha: alpha-prior for the dirichlet smoothing
    """
    doc_len = 0
    for term in index.get_index_terms():
        doc_len += index.get_term_frequency(term, doc_id)
    return alpha / (doc_len + alpha)

Exercise: implement the overall term probability with Dirichlet smoothing.

In [6]:
def dirichlet_term_probability(index, term, doc_id, alpha):
    """
    Calculates the conditional probability of a term give a document using Dirichlet smoothing.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    :param alpha: alpha-prior for the dirichlet interpolation
    """
    omega = weight(index, doc_id, alpha)
    p1 = document_probability(index, term, doc_id)
    p2 = collection_probability(index, term)
    return (1-omega) * p1 + omega * p2

Implementing Relevance Functions

Jelinek-Mercer Smoothing

Given the probability function for a single term $t$ using Jelinek-Mercer-Smoothing, we can now implement the full relevance scoring function. We will however utilize a slightly different formulation of the total probability by taking its logarithm:

$$ \rho = \log\prod_{i=1}^{|q|} P(t_i | \pmb{d}) = \sum_{i=1}^{|q|} \log P(t_i | \pmb{d})$$

The reason behind this is that the individual term probabilities can get extremely small. If we multiply them, we might run into an underflow: the number being too small to be represented by a 64-bit float. Therefore, utilizing the logarithm product rule, we can instead calculate the sum of the logarithm of the individual probabilities. Note that logarithmization yields negative relevance scores; recall that only the relative ranking among documents is important, not the scores themselves.

Exercise: implement the relevance function $\rho$ with Jelinek-Mercer smoothing. Use the sum of log probabilities as defined above.

In [7]:
from math import log

def jm_score(index, query, doc_id, omega):
    """
    Calculates the relevance of a document given a query using Jelinek-Mercer smoothing.
    :param index: index to get relevance data from
    :param query: query to calculate the relevance for
    :param doc_id: document to calculate the relevance for
    :param omega: omega paramter for Jelinek-Mercer smoothing
    """
    rho = 1
    for term in query:
        rho += log(jm_term_probability(index, term, doc_id, omega))
    return rho

Dirichlet Smoothing

Similarly, we can implement $\rho$ with Dirichlet smoothing, once again applying the logarithm.

Exercise: implement the relevance function $\rho$ with Dirichlet smoothing. Use the sum of log probabilities as defined above.

In [8]:
from math import log

def dirichlet_score(index, query, doc_id, alpha):
    """
    Calculates the relevance of a document given a query using Dirichlet smoothing.
    :param index: index to get relevance data from
    :param query: query to calculate the relevance for
    :param doc_id: document to calculate the relevance for
    :param alpha: alpha paramter for Dirichlet smoothing
    """
    rho = 1
    for term in query:
        rho += log(dirichlet_term_probability(index, term, doc_id, alpha))
    return rho

Querying the index

We now have all the components needed to answer queries. Write a wrapper function that takes a text and returns the (top $k$) results for the query. It construct a query representation and calculate the pairwise probability between the query and all documents, returning an ordered list of (doc_id, score) tuples, descending by score.

Jelinek-Mercer Smoothing

Exercise: implement a query function for Jelinek-Mercer smoothing. Successful parameters are $\omega= 0.1$ and $\omega$= 0.7 for short and long queries, respectively.

In [9]:
def jm_query(index, preprocessor, text, omega=0.1, topK=-1):
    """
    Queries a given text against the given index using a Jelinek-Mercer smoothed language model
    :param preprocessor: preprocessor instance to process the query with
    :param index: the index data to query against
    :param text: query text
    :param omega: weight parameter for Jelinek-Mercer smoothing
    :param topK: number of top results to return
    :return: list of (doc_id, score) tuples descending by score for all documents in the vector space
    """
    query = preprocessor.preprocess(text)
    scores = {}
    for doc_id in index.get_document_ids():
        scores[doc_id] = jm_score(index, query, doc_id, omega=omega)
        
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topK]
In [10]:
jm_query(index, preprocessor, "proceed", topK=5)
Out[10]:
[('Coriolanus', -0.203513772874222),
 ('Comedy of Errors', -1.1835784414003485),
 ('Troilus and Cressida', -7.890135107818841),
 ('Henry IV, Part II', -7.890135107818841),
 ('Much Ado About Nothing', -7.890135107818841)]

Dirichlet Smoothing

Exercise: implement a query function for Dirichlet smoothing. Typical parameters are $1000 < \alpha < 2000$.

In [11]:
def dirichlet_query(index, preprocessor, text, alpha=1000, topK=-1):
    """
    Queries a given text against the given index using a Dirichlet smoothed language model
    :param preprocessor: preprocessor instance to process the query with
    :param index: the index data to query against
    :param text: query text
    :param alpha: alpha-parameter for Dirichlet smoothing
    :param topK: number of top results to return
    :return: list of (doc_id, score) tuples descending by score for all documents in the vector space
    """
    query = preprocessor.preprocess(text)
    scores = {}
    for doc_id in index.get_document_ids():
        scores[doc_id] = dirichlet_score(index, query, doc_id, alpha=alpha)
        
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topK]
In [12]:
dirichlet_query(index, preprocessor, "proceed", topK=5)
Out[12]:
[('Coriolanus', -5.044738931143362),
 ('Comedy of Errors', -5.049711591812738),
 ('Hamlet', -5.587550014824796),
 ('Measure for Measure', -5.5885495151578795),
 ('Tempest', -5.5885495151578795)]