Information Retrieval Lab: Evaluation

In this exercise, we will conduct a laboratory experiment to estimate the effectiveness of different retrieval systems. The goal is to determine, which system fairs best on a given document collection and retrieval task.

Lecture Recall

First, recall some important concepts from the lecture:

Document Collection (corpus)

A “representative” sample of documents from the domain of interest. The sampling method ofhow documents are drawn from the population determines a corpus’s validity.

Information Needs (topics)

Formalized, written descriptions of users’ tasks, goals, or gaps of knowledge. Alternatively,descriptions of desired search results. Often accompanied by specific queries the users(would) have used to search for relevant documents.

Relevance Judgements

Pairs of topics and documents, where each document has been manually assessed withrespect to its relevance to its associated topic. Ideally, the judgments are obtained from thesame users who supplied the topics, but in practice judgments are collected from thirdparties. Judgments may be given in binary form, or on a Likert scale.

This setup is sometimes referred to as an experiment under the Cranfield paradigm, in reference to Cyril Cleverdon’s series of projects at the Cranfield University in the 1960s, which first used this evaluation methodology. In this exercise, we will rely on the same document set, topics and relevance judgements, known as the Cranfield collection.

The collection is provided as zip file alongside this exercise. Go ahead and unpack it somewhere. Update the path below to point to the correct location.

In [1]:
# Relative path to the cranfield collection
DATA_PATH = "../data/cran/"
In [2]:
import re
import pandas as pd

from modules.system import RetrievalSystem
from modules.models import DirichletLM, JelinekMercerLM

Parsing the Document Collection

The first step is to parse the document collection, given in the cran.all.1400 and encompassing 1400 documents. Each document is characterized by a document ID, a title, an author, a bibliographical reference and its text. Since the data is originally from the 60s, the file format is quite arcane. Therefore, as a first task, write a parset to translate the 1400 document each into a dictionary that can be used by python.

Exercise: parse the document collection from the cran.all.1400 file.

In [3]:
with open(DATA_PATH+"cran.all.1400", "r") as file:
    # Your code here
  File "<ipython-input-3-50a3863ccae4>", line 2
    # Your code here
IndentationError: expected an indented block

Now, transform it such that it fits the [(id, text), ...] corpus format you already know from the shakespeare corpus.

Exercise: transform the document to the standardized corpus data structure.

In [ ]:
# Your code here

Creating a Retrieval System

Next, we have to create different retrieval systems that we want to compare. For this exercise, we will compare the models from last week: DirichletLM and JelinkeMercerLM, to see how much of a difference Dirichlet smoothing really brings. However, feel free to also include other retrieval models, such as TF-IDF, or more parameter configurations of the aforementioned.

The three python files distributed alongside this exercise contain an optimized implementation of the index and models, as well as a RetrievalSystem wrapper class to easily create different retrieval systems. You can of course also use your own implementations. Be aware though, that parsing all queries might take a long time!

Exercise: create a Dirichlet retrieval system.

In [ ]:
# Your code here

Exercise: create a Jelinek-Mercer retrieval system.

In [ ]:
# Your code here

Parsing Queries

Now that our retrieval systems each indexed the data and are ready to receive queries, we need a set of representative topics to test them on. The file cran.qry includes a set of topics, each characterized by a number (topic ID) and query (text to be entered into the system. The file format is similar to the document collection file. Parse into a python data structure!

Exercise: parse the query set from the cran.qry file.

In [ ]:
with open(DATA_PATH+"cran.qry", "r") as file:
    # Your code here

Creating a Run

A "run" is the retrieval result of a system on a pre-defined set of documents and queries. By creating runs for different parameter configurations or systems on the same collection and query set, comparative evaluation becomes possible. Runs are created and saved in so-called "run files", a plain-text format denoting a retrieval result.

Run files follow a standardized format: a csv-like file with the following columns, separated by a space character.

qid Q0 doc rank score tag

The meaning of each column is:

qid: The topic number.

Q0: Unused, should always be "Q0".

doc: The document ID (the official ID) returned by your system for the topic qid.

rank: The rank the document is retrieved at.

score: The score (integer or floating point) that generated the ranking. The score must be in descending (non-increasing) order: it is important to handle tie scores.

tag: A tag that uniquely identifies the run.

The fields should be separated by a whitespace. The individual columns' widths are not restricted (i.e., score can be an arbitrary precision that has no ties) but it is important to include all columns and to separate them with a whitespace. Note that the file does not include a column name header.

An example run file could contain the following:

1 Q0 Sf9294c83-Af186e851 1 17.89 MyRetrievalMethod
1 Q0 Sf9294c83-A9a4e056e 2 16.43 MyRetrievalMethod
1 Q0 S96f2396e-Aaf079b43 3 16.42 MyRetrievalMethod

Exercise: conduct a run for each of the retrieval systems created above. Save each result to its own runfile, using the file format described above.

In [ ]:
# Run only 20 queries for now. You can expand this number (there are over 200 topics in the Cranfield collection), 
# but it might take a long time to run
num_queries = 20
In [ ]:
# Run for the DirichletLM system
# Your code here
In [ ]:
# Run for the JelinekMercerLM system
# Your code here

Parsing Relevance Judgements

To evaluate the retrieval performance of a run, we need ground truth data, i.e. human judgement labels that indicate the "true" relevancy of the documents each system deemed relevant. For the Cranfield collection, judgements are available in the cranqrel file. Here, each line containes the following information: topic id, document id and annotated relevance.

Note: the process of how relevance judgements are compiled is also interesting, but too complex for the scope of this exercise. You can read more about it in the lecture slides.

Exercise: parse the cranqrel file into a python data structure.

In [ ]:
with open(DATA_PATH+"cranqrel", "r") as file:
    # Your code here

Since the 60s, the file format for relevance judgements was also standardized (thankfully). To make the Cranfield judgements parseable for the evaluation library we will rely on in the next step, it needs to follow that format. It is similar to the run file format, with the following columns:

qid 0 doc rel

The meaning of each column is:

qid: The topic number.

0: Unused, always 0.

doc: The document ID.

rel: The relevance judgment, a numerical value that denotes how relevant a human deemed this document to the topic.

The fields should be separated by a whitespace. Note that the file does not include a column name header.

Exercise: save the parsed relevance judgements in the standardized file format.

In [ ]:
# Your code here

Evaluating Results

We now have all the components we need to evaluate system performance. We will rely on the Normalized Discounted CumulativeGain (nDCG) measure. It is computed on the ranked list of documents included in a run file and is formalized as follows:

$$nDCG@k = \frac{DCG@k}{IDCG@k}$$

$DCG@k$ corresponds to the Discounted cumulative gain at rank $k$, given by the following formula:

$$DCG@k = \sum_{i=1}^k\frac{2^{r(d_i)}-1}{log_2(1+i)}$$

Here, $r(d_i)$ is there relevance of the document at rank $i$. The IDCG@k is the maximum obtainable DCG at rank $k$, i.e. the DCG you would achieve if documents were ordered descending by their relevance, instead of the order given to them by the retrieval system.

Feel free to implement the measure yourself. However, you can also rely on the excellent trectools package which can read run files, relevance judgment files, and compute nDCG for you.

Exercise: evaluate both tested system using either your own nDCG implementation with the given relevance judgements, or use the trectools package.

In [ ]:
import trectools

# Your code here