Winning at Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

Here are explanations of each column:

  • Show Number - the Jeopardy episode number
  • Air Date - the date the episode aired
  • Round - the round of Jeopardy
  • Category - the category of the question
  • Value - the number of dollars the correct answer is worth
  • Question - the text of the question
  • Answer - the text of the answer

And below is a sample and some information about the dataset:

In [1]:
# Import pandas module
import pandas as pd

# Read dataset into dataframe
jeopardy = pd.read_csv("jeopardy.csv")
#Show first 5 rows. 
jeopardy.head()
Out[1]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams

Before analysis can begin I need to normalize all of the data. For example, I need to make sure that all strings are in lower case and that numeric values are integers.

Normalize

In [2]:
# Import modules
import re
import datetime as dt

# Remove spaces from column titles
jeopardy.columns = jeopardy.columns.str.replace(" ",'')

# Function to remove punctuation and upper case
def normalize(value):
    value = value.lower()
    value = re.sub('[^A-Za-z0-9\s]', '', value)
    value = re.sub("\s+", " ", value)
    return value

# Create clean answer and question columns
jeopardy["clean_question"] = jeopardy['Question'].apply(normalize)
jeopardy["clean_answer"] = jeopardy['Answer'].apply(normalize)


# Clean value columns
def normalize_2(value):
    value = re.sub('[^A-Za-z0-9\s]', ' ', value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value

# Create clean value column
jeopardy["clean_value"] = jeopardy['Value'].apply(normalize_2)

# Change Air date column to datetime object
jeopardy['AirDate'] = pd.to_datetime(jeopardy["AirDate"])
# Show
jeopardy
Out[2]:
ShowNumber AirDate Round Category Value Question Answer clean_question clean_answer clean_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200
... ... ... ... ... ... ... ... ... ... ...
19994 3582 2000-03-14 Jeopardy! U.S. GEOGRAPHY $200 Of 8, 12 or 18, the number of U.S. states that... 18 of 8 12 or 18 the number of us states that tou... 18 200
19995 3582 2000-03-14 Jeopardy! POP MUSIC PAIRINGS $200 ...& the New Power Generation Prince the new power generation prince 200
19996 3582 2000-03-14 Jeopardy! HISTORIC PEOPLE $200 In 1589 he was appointed professor of mathemat... Galileo in 1589 he was appointed professor of mathemat... galileo 200
19997 3582 2000-03-14 Jeopardy! 1998 QUOTATIONS $200 Before the grand jury she said, "I'm really so... Monica Lewinsky before the grand jury she said im really sorry... monica lewinsky 200
19998 3582 2000-03-14 Jeopardy! LLAMA-RAMA $200 Llamas are the heftiest South American members... Camels llamas are the heftiest south american members... camels 200

19999 rows × 10 columns

As you can see above the data has now been normalized and I can begin to analyze.

The first thing I will check is to see how often the answer is included in the question. Perhaps it is a good stratergy to find the answer in the words of the question asked.

Is the answer in the question?

In [3]:
# Create function to find answers in questions
def answers_in_questions(row):
    # Split questions by word
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # Remove 'the'
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    # Loop through and add 1 for each repeated word
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Apply function to dataset
jeopardy["answer_in_question"] = jeopardy.apply(answers_in_questions, axis=1)
# Find mean
average = jeopardy["answer_in_question"].mean()
# Print
print("Probability that the answer is in the question: {}".format(average))
Probability that the answer is in the question: 0.059001965249777744

On average the answer is in the question less than 6% of the time. Not enough to help really. In Jeopardy you lose the value of the question if you get the answer wrong. Gambling on a 6% chance is not a winning stratergy.

The next thing I will check to see is if questions are repeated often. If they are then we could study the common ones and increase our chance of winning.

Are Questions Reused?

In [4]:
# Find questions that come up more than once. 
repeated_questions = jeopardy['clean_question'].value_counts()
repeated_questions = repeated_questions[repeated_questions >= 2]
repeated_questions
Out[4]:
audio clue                                                                                                          5
his pride had cast him out from heaven with all his host of rebel angels                                            2
these fell great oaks                                                                                               2
when it absolutely positively has to be there overnight                                                             2
adam levine                                                                                                         2
common in dixie a razorback is a wild one of these                                                                  2
1967 we rob banks                                                                                                   2
poi a luau treat is made from these mashed roots                                                                    2
in nicolais opera the merry wives of windsor this fat funny rogue gets dumped into the river in a laundry basket    2
Name: clean_question, dtype: int64

Out of 20,000 questions only 9 have been repeated. One of these is titled 'audio clue' and some of these are quite criptic and require added context by knowing the topic of the round. This therefore does not help us. However, although the question may not repeated word for word perhaps there are complex words (6+ characters) that pop up in many questions:

In [5]:
# Probability that complex words are in older questions.
question_overlap = []
terms_used = set()
jeopardy.sort_values("AirDate")

# Loop through and find words from previous questions
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q)> 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
    
# Apply function
jeopardy["question_overlap"] = question_overlap
average_2 = jeopardy["question_overlap"].mean()
print("Probability that complex words are in past questions: {}".format(average_2))
Probability that complex words are in past questions: 0.6908737315671878

Nearly 70% is encouraging but it should be pointed out that these are just words and not phrases. It can therefore be difficult to assertain the context of how the word is used. However, the 70% is encouraging enough to look in more detail. Let's dig a little deeper.

Another column that we have is the value column. Each question is worth a different amount. I am now going to see if the high value questions (800 +) are more likely to have words that have been used in past questions:

High Value or Low Value

In [6]:
# Create column to define high or low value
def find_value(value, axis=1):
    if value > 800:
        return 1
    else:
        return 0

# Apply function
jeopardy["high_value"] = jeopardy['clean_value'].apply(find_value)
    
In [7]:
# Words in high and low value questions
def usage(value):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        clean_question = row[7].split()
        if value in clean_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

from random import sample

# Find expected proportion of words in past questions
comparison_terms = sample(terms_used, 10)
observed_expected = []
for word in comparison_terms:
    observed_expected.append(usage(word))
    
observed_expected
Out[7]:
[[0, 2],
 [1, 4],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 2],
 [0, 2],
 [0, 1],
 [0, 1],
 [0, 1]]
In [8]:
import numpy as np
from scipy.stats import chisquare

# Filter high and low values
high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])

chi_squared = []

# Find chi value and p value
for high, low in observed_expected:
    total = high + low
    total_prop = total / len(jeopardy)
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count
    
    observed = np.array([high, low])
    expected = np.array([high_expected, low_expected])
    
    chi_values = chisquare(observed, expected)
    chi_squared.append(chi_values)
    
chi_squared
Out[8]:
[Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.06325251982741063, pvalue=0.8014271475031749),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378)]

None of the P values are below 5% so there is nothing statistically significant here. High value questions are no more likely to repeat than low value question.

The next step is to take a closer look at the most common complex words. I am going to find the complex words that appear in over 1% of questions.

Popular Qustion Topics

In [9]:
value_counts = {}

# Find common words in questions
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q)> 5]
    for word in split_question:
        if word in value_counts:
            value_counts[word] += 1
        else:
            value_counts[word] = 1
        
# Only words above 1%
above_200 = {}
for key, value in value_counts.items():
    if value >= 200:
        above_200[key] = round((value / 20000) * 100, 3)
        
        
        
above_200
Out[9]:
{'president': 1.29,
 'island': 1.08,
 'country': 2.38,
 'targetblankherea': 1.22,
 'called': 2.605,
 'french': 1.215,
 'famous': 1.23,
 'capital': 1.285,
 'became': 1.435,
 'played': 1.485,
 'before': 1.335,
 'american': 1.285}

A few topics that a jeopady conestant could study are presidents, countries and capitals.

Let's use the same technique to check out the categories column:

In [10]:
# Common categories
value_counts = {}

for word in jeopardy["Category"] :
    if word in value_counts:
        value_counts[word] += 1
    else:
        value_counts[word] = 1
        
# Filter common categories
above_30 = {}
for key, value in value_counts.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)
        
        
        
above_30
Out[10]:
{'HISTORY': 0.2,
 'IN THE DICTIONARY': 0.155,
 'TRAVEL & TOURISM': 0.15,
 'SCIENCE': 0.175,
 'WORD ORIGINS': 0.19,
 'TELEVISION': 0.255,
 'ANNUAL EVENTS': 0.16,
 'MEDICINE': 0.15,
 'WORLD GEOGRAPHY': 0.165,
 'AMERICAN HISTORY': 0.2,
 'POTPOURRI': 0.15,
 'SCIENCE & NATURE': 0.175,
 'AUTHORS': 0.195,
 'MAGAZINES': 0.175,
 'U.S. GEOGRAPHY': 0.25,
 'RHYME TIME': 0.175,
 'BIRDS': 0.155,
 'FICTIONAL CHARACTERS': 0.155,
 'SPORTS': 0.18,
 'LITERATURE': 0.225,
 'WORLD CAPITALS': 0.185,
 'ISLANDS': 0.15,
 'HISTORIC NAMES': 0.16,
 'BEFORE & AFTER': 0.2,
 'OPERA': 0.15,
 'BODIES OF WATER': 0.18,
 'WORLD HISTORY': 0.16,
 'U.S. PRESIDENTS': 0.15}

Once again we have capitals and presidents.

Notice that some of these have repeating words. For example, we have history and American history. Let's see which words pop up in the categories most common:

In [11]:
# Common words in categories
word_count = {}
for words in value_counts.keys():
    split_categories = words.split()
    split_categories = [w for w in split_categories if len(w) > 4]
    for word in split_categories:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
            

# Filter common words in categories
above_30 = {}
for key, value in word_count.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)

above_30
Out[11]:
{'HISTORY': 0.19,
 'WORDS': 0.32,
 'SPORTS': 0.165,
 'WORLD': 0.3,
 'MOVIES': 0.165,
 'STATE': 0.155,
 'MUSIC': 0.155,
 'MOVIE': 0.16,
 'CENTURY': 0.295}

The best categories to study are listed above.

Jeopardy also has different rounds. The most important round by far is final jeopardy where contestants have the chance to double their money. Let's see if there are any categories that come up a lot in the final jeopardy round:

In [12]:
# Filter final round questions
final_jeopardy = jeopardy[jeopardy["Round"] == "Final Jeopardy!"]
# Categories that appear more than once
counts = final_jeopardy['Category'].value_counts()
counts = counts[counts > 2]
counts
Out[12]:
WORD ORIGINS           8
U.S. PRESIDENTS        5
AUTHORS                4
FAMOUS NAMES           4
AMERICAN LITERATURE    3
POETS                  3
SPACE EXPLORATION      3
WORLD CITIES           3
ASIA                   3
WORLD GEOGRAPHY        3
U.S. STATES            3
ARTISTS                3
WORLD LEADERS          3
SCIENTISTS             3
THE 50 STATES          3
FAMOUS WOMEN           3
Name: Category, dtype: int64

Word orgins is up there again as are other common categories and questions we have seen before like presidents.

Final Thoughts

  • The answer is only included in the question 6% of the time. Using words from the question to find the answer is an unhelpful technique.
  • Questions are almost never repeated word for word.
  • Words pop up in old questions 70% of the time. There does seem to be some overlap in at least the type of questions being asked even if they are not word for word.
  • The value of the question does not affect how likely is is to have been repeated before. My best advice to a Jeopady contestant.
  • Judging by common words in past questions and the categories column I would suggest studying U.S presidents, world capitals, countries and word origins. History and geography are popular topics but are a little broad. Making sure that you know the names of all the presidents and capitals is doable and could come in very helpful.