Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.
Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.
The dataset is named jeopardy.csv and contains 20,000 rows from the beginning of a full dataset of Jeopardy questions, which is available for download.
Below is a sample of the dataset:
# Import pandas module
import pandas as pd
# Read dataset into dataframe
jeopardy = pd.read_csv("jeopardy.csv")
# Show first 5 rows
jeopardy.head()
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
Before analysis can begin I need to normalize all of the data. For example, I need to make sure that all strings are in lower case and that numeric values are integers.
# Import modules
import re
import datetime as dt
# Remove spaces from column titles
jeopardy.columns = jeopardy.columns.str.replace(" ",'')
# Function to remove punctuation and upper case
def normalize(value):
    value = value.lower()
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    value = re.sub(r"\s+", " ", value)
    return value
# Create clean answer and question columns
jeopardy["clean_question"] = jeopardy['Question'].apply(normalize)
jeopardy["clean_answer"] = jeopardy['Answer'].apply(normalize)
# Clean value columns
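def normalize_2(value):
    # Strip "$" and "," so that values such as "$2,000" still parse as integers
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    try:
        value = int(value)
    except Exception:
        # Non-numeric values are treated as 0
        value = 0
    return value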
# Create clean value column
jeopardy["clean_value"] = jeopardy['Value'].apply(normalize_2)
# Change Air date column to datetime object
jeopardy['AirDate'] = pd.to_datetime(jeopardy["AirDate"])
# Show
jeopardy
| | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19994 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $200 | Of 8, 12 or 18, the number of U.S. states that... | 18 | of 8 12 or 18 the number of us states that tou... | 18 | 200 |
19995 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $200 | ...& the New Power Generation | Prince | the new power generation | prince | 200 |
19996 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $200 | In 1589 he was appointed professor of mathemat... | Galileo | in 1589 he was appointed professor of mathemat... | galileo | 200 |
19997 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $200 | Before the grand jury she said, "I'm really so... | Monica Lewinsky | before the grand jury she said im really sorry... | monica lewinsky | 200 |
19998 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $200 | Llamas are the heftiest South American members... | Camels | llamas are the heftiest south american members... | camels | 200 |
19999 rows × 10 columns
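As a quick sanity check, the normalization steps can be tried on a few hand-written strings. This is only a minimal sketch: the helper names `clean_text` and `parse_value` are hypothetical stand-ins rather than the notebook's own functions, and the sample strings are made up for illustration.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def parse_value(raw):
    """Turn a dollar string into an int; fall back to 0 if it has no digits."""
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else 0

# Made-up inputs that mimic the Question/Answer and Value columns
cleaned = [clean_text(s) for s in ["McDonald's", "No. 2:  1912 Olympian;"]]
values = [parse_value(s) for s in ["$200", "$2,000", "None"]]

print(cleaned)
print(values)
```

Checking a handful of edge cases like these before running the full analysis makes it easier to spot values that silently fall back to 0.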
As you can see above, the data has now been normalized, and I can begin the analysis.
The first thing I will check is how often the answer is included in the question. Perhaps it is a good strategy to look for the answer in the words of the question itself.
# Create function to find answers in questions
def answers_in_questions(row):
    # Split questions by word
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # Remove 'the'
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    # Loop through and add 1 for each repeated word
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
# Apply function to dataset
jeopardy["answer_in_question"] = jeopardy.apply(answers_in_questions, axis=1)
# Find mean
average = jeopardy["answer_in_question"].mean()
# Print
print("Probability that the answer is in the question: {}".format(average))
Probability that the answer is in the question: 0.059001965249777744
On average, only about 6% of the words in an answer also appear in the question, which is not enough to help. In Jeopardy you lose the value of the question if you get the answer wrong, so gambling on a 6% chance is not a winning strategy.
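To make the overlap score concrete, here is a small self-contained sketch of the same idea applied to two invented answer/question pairs. The function name `answer_overlap` and both example pairs are assumptions for illustration only.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring one 'the') found in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Invented examples: one answer partly present in its question, one absent
examples = [
    ("the white house", "this house on pennsylvania avenue is home to the president"),
    ("arizona", "the city of yuma is in this state"),
]
scores = [answer_overlap(answer, question) for answer, question in examples]
print(scores)
```

A score of 0.5 means half of the answer's words appear in the question; the dataset-wide mean of roughly 0.06 shows how rare that is.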
Next, I will check whether questions are repeated often. If they are, we could study the common ones and increase our chance of winning.
# Find questions that come up more than once.
repeated_questions = jeopardy['clean_question'].value_counts()
repeated_questions = repeated_questions[repeated_questions >= 2]
repeated_questions
audio clue    5
his pride had cast him out from heaven with all his host of rebel angels    2
these fell great oaks    2
when it absolutely positively has to be there overnight    2
adam levine    2
common in dixie a razorback is a wild one of these    2
1967 we rob banks    2
poi a luau treat is made from these mashed roots    2
in nicolais opera the merry wives of windsor this fat funny rogue gets dumped into the river in a laundry basket    2
Name: clean_question, dtype: int64
Out of 20,000 questions, only 9 have been repeated. One of these is simply 'audio clue', and some of the others are quite cryptic and only make sense given the category of the round. This therefore does not help us. However, although questions are rarely repeated word for word, perhaps there are complex words (6+ characters) that pop up in many questions:
# Probability that complex words are in older questions.
question_overlap = []
terms_used = set()
# Sort chronologically (sort_values returns a new dataframe, so reassign it)
jeopardy = jeopardy.sort_values("AirDate")
# Loop through and find words from previous questions
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
# Store the overlap for each question in a new column
jeopardy["question_overlap"] = question_overlap
average_2 = jeopardy["question_overlap"].mean()
print("Probability that complex words are in past questions: {}".format(average_2))
Probability that complex words are in past questions: 0.6908737315671878
Nearly 70% is encouraging, but it should be pointed out that these are just single words and not phrases, so it can be difficult to ascertain the context in which a word is used. Still, 70% is encouraging enough to dig a little deeper.
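The overlap calculation is easier to follow on a tiny example. The sketch below assumes the questions are already cleaned and listed in chronological order; the function name `term_overlap` and the three questions are invented for illustration.

```python
def term_overlap(questions, min_length=6):
    """For each question, the share of long words already seen in earlier ones."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = sum(1 for w in words if w in seen)
        overlaps.append(matches / len(words) if words else 0)
        # Only add the words after scoring, so a question never matches itself
        seen.update(words)
    return overlaps

# Invented questions, oldest first
questions = [
    "this president signed the treaty",
    "the capital of this country",
    "this president visited the capital",
]
overlaps = term_overlap(questions)
print(overlaps)
```

The first questions always score 0 because nothing has been seen yet, which is one reason the order of the rows matters for this measure.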
Another column that we have is the value column. Each question is worth a different amount. I am now going to see if the high-value questions (worth more than $800) are more likely to contain words that have been used in past questions:
# Create column to define high or low value
def find_value(value):
    if value > 800:
        return 1
    else:
        return 0
# Apply function
jeopardy["high_value"] = jeopardy['clean_value'].apply(find_value)
# Words in high and low value questions
def usage(value):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        # Look the column up by name rather than by position
        clean_question = row["clean_question"].split()
        if value in clean_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]
from random import sample
# Randomly pick 10 terms and count how often each appears in high and low value questions
# (sample() needs a sequence, so the set is converted to a list first)
comparison_terms = sample(list(terms_used), 10)
observed_expected = []
for word in comparison_terms:
    observed_expected.append(usage(word))
observed_expected
[[0, 2], [1, 4], [0, 1], [1, 0], [0, 1], [0, 2], [0, 2], [0, 1], [0, 1], [0, 1]]
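To make the counting step concrete, here is a minimal sketch on made-up data. The function name `count_usage`, the `(question, value)` tuple layout, and the four rows are all assumptions for illustration; only the "more than 800" cut-off comes from the analysis above.

```python
def count_usage(word, rows, threshold=800):
    """Return [high_count, low_count] for the questions that contain `word`."""
    high_count = 0
    low_count = 0
    for question, value in rows:
        if word in question.split():
            if value > threshold:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

# Invented (clean question, clean value) pairs
rows = [
    ("this president signed the treaty", 1000),
    ("this president was born in virginia", 200),
    ("the capital of this country", 400),
    ("this president visited the capital", 800),
]
president_usage = count_usage("president", rows)
capital_usage = count_usage("capital", rows)
print(president_usage, capital_usage)
```

Note that a value of exactly 800 counts as low here, because the comparison is strictly greater than the threshold.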
import numpy as np
from scipy.stats import chisquare
# Filter high and low values
high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])
chi_squared = []
# Find chi value and p value
for high, low in observed_expected:
    total = high + low
    total_prop = total / len(jeopardy)
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count
    observed = np.array([high, low])
    expected = np.array([high_expected, low_expected])
    chi_values = chisquare(observed, expected)
    chi_squared.append(chi_values)
chi_squared
[Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.06325251982741063, pvalue=0.8014271475031749),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378)]
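None of the p-values are below 5%, so there is nothing statistically significant here: for these words, high-value questions are no more likely to repeat them than low-value questions. Note that the observed counts are very small, so the chi-squared test has little power on this sample.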
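For readers who want to see what the test computes, the sketch below works through one chi-squared calculation by hand using only the standard library. The group sizes (250 high-value, 750 low-value questions) and the observed counts are hypothetical, and `chi_squared_two_groups` is an illustrative name, not a SciPy function.

```python
import math

def chi_squared_two_groups(observed, group_sizes):
    """Chi-squared statistic and p-value (1 degree of freedom) for two groups."""
    total_observed = sum(observed)
    total_size = sum(group_sizes)
    # Expected counts if the word were spread in proportion to group size
    expected = [total_observed * size / total_size for size in group_sizes]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical: a word seen 3 times in high-value and 7 times in low-value questions
statistic, p_value = chi_squared_two_groups([3, 7], [250, 750])
print(statistic, p_value)
```

Here the expected counts are 2.5 and 7.5, so observing 3 and 7 is very close to what chance alone would produce, and the p-value is correspondingly large.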
The next step is to take a closer look at the most common complex words. I am going to find the complex words that appear in over 1% of questions.
value_counts = {}
# Find common words in questions
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    for word in split_question:
        if word in value_counts:
            value_counts[word] += 1
        else:
            value_counts[word] = 1
# Only words above 1%
above_200 = {}
for key, value in value_counts.items():
    if value >= 200:
        above_200[key] = round((value / 20000) * 100, 3)
above_200
{'president': 1.29, 'island': 1.08, 'country': 2.38, 'targetblankherea': 1.22, 'called': 2.605, 'french': 1.215, 'famous': 1.23, 'capital': 1.285, 'became': 1.435, 'played': 1.485, 'before': 1.335, 'american': 1.285}
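A few topics that a Jeopardy contestant could study are presidents, countries and capitals. (The entry 'targetblankherea' is leftover HTML link markup from the question text, not a real word.)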
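The same word count can also be written more compactly with `collections.Counter`. This is a sketch on invented questions, not the notebook's code: the function name `frequent_long_words` and the `min_share` parameter are assumptions, with `min_share` playing the role of the 1% cut-off.

```python
from collections import Counter

def frequent_long_words(questions, min_length=6, min_share=0.5):
    """Long words whose total count is at least min_share of the question count."""
    counts = Counter(
        word
        for question in questions
        for word in question.split()
        if len(word) >= min_length
    )
    threshold = min_share * len(questions)
    return {word: n for word, n in counts.items() if n >= threshold}

# Invented, already-cleaned questions
questions = [
    "this president was born in virginia",
    "the capital of this country is paris",
    "this country elected a president in 1960",
    "the capital is named for a president",
]
common_words = frequent_long_words(questions)
print(common_words)
```

With four questions and `min_share=0.5`, a word needs at least two occurrences to be kept, which mirrors the idea of keeping words above a fixed share of the dataset.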
Let's use the same technique to check out the categories column:
# Common categories
value_counts = {}
for word in jeopardy["Category"]:
    if word in value_counts:
        value_counts[word] += 1
    else:
        value_counts[word] = 1
# Filter common categories
above_30 = {}
for key, value in value_counts.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)
above_30
{'HISTORY': 0.2, 'IN THE DICTIONARY': 0.155, 'TRAVEL & TOURISM': 0.15, 'SCIENCE': 0.175, 'WORD ORIGINS': 0.19, 'TELEVISION': 0.255, 'ANNUAL EVENTS': 0.16, 'MEDICINE': 0.15, 'WORLD GEOGRAPHY': 0.165, 'AMERICAN HISTORY': 0.2, 'POTPOURRI': 0.15, 'SCIENCE & NATURE': 0.175, 'AUTHORS': 0.195, 'MAGAZINES': 0.175, 'U.S. GEOGRAPHY': 0.25, 'RHYME TIME': 0.175, 'BIRDS': 0.155, 'FICTIONAL CHARACTERS': 0.155, 'SPORTS': 0.18, 'LITERATURE': 0.225, 'WORLD CAPITALS': 0.185, 'ISLANDS': 0.15, 'HISTORIC NAMES': 0.16, 'BEFORE & AFTER': 0.2, 'OPERA': 0.15, 'BODIES OF WATER': 0.18, 'WORLD HISTORY': 0.16, 'U.S. PRESIDENTS': 0.15}
Once again we have capitals and presidents.
Notice that some of these share words. For example, we have HISTORY and AMERICAN HISTORY. Let's see which words pop up most often in the category names:
# Common words in categories
word_count = {}
# Each distinct category name is counted once, however many questions it has
for words in value_counts.keys():
    split_categories = words.split()
    split_categories = [w for w in split_categories if len(w) > 4]
    for word in split_categories:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
# Filter common words in categories
above_30 = {}
for key, value in word_count.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)
above_30
{'HISTORY': 0.19, 'WORDS': 0.32, 'SPORTS': 0.165, 'WORLD': 0.3, 'MOVIES': 0.165, 'STATE': 0.155, 'MUSIC': 0.155, 'MOVIE': 0.16, 'CENTURY': 0.295}
The words above appear in the largest number of distinct category names, so they point to the best themes to study.
Jeopardy also has different rounds. The most important round by far is Final Jeopardy, where contestants have the chance to double their money. Let's see if there are any categories that come up a lot in the Final Jeopardy round:
# Filter final round questions
final_jeopardy = jeopardy[jeopardy["Round"] == "Final Jeopardy!"]
# Categories that appear more than twice
counts = final_jeopardy['Category'].value_counts()
counts = counts[counts > 2]
counts
WORD ORIGINS           8
U.S. PRESIDENTS        5
AUTHORS                4
FAMOUS NAMES           4
AMERICAN LITERATURE    3
POETS                  3
SPACE EXPLORATION      3
WORLD CITIES           3
ASIA                   3
WORLD GEOGRAPHY        3
U.S. STATES            3
ARTISTS                3
WORLD LEADERS          3
SCIENTISTS             3
THE 50 STATES          3
FAMOUS WOMEN           3
Name: Category, dtype: int64
Word origins is up there again, as are other common categories and topics we have seen before, such as presidents.
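The filter-then-count step for the final round can be sketched without pandas as well. The rows below are invented and the function name `common_final_categories` is an assumption; only the round label "Final Jeopardy!" is taken from the dataset.

```python
from collections import Counter

def common_final_categories(rows, min_count=2):
    """Categories that occur at least min_count times in the final round."""
    counts = Counter(
        row["Category"] for row in rows if row["Round"] == "Final Jeopardy!"
    )
    # most_common() returns (category, count) pairs, highest count first
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]

# Invented rows with only the two columns needed here
rows = [
    {"Round": "Jeopardy!", "Category": "HISTORY"},
    {"Round": "Final Jeopardy!", "Category": "WORD ORIGINS"},
    {"Round": "Final Jeopardy!", "Category": "U.S. PRESIDENTS"},
    {"Round": "Final Jeopardy!", "Category": "WORD ORIGINS"},
    {"Round": "Double Jeopardy!", "Category": "WORD ORIGINS"},
]
final_categories = common_final_categories(rows)
print(final_categories)
```

Only rows from the final round are counted, so the "WORD ORIGINS" question from another round does not inflate the total.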
My best advice to a Jeopardy contestant.