Jeopardy is a popular TV show in the US in which participants answer trivia questions to win money. It has been running for many years and is a major force in popular culture.
Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.
The dataset is named jeopardy.csv and contains the first 20,000 rows of a full dataset of Jeopardy questions, which you can download here.
The main columns are Show Number (the episode number), Air Date (when the episode aired), Round (the round of the game), Category, Value (the dollar value of the clue), Question, and Answer.
Below is a sample and some information about the dataset:
# Import pandas module
import pandas as pd
# Read dataset into dataframe
jeopardy = pd.read_csv("jeopardy.csv")
# Show the first 5 rows
jeopardy.head()
Before analysis can begin, I need to normalize the data. For example, I need to make sure that all strings are lower case, that punctuation is stripped, and that dollar values are converted to integers.
# Import modules
import re
import datetime as dt
# Remove spaces from column titles
jeopardy.columns = jeopardy.columns.str.replace(" ",'')
# Function to lower-case text and remove punctuation
def normalize(value):
    value = value.lower()
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    value = re.sub(r"\s+", " ", value)
    return value
# Create clean answer and question columns
jeopardy["clean_question"] = jeopardy['Question'].apply(normalize)
jeopardy["clean_answer"] = jeopardy['Answer'].apply(normalize)
# Function to clean the Value column
def normalize_2(value):
    # Replace punctuation with "" (not " ") so "$1,000" becomes "1000"
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value
# Create clean value column
jeopardy["clean_value"] = jeopardy['Value'].apply(normalize_2)
# Change Air date column to datetime object
jeopardy['AirDate'] = pd.to_datetime(jeopardy["AirDate"])
# Show
jeopardy
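As a quick sanity check of the cleaning logic, here is a minimal standalone sketch that re-creates both helpers (the sample strings are made up for illustration):

```python
import re

# Standalone re-creations of the cleaning helpers; sample inputs are made up
def normalize_text(value):
    value = value.lower()
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    value = re.sub(r"\s+", " ", value)
    return value

def normalize_value(value):
    # Stripping punctuation entirely lets "$1,000" parse as 1000
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        return int(value)
    except ValueError:
        return 0

print(normalize_text("It's  \"Jeopardy!\""))  # its jeopardy
print(normalize_value("$1,000"))              # 1000
print(normalize_value("None"))                # 0
```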
As you can see above, the data has now been normalized and I can begin to analyze.
The first thing I will check is how often the answer is included in the question. Perhaps a good strategy is to find the answer in the words of the question itself.
# Create function to find answers in questions
def answers_in_questions(row):
    # Split the answer and question into words
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # Remove 'the' from the answer, since it matches too easily
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    # Count answer words that also appear in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
# Apply function to dataset
jeopardy["answer_in_question"] = jeopardy.apply(answers_in_questions, axis=1)
# Find the mean share of answer words that appear in the question
average = jeopardy["answer_in_question"].mean()
# Print
print("Probability that the answer is in the question: {}".format(average))
On average the answer is in the question less than 6% of the time, which is not really enough to help. In Jeopardy you lose the value of the clue if you answer incorrectly, so gambling on a 6% chance is not a winning strategy.
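The arithmetic behind that conclusion, assuming a symmetric gain or loss of the clue's value (the $800 stake below is just an illustrative figure):

```python
# Expected value of guessing when the answer appears in the question
# about 6% of the time; $800 is a hypothetical clue value
p_hit = 0.06
stake = 800
expected = p_hit * stake - (1 - p_hit) * stake
print(expected)  # roughly -704: a losing bet on average
```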
Next I will check whether questions are repeated often. If they are, we could study the common ones and increase our chances of winning.
# Find questions that come up more than once.
repeated_questions = jeopardy['clean_question'].value_counts()
repeated_questions = repeated_questions[repeated_questions >= 2]
repeated_questions
Out of 20,000 questions, only 9 are repeated. One of these is titled 'audio clue', and some are quite cryptic, requiring added context from knowing the topic of the round, so this does not help us. However, even if questions are not repeated word for word, perhaps there are complex words (6+ characters) that pop up in many questions:
# Probability that complex words appeared in older questions
question_overlap = []
terms_used = set()
# Sort by air date so earlier questions are processed first
jeopardy = jeopardy.sort_values("AirDate")
# Loop through and count words already seen in previous questions
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
# Add the overlap values as a column and find the mean
jeopardy["question_overlap"] = question_overlap
average_2 = jeopardy["question_overlap"].mean()
print("Probability that complex words are in past questions: {}".format(average_2))
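To make the overlap bookkeeping concrete, here is the same logic run on three made-up questions in air-date order:

```python
# Toy run of the overlap logic on made-up questions
questions = [
    "the capital of france is paris",
    "this capital city lies on the seine",
    "name the capital of germany",
]
terms_seen = set()
overlaps = []
for q in questions:
    # Keep only complex words (6+ characters)
    words = [w for w in q.split() if len(w) > 5]
    # Count words already seen in earlier questions, then record them
    match_count = sum(w in terms_seen for w in words)
    terms_seen.update(words)
    overlaps.append(match_count / len(words) if words else 0)
print(overlaps)  # [0.0, 1.0, 0.5]
```

The second question reuses "capital" (its only complex word), and the third reuses one of its two complex words.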
Nearly 70% is encouraging, but it should be pointed out that these are single words, not phrases, so it can be difficult to ascertain the context in which a word is used. Still, 70% is encouraging enough to look in more detail. Let's dig a little deeper.
Another column that we have is the Value column: each question is worth a different amount. I am now going to see if the high value questions ($800+) are more likely to contain words that have been used in past questions:
# Function to flag high or low value questions
def find_value(value):
    if value > 800:
        return 1
    else:
        return 0

# Apply function
jeopardy["high_value"] = jeopardy["clean_value"].apply(find_value)
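As a design note, the same flag can be produced with a vectorized comparison instead of apply; a sketch on a tiny made-up frame:

```python
import pandas as pd

# Hypothetical clean values; the >800 threshold matches the project
demo = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})
demo["high_value"] = (demo["clean_value"] > 800).astype(int)
print(demo["high_value"].tolist())  # [0, 0, 1, 0]
```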
# Count a word's appearances in high and low value questions
def usage(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        clean_question = row["clean_question"].split()
        if word in clean_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]
from random import sample
# Sample 10 terms to compare (the set must be converted to a list first)
comparison_terms = sample(list(terms_used), 10)
observed_expected = []
for word in comparison_terms:
    observed_expected.append(usage(word))
observed_expected
import numpy as np
from scipy.stats import chisquare
# Count high and low value questions
high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])
chi_squared = []
# Find chi-squared statistic and p value for each sampled term
for high, low in observed_expected:
    total = high + low
    total_prop = total / len(jeopardy)
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count
    observed = np.array([high, low])
    expected = np.array([high_expected, low_expected])
    chi_values = chisquare(observed, expected)
    chi_squared.append(chi_values)
chi_squared
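For context on how the chisquare output is read, here is a sketch with made-up counts for a single word (5 high-value and 15 low-value appearances, against slightly different expected counts):

```python
import numpy as np
from scipy.stats import chisquare

# Made-up observed vs expected counts for one hypothetical word
observed = np.array([5, 15])
expected = np.array([6.0, 14.0])
stat, p = chisquare(observed, expected)
# p is well above 0.05 here, so this word shows no significant
# skew toward high or low value questions
print(stat, p)
```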
None of the p values are below 0.05, so there is nothing statistically significant here: high value questions are no more likely to reuse terms than low value questions.
The next step is to take a closer look at the most common complex words. I am going to find the complex words that appear in over 1% of questions.
value_counts = {}
# Tally complex words across all questions
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    for word in split_question:
        if word in value_counts:
            value_counts[word] += 1
        else:
            value_counts[word] = 1
# Keep only words in at least 1% of questions (200 of 20,000)
above_200 = {}
for key, value in value_counts.items():
    if value >= 200:
        above_200[key] = round((value / 20000) * 100, 3)
above_200
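As a design note, the same tally can be written more compactly with collections.Counter; a sketch on a couple of made-up questions:

```python
from collections import Counter

# Made-up questions; Counter tallies 6+ character words in one pass
questions = ["the eiffel tower stands in paris", "this tower leans in pisa"]
counts = Counter(
    word for q in questions for word in q.split() if len(word) > 5
)
print(counts)  # 'eiffel' and 'stands' each appear once; 'tower' is too short
```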
A few topics that a Jeopardy contestant could study are presidents, countries, and capitals.
Let's use the same technique to check out the categories column:
# Tally categories
value_counts = {}
for category in jeopardy["Category"]:
    if category in value_counts:
        value_counts[category] += 1
    else:
        value_counts[category] = 1
# Keep only categories that appear at least 30 times
above_30 = {}
for key, value in value_counts.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)
above_30
Once again we have capitals and presidents.
Notice that some of these contain repeating words. For example, we have both history and American history. Let's see which words pop up most often in the category names:
# Tally words (5+ characters) in category names
word_count = {}
for words in value_counts.keys():
    split_categories = words.split()
    split_categories = [w for w in split_categories if len(w) > 4]
    for word in split_categories:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
# Keep only words that appear at least 30 times
above_30 = {}
for key, value in word_count.items():
    if value >= 30:
        above_30[key] = round((value / 20000) * 100, 3)
above_30
The best categories to study are listed above.
Jeopardy also has different rounds. The most important round by far is Final Jeopardy, where contestants can wager their winnings and potentially double their money. Let's see if any categories come up a lot in the Final Jeopardy round:
# Filter final round questions
final_jeopardy = jeopardy[jeopardy["Round"] == "Final Jeopardy!"]
# Categories that appear more than twice
counts = final_jeopardy['Category'].value_counts()
counts = counts[counts > 2]
counts
Word origins is up there again, as are other common categories and topics we have seen before, like presidents.