We'll be looking for patterns in Jeopardy questions in order to help us win!
import pandas as pd
#Read in the file
jeopardy = pd.read_csv('jeopardy.csv')
#Explore the file
print(jeopardy.shape)
print(jeopardy.head(5))
(19999, 7)
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200

                                            Question      Answer
0  For the last 8 years of his life, Galileo was ...  Copernicus
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe
2  The city of Yuma in this state has a record av...     Arizona
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's
4  Signer of the Dec. of Indep., framer of the Co...  John Adams
#Column names
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
Some of the column names have leading spaces; we will remove them.
jeopardy.rename(columns = {' Air Date':'Air Date', ' Round':'Round', ' Category': 'Category', ' Value':'Value', ' Question':'Question', ' Answer':'Answer'}, inplace = True)
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
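The explicit rename works, but it requires listing every column by hand. As a sketch of a more general alternative, `Index.str.strip()` removes leading and trailing whitespace from all column names at once. The one-row DataFrame below is a made-up stand-in for the real file, used only so the example runs on its own.

```python
import pandas as pd

# Toy frame with the same padded headers as jeopardy.csv (the row is illustrative only)
df = pd.DataFrame(
    [[4680, '2004-12-31', 'Jeopardy!', 'HISTORY', '$200', 'q', 'a']],
    columns=['Show Number', ' Air Date', ' Round', ' Category',
             ' Value', ' Question', ' Answer'],
)

# Strip leading/trailing whitespace from every column name in one step
df.columns = df.columns.str.strip()
print(list(df.columns))
```

This approach keeps working if the file gains or loses columns, at the cost of being less explicit about which names change.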
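Next, we write a function that normalizes a string: it lowercases the text, removes punctuation, and collapses runs of whitespace into a single space.
import re

def normalize_string(string):
    # Lowercase so that comparisons are case-insensitive
    string = string.lower()
    # Remove everything that is not a word character or whitespace
    string = re.sub(r'[^\w\s]', '', string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r'\s+', ' ', string)
    return string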
Normalize the Question and Answer columns.
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)
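The same normalization can also be expressed with pandas' vectorized string methods instead of `apply`. The following is a minimal sketch under that assumption; the sample strings are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample strings with punctuation and irregular whitespace
sample = pd.Series(["McDonald's", "EVERYBODY  TALKS\tABOUT IT...", "Jim Thorpe"])

cleaned = (
    sample.str.lower()
          .str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
          .str.replace(r'\s+', ' ', regex=True)     # collapse whitespace runs
)
print(cleaned.tolist())
```

Unlike a plain Python function, the `.str` methods pass missing values through as NaN rather than raising an error.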
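import re

def normalize_num(num):
    # Strip punctuation such as '$' and ',' and convert the result to int.
    # Values that cannot be parsed (non-numeric text, missing data) become 0.
    try:
        num = re.sub(r'[^\w\s]', '', num)
        return int(num)
    except (TypeError, ValueError):
        return 0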
Normalize the Value column.
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_num)
jeopardy.head(3)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
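As a sketch of what `normalize_num` does to individual entries, the conversion can be reproduced on a few values with vectorized pandas calls. The three inputs below are hypothetical examples (a plain amount, an amount with a thousands separator, and a non-numeric placeholder), not rows copied from the file.

```python
import pandas as pd

# Hypothetical Value entries, including a non-numeric placeholder
values = pd.Series(['$200', '$1,000', 'None'])

# Keep digits only; entries without digits become empty strings
digits = values.str.replace(r'\D', '', regex=True)

# Unparseable entries become NaN, which we then map to 0
clean = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)
print(clean.tolist())
```

Mapping unparseable values to 0 mirrors the function above; whether 0 is the right placeholder depends on how the column is used later.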
To work with the dates in Air Date, we convert the column to a datetime type.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy['Air Date'].head(5))
0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]
jeopardy.head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
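Once a column is a datetime, its components are available through the `.dt` accessor. The sketch below assumes the dates follow the `YYYY-MM-DD` pattern seen in the output above and uses hypothetical values, one of them deliberately malformed, to show how `errors='coerce'` behaves.

```python
import pandas as pd

# Hypothetical air dates; the last entry is deliberately malformed
dates = pd.Series(['2004-12-31', '2010-07-06', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns bad values into NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed.dt.year.tolist())  # year of each date (NaN where parsing failed)
print(parsed.isna().sum())      # number of entries that could not be parsed
```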
To study past questions, we will write a function that computes the fraction of words in the answer (ignoring "the") that also occur in the question.
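def match_in_ans_and_quests(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    # 'the' is too common to be a meaningful match, so drop it from the answer
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when the answer has no words left
    if len(split_answer) == 0:
        return 0
    # Count the answer words that also appear in the question
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Return the share of answer words found in the question
    match_by_count = match_count / len(split_answer)
    return match_by_count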
Below we apply this function to every row and take the mean, i.e. the average fraction of answer terms that also appear in the question.
jeopardy['answer_in_question'] = jeopardy.apply(match_in_ans_and_quests, axis = 1)
ans_in_quest_mean = jeopardy['answer_in_question'].mean()
print(ans_in_quest_mean)
0.0297157516605989
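To make the metric concrete, here is a small standalone sketch of the same calculation on two made-up question/answer pairs. The helper name `answer_overlap` and the example strings are hypothetical; the logic follows `match_in_ans_and_quests`, except that the question words are stored in a set for faster lookups.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring one 'the') that appear in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())  # set for fast membership tests
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(word in question_words for word in answer_words)
    return matches / len(answer_words)

# Hypothetical rows, not from the dataset
full_overlap = answer_overlap('the gold rush', 'the california gold rush began in 1848')
no_overlap = answer_overlap('arizona', 'the city of yuma in this state has a record')
print(full_overlap, no_overlap)
```

A value of 1.0 means every answer word appears in the question, and 0.0 means none do; the column mean averages these fractions over all rows.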
jeopardy.dtypes
Show Number                    int64
Air Date              datetime64[ns]
Round                         object
Category                      object
Value                         object
Question                      object
Answer                        object
clean_question                object
clean_answer                  object
clean_value                    int64
answer_in_question           float64
dtype: object