We'll be looking for patterns in Jeopardy questions in order to help us win!

In [1]:
import pandas as pd
import csv

#Read in the file
jeopardy = pd.read_csv('jeopardy.csv')
In [2]:
#Explore the file

print(jeopardy.shape)
print(jeopardy.head(5))
(19999, 7)
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
In [3]:
#Column names
jeopardy.columns
Out[3]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces before the names, we will remove these.

In [4]:
jeopardy.rename(columns = {' Air Date':'Air Date', ' Round':'Round', ' Category': 'Category', ' Value':'Value', ' Question':'Question', ' Answer':'Answer'}, inplace = True)
jeopardy.columns
Out[4]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Write a function that normalizes a string.

In [5]:
def normalize_string(string):
    string = string.lower()

    import re
    string = re.sub(r'[^\w\s]', '', string)
    string = re.sub(r'[\s+]', ' ', string)
    return string

Normalize the Question and Answer columns.

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)
In [7]:
def normalize_num(num):
    import re
    num = re.sub(r'[^\w\s]', '', num)
    try:
        num = int(num)
    except Exception:
        num = 0
    return num

Normalize the value column.

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_num)
In [9]:
jeopardy.head(3)
Out[9]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer clean_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200

In order to work with Air Date, which has date info, we will convert it to a datetime column.

In [10]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
In [11]:
print(jeopardy['Air Date'].head(5))
0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]
In [12]:
jeopardy.head(5)
Out[12]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer clean_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200

In order to study past questions, we will write a function that counts the number of times words occur in the answer and the question.

In [13]:
def match_in_ans_and_quests(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
        match_by_count = match_count / len(split_answer)
        return match_by_count

Below we count the number of times that the same terms are found in clean_answer and clean_question.

In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(match_in_ans_and_quests, axis = 1)
ans_in_quest_mean = jeopardy['answer_in_question'].mean()
print(ans_in_quest_mean)
0.0297157516605989
In [15]:
jeopardy.dtypes
Out[15]:
Show Number                    int64
Air Date              datetime64[ns]
Round                         object
Category                      object
Value                         object
Question                      object
Answer                        object
clean_question                object
clean_answer                  object
clean_value                    int64
answer_in_question           float64
dtype: object
In [ ]: