We'll be looking for patterns in Jeopardy questions in order to help us win!

In [1]:

import pandas as pd
import csv

#Read in the file
jeopardy = pd.read_csv('jeopardy.csv')

In [2]:

#Explore the file

print(jeopardy.shape)
print(jeopardy.head(5))

(19999, 7)
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams

In [3]:

#Column names
jeopardy.columns

Out[3]:

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces before the names, we will remove these.

In [4]:

jeopardy.rename(columns = {' Air Date':'Air Date', ' Round':'Round', ' Category': 'Category', ' Value':'Value', ' Question':'Question', ' Answer':'Answer'}, inplace = True)
jeopardy.columns

Out[4]:

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Write a function that normalizes a string.

In [5]:

def normalize_string(string):
    string = string.lower()

    import re
    string = re.sub(r'[^\w\s]', '', string)
    string = re.sub(r'[\s+]', ' ', string)
    return string

Normalize the Question and Answer columns.

In [6]:

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)

In [7]:

def normalize_num(num):
    import re
    num = re.sub(r'[^\w\s]', '', num)
    try:
        num = int(num)
    except Exception:
        num = 0
    return num

Normalize the value column.

In [8]:

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_num)

In [9]:

jeopardy.head(3)

Out[9]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus	for the last 8 years of his life galileo was u...	copernicus	200
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe	no 2 1912 olympian football star at carlisle i...	jim thorpe	200
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona	the city of yuma in this state has a record av...	arizona	200

In order to work with Air Date, which has date info, we will convert it to a datetime column.

In [10]:

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [11]:

print(jeopardy['Air Date'].head(5))

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

In [12]:

jeopardy.head(5)

Out[12]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus	for the last 8 years of his life galileo was u...	copernicus	200
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe	no 2 1912 olympian football star at carlisle i...	jim thorpe	200
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona	the city of yuma in this state has a record av...	arizona	200
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's	in 1963 live on the art linkletter show this c...	mcdonalds	200
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams	signer of the dec of indep framer of the const...	john adams	200

In order to study past questions, we will write a function that counts the number of times words occur in the answer and the question.

In [13]:

def match_in_ans_and_quests(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
        match_by_count = match_count / len(split_answer)
        return match_by_count

Below we count the number of times that the same terms are found in clean_answer and clean_question.

In [14]:

jeopardy['answer_in_question'] = jeopardy.apply(match_in_ans_and_quests, axis = 1)
ans_in_quest_mean = jeopardy['answer_in_question'].mean()
print(ans_in_quest_mean)

0.0297157516605989

In [15]:

jeopardy.dtypes

Out[15]:

Show Number                    int64
Air Date              datetime64[ns]
Round                         object
Category                      object
Value                         object
Question                      object
Answer                        object
clean_question                object
clean_answer                  object
clean_value                    int64
answer_in_question           float64
dtype: object

In [ ]: