Jeopardy is a TV quiz show in the United States where participants compete to answer general knowledge questions for money.
Our task is to find patterns in a dataset of past Jeopardy questions in order to gain an edge in the competition.
The dataset can be downloaded as a JSON file (JEOPARDY_QUESTIONS1.json).
Let's read in our dataset and gain a better understanding of its contents.
import pandas as pd
import numpy as np
import re
jeopardy = pd.read_json('JEOPARDY_QUESTIONS1.json')
jeopardy.head()
category | air_date | question | value | answer | round | show_number | |
---|---|---|---|---|---|---|---|
0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 |
1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 |
2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 |
3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 |
4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | $200 | John Adams | Jeopardy! | 4680 |
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   category     216930 non-null  object
 1   air_date     216930 non-null  object
 2   question     216930 non-null  object
 3   value        213296 non-null  object
 4   answer       216930 non-null  object
 5   round        216930 non-null  object
 6   show_number  216930 non-null  int64
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
jeopardy['category'].value_counts().head(20)
BEFORE & AFTER             547
SCIENCE                    519
LITERATURE                 496
AMERICAN HISTORY           418
POTPOURRI                  401
WORLD HISTORY              377
WORD ORIGINS               371
COLLEGES & UNIVERSITIES    351
HISTORY                    349
SPORTS                     342
U.S. CITIES                339
WORLD GEOGRAPHY            338
BODIES OF WATER            327
ANIMALS                    324
STATE CAPITALS             314
BUSINESS & INDUSTRY        311
ISLANDS                    301
WORLD CAPITALS             300
U.S. GEOGRAPHY             299
RELIGION                   297
Name: category, dtype: int64
Our data consists of 7 columns with the following meanings:

- category - the category of the question
- air_date - the date the episode aired
- question - the text of the question
- value - the number of dollars the correct answer is worth
- answer - the text of the answer
- round - the round of Jeopardy
- show_number - the Jeopardy episode number

The columns are all string object types with the exception of show_number, which is integer.
There are 216,930 rows of question data. The dataset has a high degree of completeness with the exception of the value column, which has 3,634 missing values.
In order to perform analysis on the content of the question and answer columns, we will need to normalize words with shared meaning into the same format. To do this we will lowercase the text and strip punctuation.
# function that takes in a string and normalizes it
def normalize_string(string):
    string = string.lower()
    # keep only word characters and whitespace, i.e. strip punctuation
    pattern = r"[^\w\s]"
    string = re.sub(pattern, "", string)
    return string
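A quick sanity check on a made-up string (not from the dataset) shows the effect:

# hypothetical example: punctuation and case are stripped
normalize_string('What\'s the "Answer", Alex?')  # returns 'whats the answer alex'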
# show the before
print(jeopardy['question'].head(1))
0    'For the last 8 years of his life, Galileo was...
Name: question, dtype: object
# apply the functions and add columns
jeopardy['question_clean'] = jeopardy['question'].apply(normalize_string)
jeopardy['answer_clean'] = jeopardy['answer'].apply(normalize_string)
jeopardy['question_clean'].head()
0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: question_clean, dtype: object
The value column is in string format. Let's remove unnecessary string elements and convert these to integers.
jeopardy['value']
0          $200
1          $200
2          $200
3          $200
4          $200
          ...
216925    $2000
216926    $2000
216927    $2000
216928    $2000
216929     None
Name: value, Length: 216930, dtype: object
The only string element we will need to remove is the dollar sign. There also appear to be None values that we will need to handle when we convert to numeric format.
# clean the string and convert to an integer;
# None and other unparseable values become 0
def convert_int(string):
    try:
        # strip the dollar sign (and any commas, in case values like $1,000 appear)
        string = string.replace("$", "").replace(",", "")
        return int(string)
    except (AttributeError, ValueError):
        return 0
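A few spot checks on hypothetical inputs illustrate the intended behavior:

# hypothetical inputs, not rows from the dataset
print(convert_int('$200'))    # 200
print(convert_int('$2,000'))  # 2000 (comma stripped)
print(convert_int(None))      # 0 (None has no .replace, so the except path returns 0)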
# apply the function
jeopardy['value_clean'] = jeopardy['value'].apply(convert_int)
jeopardy['value_clean']
0          200
1          200
2          200
3          200
4          200
          ...
216925    2000
216926    2000
216927    2000
216928    2000
216929       0
Name: value_clean, Length: 216930, dtype: int64
Converting air_date to datetime

The air_date column is in string format, but we have more flexible analysis options if these were in datetime format.
jeopardy['air_date']
0         2004-12-31
1         2004-12-31
2         2004-12-31
3         2004-12-31
4         2004-12-31
             ...
216925    2006-05-11
216926    2006-05-11
216927    2006-05-11
216928    2006-05-11
216929    2006-05-11
Name: air_date, Length: 216930, dtype: object
# convert to date_time
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])
jeopardy['air_date']
0        2004-12-31
1        2004-12-31
2        2004-12-31
3        2004-12-31
4        2004-12-31
            ...
216925   2006-05-11
216926   2006-05-11
216927   2006-05-11
216928   2006-05-11
216929   2006-05-11
Name: air_date, Length: 216930, dtype: datetime64[ns]
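As an example of that flexibility, the datetime dtype gives us direct access to date components; a minimal illustration (output not shown here):

# count questions per year using the datetime accessor
questions_per_year = jeopardy['air_date'].dt.year.value_counts().sort_index()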
Stop words are commonly used words such as "the", "a", "an", and "in" that carry little meaning on their own. Let's remove these words from our questions and answers, as they create noise in our results.
# import our list of stopwords
stopwords = pd.read_csv('stopwords.csv')
stopwords = list(stopwords['stopwords'])
stopwords
['a', 'able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', "can't", 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', "c's", 'currently', 'd', 'dare', "daren't", 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'directly', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'h', 'had', "hadn't", 'half', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', "here's", 'hereupon', 'hers', 'herself', "he's", 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', 'i', "i'd", 'ie', 'if', 'ignored', "i'll", "i'm", 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", 'its', "it's", 'itself', "i've", 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'likewise', 'little', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', "mayn't", 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', "mightn't", 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', "mustn't", 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', "needn't", 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', "one's", 
'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', "oughtn't", 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that's", "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', "there'd", 'therefore', 'therein', "there'll", "there're", 'theres', "there's", 'thereupon', "there've", 'these', 'they', "they'd", "they'll", "they're", "they've", 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', "t's", 'twice', 'two', 'u', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'v', 'value', 'various', 'versus', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", 'welcome', 'well', "we'll", 'went', 'were', "we're", "weren't", "we've", 'what', 'whatever', "what'll", "what's", "what've", 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', "where's", 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 'whither', 'who', "who'd", 'whoever', 'whole', "who'll", 'whom', 'whomever', "who's", 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', "won't", 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']
jeopardy['question_clean'].head()
0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: question_clean, dtype: object
# membership tests are much faster against a set than a list
stopword_set = set(stopwords)

# function that removes stopwords from sentences
def remove_stopwords(sentence):
    word_list = sentence.split()
    words_left = [word for word in word_list if word not in stopword_set]
    return ' '.join(words_left)
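A quick sanity check on a made-up phrase:

# hypothetical example: 'the', 'of', 'in', and 'this' are stopwords
remove_stopwords('the city of yuma in this state')  # returns 'city yuma state'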
jeopardy['question_clean'] = jeopardy['question_clean'].apply(remove_stopwords)
jeopardy['answer_clean'] = jeopardy['answer_clean'].apply(remove_stopwords)
jeopardy['question_clean'].head()
0    8 years life galileo house arrest espousing ma...
1    2 1912 olympian football star carlisle indian ...
2    city yuma state record average 4055 hours suns...
3    1963 live art linkletter show company served b...
4    signer dec indep framer constitution mass pres...
Name: question_clean, dtype: object
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer to a question can be found within the question itself.
- How often new questions are repeats of older questions.

Let's start with how often the answer to a question can be found within the question itself.
# function that takes a row and checks for matches between answer words and question words
def answer_in_question(row):
    # create lists of words (use column names rather than fragile positional indexing)
    split_answer = row['answer_clean'].split()
    split_question = row['question_clean'].split()
    # validate to prevent division by zero
    if len(split_answer) == 0 or len(split_question) == 0:
        return 0
    # count the number of answer words that appear in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # ratio of matched words to the question's length
    words_in_answer_ratio = match_count / len(split_question)
    return words_in_answer_ratio
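As a quick illustration, the first row's answer ('copernicus') shares no words with its question, so the ratio is 0:

# spot check on the first row before applying to the whole frame
answer_in_question(jeopardy.iloc[0])  # returns 0.0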
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis = 1)
jeopardy.head()
category | air_date | question | value | answer | round | show_number | question_clean | answer_clean | value_clean | answer_in_question | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 | 8 years life galileo house arrest espousing ma... | copernicus | 200 | 0.0 |
1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 | 2 1912 olympian football star carlisle indian ... | jim thorpe | 200 | 0.0 |
2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 | city yuma state record average 4055 hours suns... | arizona | 200 | 0.0 |
3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 | 1963 live art linkletter show company served b... | mcdonalds | 200 | 0.0 |
4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | $200 | John Adams | Jeopardy! | 4680 | signer dec indep framer constitution mass pres... | john adams | 200 | 0.0 |
jeopardy['answer_in_question'].value_counts()
0.000000    201330
0.125000      2386
0.111111      2316
0.142857      2156
0.100000      1678
             ...
0.363636         1
0.074074         1
0.150000         1
0.800000         1
0.230769         1
Name: answer_in_question, Length: 62, dtype: int64
jeopardy['answer_in_question'].mean()
0.011056947199412304
On average, only about 1% of a question's words overlap with its answer, so the question text itself gives almost no direct clue to the answer. Not very helpful.
We want to know what percentage of questions are repeats of older questions. This will help us prioritize our study time.

We can approximate repetition by checking how often words reoccur across the dataset. Let's sort the questions by air date and, for each one, measure the fraction of its words that have already appeared in an earlier question.
question_overlap = []
terms_used = set()

# process questions in the order they aired
sorted_jeopardy = jeopardy.sort_values('air_date')
for index, row in sorted_jeopardy.iterrows():
    match_count = 0
    split_question = row['question_clean'].split(' ')
    # if a word has been seen before, count it; otherwise record it
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    # guard against empty questions before taking the ratio
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
# align the overlap values with the sorted index so each value lands on the right row
jeopardy['question_overlap'] = pd.Series(question_overlap, index=sorted_jeopardy.index)
jeopardy['question_overlap'].mean()
0.9137122446585912
We see that, on average, 91% of the words in a question have already appeared in an earlier question. This measures single words rather than whole questions or phrases, so it overstates true question reuse, but the overlap is large enough to warrant a closer look. Let's see if we can get more insight.
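As an aside, phrases would be a stronger repeat signal than single words. Below is a minimal sketch of how two-word phrases (bigrams) could be tracked with the same loop structure; it is not run here, and its results would differ from the single-word figure above. The word tally that follows continues with the single-word approach.

# sketch: track two-word phrases instead of single words
def bigrams(text):
    words = text.split()
    return set(zip(words, words[1:]))

bigram_terms_used = set()
bigram_overlap = []
for index, row in sorted_jeopardy.iterrows():
    grams = bigrams(row['question_clean'])
    seen = len(grams & bigram_terms_used)
    bigram_overlap.append(seen / len(grams) if grams else 0)
    bigram_terms_used.update(grams)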
word_count = {}

# tally occurrences of each complex word (6+ characters) across all questions
def count_repeated_words(row):
    words = row['question_clean'].split()
    for word in words:
        # skip short words rather than abandoning the whole row
        if len(word) < 6:
            continue
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return
jeopardy.apply(count_repeated_words, axis = 1)
0         None
1         None
2         None
3         None
4         None
          ...
216925    None
216926    None
216927    None
216928    None
216929    None
Length: 216930, dtype: object
Our apply call populated the word_count dictionary as a side effect (which is why the apply itself returns None for every row). Let's explore it and create a new DataFrame.
word_count
{'signer': 14, 'winter': 93, '197172': 4, 'record': 86, 'housewares': 2, 'companys': 117, 'accutron': 1, 'outlaw': 16, 'murdered': 19, 'traitor': 9, 'coward': 4, 'worthy': 6, 'africas': 32, 'lowest': 71, 'temperature': 58, 'edward': 141, 'teller': 7, 'geologic': 14, 'kirschner': 1, 'brothers': 163, 'revolutionary': 68, 'single': 92, 'hrefhttpwwwjarchivecommedia20041231_dj_23mp3beyond': 1, 'california': 217, 'steven': 32, '19541955': 1, 'shorter': 15, 'hrefhttpwwwjarchivecommedia20041231_dj_26mp3ripped': 1, 'todays': 16, 'headlines': 14, 'turtle': 10, 'philadelphia': 55, 'modern': 132, 'hrefhttpwwwjarchivecommedia20041231_dj_25mp3somewhere': 1, 'crosscountry': 7, 'skiing': 14, 'referred': 72, 'hrefhttpwwwjarchivecommedia20041231_dj_24mp3500': 1, 'objects': 31, 'largest': 766, 'kingdom': 58, 'united': 116, 'hrefhttpwwwjarchivecommedia20100706_j_22wmvtate': 1, 'huggable': 1, 'peaches': 2, 'island': 568, '5letter': 110, 'separating': 6, 'graphic': 14, 'representation': 15, 'information': 36, 'family': 379, 'history': 146, 'device': 205, 'italian': 396, 'springboard': 3, 'perfected': 6, 'composer': 182, 'wolfgang': 5, 'hrefhttpwwwjarchivecommedia20100706_dj_26jpg': 1, 'target_blankthisa': 481, '4letter': 150, 'pleated': 2, 'perfect': 60, 'hrefhttpwwwjarchivecommedia20100706_dj_27jpg': 1, 'adjective': 362, 'proper': 27, 'mighty': 34, 'hrefhttpwwwjarchivecommedia20100706_dj_14jpg': 1, 'target_blankthis': 33, 'hrefhttpwwwjarchivecommedia20100706_dj_28jpg': 1, 'hrefhttpwwwjarchivecommedia20100706_dj_21wmvjimmy': 1, '11yearold': 9, 'ashlyn': 1, 'hrefhttpwwwjarchivecommedia20100706_dj_29jpg': 1, 'target_blankthesea': 50, 'surprise': 32, 'hrefhttpwwwjarchivecommedia20100706_dj_30jpg': 1, 'falcon': 12, 'cinderella': 13, 'castle': 71, 'mystery': 69, 'conrad': 18, 'begins': 161, 'untamed': 2, 'heifer': 1, 'virgin': 41, 'barbra': 22, 'streisand': 17, 'played': 670, 'stringing': 2, 'capital': 796, 'husband': 118, 'longest': 148, 'edition': 35, 'catholic': 50, 'saturday': 46, 'andrea': 22, 'palladios': 5, 'policy': 22, 'qualify': 5, '15year': 2, 'saintpierre': 1, 'miquelon': 1, 'designer': 62, 'vivienne': 3, 'westwood': 3, 'lerner': 11, 'loewes': 5, 'montserrat': 3, 'warhol': 4, 'capitalist': 1, 'tendencies': 1, 'portrait': 66, 'patrick': 64, 'dennis': 46, 'auntiebr': 1, 'garland': 5, 'jeffreys': 1, 'katharine': 14, 'hamnett': 1, 'created': 295, 'electrabr': 1, 'chiricahua': 6, 'apache': 8, 'popular': 557, 'attraction': 21, 'inventors': 17, 'camerastabilizing': 1, 'madeira': 3, 'islands': 184, 'colchian': 1, 'jilted': 2, 'jasonbr': 1, 'credits': 18, 'cutter': 3, 'exposed': 12, 'unfinished': 14, 'northern': 132, 'mariana': 5, 'manager': 33, 'fausts': 1, 'fiendish': 2, 'bovine': 23, 'minoru': 2, 'yamasaki': 1, 'reached': 32, 'heights': 11, 'quentin': 3, 'tarantino': 3, 'directed': 95, 'student': 79, 'william': 424, 'pereira': 1, 'erected': 10, 'transamerica': 4, 'pyramid': 15, 'compounds': 26, 'element': 218, 'creation': 32, 'charles': 256, 'bulfinch': 4, 'contributed': 11, 'capitol': 18, 'washington': 222, 'hollow': 25, 'dallasfort': 1, 'flower': 132, 'exodus': 57, 'thrown': 19, 'pharaoh': 17, 'mediterranean': 39, 'country': 1065, 'people': 564, 'hollywood': 48, 'euripides': 6, 'porfirio': 1, 'ronaldo': 1, 'secondlargest': 53, 'alaskan': 18, 'vladimir': 5, 'samsonov': 1, 'touted': 3, 'europes': 29, 'exiled': 14, 'manslaughter': 2, 'arabic': 62, 'mainland': 39, 'peninsula': 86, 'closest': 46, 'russia': 64, 'zealandborn': 3, 'appropriatesounding': 1, 'moshoeshoe': 1, 'scientists': 75, 'universe': 22, 
'undergoing': 2, 'expansion': 6, 'called': 1340, 'economic': 21, 'bordering': 19, 'numerically': 7, 'speaking': 85, 'shared': 163, 'descriptive': 12, 'nickname': 390, 'robert': 367, 'auditioned': 3, 'feminine': 18, 'holder': 9, 'liquid': 73, 'military': 224, 'hrefhttpwwwjarchivecommedia20060206_j_29jpg': 1, 'target_blankflaga': 7, 'possession': 19, 'golfing': 9, 'previously': 16, 'attached': 20, '50star': 1, 'number': 526, 'ljubljanabr': 1, 'bratislavabr': 1, 'barcelona': 10, 'beheld': 3, 'wretchthe': 2, 'miserable': 6, 'monster': 63, 'needed': 31, 'istanbulbr': 3, 'ottawabr': 3, 'hrefhttpwwwjarchivecommedia20060206_dj_23jpg': 1, 'target_blankjimmy': 674, 'bowlshaped': 3, 'depression': 26, 'impact': 8, 'meteorite': 4, 'sofiabr': 2, 'sarajevobr': 1, 'saigon': 6, 'hrefhttpwwwjarchivecommedia20060206_dj_13jpg': 1, 'target_blanksarah': 729, 'lawyers': 14, 'defend': 5, 'nnegroes': 1, 'atticus': 5, 'bucharestbr': 1, 'bonnbr': 1, 'bookstore': 3, 'hrefhttpwwwjarchivecommedia20060206_dj_04jpg': 1, 'target_blankjon': 355, 'belize': 15, 'citybr': 29, 'guatemala': 11, 'panama': 20, 'hrefhttpwwwjarchivecommedia20060206_dj_15wmva': 1, 'honeycolored': 1, 'retriever': 5, 'lookst': 1, 'clothing': 46, 'outline': 7, 'contents': 5, 'curriculum': 3, 'december': 230, 'goopcom': 1, 'lifestyles': 1, 'website': 97, 'oscarwinning': 43, 'actress': 331, 'represent': 18, 'initials': 44, 'baylorbr': 2, 'stephen': 111, 'austinbr': 1, 'synonym': 207, 'dignity': 2, 'antiochbr': 1, 'bowling': 24, 'greenbr': 4, 'dialogue': 12, 'depaulbr': 1, 'wheatonbr': 1, 'northwestern': 12, 'premier': 24, 'district': 30, 'conservative': 32, 'frakkin': 1, 'gramblingbr': 1, 'mcneese': 1, 'statebr': 8, 'southern': 194, 'huitieme': 1, 'french': 957, 'ordinal': 8, 'meaning': 497, 'deceptive': 4, 'sneaky': 4, 'american': 790, 'idiotbr': 2, 'dookie': 1, 'answer': 39, 'automaticallybr': 1, 'property': 34, 'saysbr': 1, 'controlled': 18, 'nuclear': 47, 'master': 104, 'puppetsbr': 1, 'hrefhttpwwwjarchivecommedia20090508_dj_28jpg': 1, 'target_blankkelly': 489, 'khruschevs': 1, 'secret': 90, 'speech': 68, 'denounces': 2, 'stalin': 10, 'compass': 15, 'direction': 50, 'protogermanic': 1, 'honorbr': 2, 'humans': 46, 'symbol': 234, 'george': 443, 'orwell': 15, 'reviewed': 8, 'beggars': 5, 'banquetbr': 2, 'hrefhttpwwwjarchivecommedia20090508_dj_30jpg': 1, 'amount': 36, 'solution': 13, 'measured': 36, 'saccharometer': 1, 'daedalus': 6, 'substance': 97, 'fasten': 8, 'sunday': 29, 'states': 374, 'complete': 50, 'energy': 61, 'pyrheliometer': 1, 'cadmus': 2, 'planted': 10, 'freddys': 1, 'nightmares': 5, 'horror': 31, 'anthology': 7, 'debuted': 41, 'monroe': 21, 'snohomish': 1, 'restaurant': 65, 'quartets': 3, 'cocommanders': 1, 'odometer': 1, 'measures': 20, 'distance': 32, 'covered': 46, 'vehicle': 40, 'sister': 131, 'orestes': 4, 'mourning': 7, 'contented': 3, 'performing': 17, 'kittens': 4, 'jedediah': 1, 'spirometer': 2, 'capacity': 15, 'organs': 45, 'character': 353, 'hrefhttpwwwjarchivecommedia19961206_j_04wmvherea': 1, 'actionpacked': 2, 'journeys': 6, 'legendary': 100, 'pendleton': 3, 'roundup': 1, 'annual': 92, 'squash': 17, 'pierced': 4, 'senator': 174, 'thomas': 247, 'nilometer': 1, 'height': 26, 'tanglewood': 1, 'remained': 5, 'sedate': 2, 'isthmus': 18, 'connects': 27, 'painted': 139, 'irises': 4, 'historians': 24, 'february': 175, '179495': 1, 'served': 246, 'acting': 39, 'supervisor': 4, 'dumfries': 2, 'scotland': 32, 'discovered': 141, 'sculpture': 41, 'philip': 70, 'morris': 21, 'cincinnatibased': 2, 'favorite': 150, 'british': 654, 
'sculptors': 3, 'reclining': 5, 'mother': 203, 'making': 113, 'repairs': 9, 'september': 188, 'hrefhttpwwwjarchivecommedia20101207_j_26jpg': 1, 'originally': 301, 'proverbially': 101, 'hrefhttpwwwjarchivecommedia20101207_j_27jpg': 1, 'hrefhttpwwwjarchivecommedia20101207_j_28jpg': 1, 'everlast': 1, 'leaders': 34, 'cochise': 2, 'mangas': 1, 'coloradas': 1, 'casinos': 6, 'blackjack': 10, 'dealer': 6, 'original': 241, 'provocative': 3, 'architects': 10, 'working': 82, 'messrs': 3, 'netanyahu': 3, 'britten': 6, 'pluralized': 1, 'blanche': 3, 'dubois': 11, 'giving': 40, 'franklin': 66, 'roosevelts': 4, 'caterpillar': 6, 'microtrach': 1, 'oxygen': 25, 'delivery': 14, 'system': 166, 'developed': 138, 'physician': 42, 'maneuver': 8, 'hrefhttpwwwjarchivecommedia20101207_dj_09jpg': 1, 'target_blankdr': 42, 'barbara': 60, 'undershaft': 1, 'elected': 116, 'senate': 92, 'branch': 148, 'pediatrics': 2, 'important': 144, 'mechanism': 8, 'valley': 69, 'eastern': 101, 'european': 204, 'hrefhttpwwwjarchivecommedia20070530_j_09jpg': 1, 'target_blankcheryl': 412, 'hrefhttpwwwjarchivecommedia20070530_j_27jpg': 1, 'target_blankhi': 276, 'mountains': 65, 'summit': 14, 'hrefhttpwwwjarchivecommedia20070530_j_16jpg': 1, 'target_blankalex': 90, 'reports': 129, 'braced': 1, 'framework': 6, 'carrying': 23, 'railroad': 55, 'titles': 54, 'hrefhttpwwwjarchivecommedia20070530_j_17jpg': 1, 'target_blank2': 1, 'paintingsa': 1, 'hrefhttpwwwjarchivecommedia20070530_j_17ajpg': 1, 'target_blankseen': 12, 'considered': 111, 'healthiest': 1, 'headquarters': 29, 'compound': 34, 'langley': 2, 'virginia': 125, 'voiced': 23, 'quickly': 19, 'hrefhttpwwwjarchivecommedia20070530_j_20jpg': 1, 'boeing': 10, 'manufacturing': 7, 'voices': 7, 'beavis': 4, 'butthead': 4, 'cruise': 23, 'bestselling': 39, 'passenger': 17, 'hawaiian': 62, 'wreath': 4, 'defending': 5, 'soldiers': 64, 'resort': 50, 'antabuse': 1, 'designed': 143, 'transport': 21, 'landing': 28, 'founded': 484, 'estramustine': 1, 'chemotherapy': 1, 'cabbies': 2, 'gentle': 13, 'gritty': 3, 'pravastatin': 1, 'sixteen': 6, 'candles': 17, 'plymouth': 14, 'plantation': 17, 'tranquilizer': 3, 'sounds': 137, 'village': 50, 'introduced': 240, 'replenish': 1, 'richard': 242, 'attenborough': 2, 'custers': 5, 'trainers': 2, 'doorbr': 1, 'nobelbr': 1, 'scotts': 5, 'familiar': 84, 'hrefhttpwwwjarchivecommedia19971110_j_17jpg': 1, 'tubesbr': 1, 'doughnutsbr': 1, 'presidents': 110, 'seminole': 2, 'indian': 171, 'leader': 229, 'osceola': 3, 'buried': 59, 'magazine': 133, 'peeling': 2, 'onionsbr': 1, 'watching': 19, 'kennedy': 48, 'center': 186, 'invites': 6, 'public': 98, 'wyoming': 27, 'monument': 65, '865foothigh': 1, 'fluted': 3, 'column': 33, 'igneous': 8, 'antoinette': 1, 'concellos': 1, 'triple': 24, 'somersault': 2, 'helped': 149, 'detroitborn': 2, 'broadcaster': 5, 'bobbybr': 1, 'castillo': 3, 'hollywoodbr': 2, 'saladbr': 2, 'genuine': 10, 'tempera': 1, 'painting': 119, 'breaking': 25, 'goahead': 1, 'hrefhttpwwwjarchivecommedia19971110_dj_20jpg': 1, 'target_blankvideo': 7, 'surrealists': 3, 'hrefhttpwwwjarchivecommedia19971110_dj_27jpg': 1, 'preparing': 8, 'steichen': 2, 'coastal': 21, 'waters': 22, 'national': 636, 'jurisdiction': 6, 'hrefhttpwwwjarchivecommedia19971110_dj_28jpg': 1, 'battle': 191, 'football': 97, 'observing': 3, 'pilgrims': 39, 'traveling': 29, 'shrine': 16, 'becket': 8, 'inspired': 125, 'ferried': 1, 'acolytes': 1, 'subdeacon': 1, 'business': 101, 'project': 36, 'authors': 135, 'technothriller': 1, 'rainbow': 29, 'focuses': 6, 'jackson': 79, 'hormel': 4, 'product': 99, 
'simply': 34, 'spiced': 3, 'author': 589, 'university': 382, 'publication': 25, 'golden': 120, 'spanishnamed': 1, 'appetizer': 13, 'tortilla': 10, 'paperthin': 4, 'dessert': 63, 'equivalent': 62, 'pancake': 6, 'current': 172, 'events': 61, 'hrefhttpwwwjarchivecommedia20020611_j_11wmvjimmy': 1, 'attorney': 40, 'hutchinson': 4, 'letters': 75, 'scandal': 22, 'famous': 845, 'landmark': 69, 'composed': 103, '19041905': 2, 'traditionally': 119, 'shepherds': 12, 'confuse': 3, 'mammal': 78, 'running': 88, 'presence': 11, 'bizets': 4, 'carmen': 17, 'sandwich': 45, 'depicting': 9, 'reminiscent': 4, 'highway': 40, 'heaven': 34, 'series': 243, 'sukkot': 3, 'jewish': 105, 'festival': 70, 'phrase': 223, 'precedes': 161, 'defeated': 38, 'dunsinane': 1, 'malcolm': 28, 'offbeat': 3, 'partial': 7, 'church': 120, 'countrys': 374, 'christian': 99, 'denomination': 14, 'heartbreaking': 1, 'caught': 36, 'hrefhttpwwwjarchivecommedia20020611_dj_19jpg': 1, 'target_blankjeff': 35, 'probst': 32, 'marquesasa': 1, 'hundreds': 17, 'springs': 21, 'cohosted': 2, 'common': 364, 'marshs': 1, 'library': 118, 'comedy': 93, 'architect': 63, 'gordon': 26, 'bunshaft': 1, 'presidential': 113, 'austin': 19, 'founding': 33, 'prophet': 48, 'mormonism': 1, 'verona': 6, 'sterling': 8, 'memorial': 68, 'oklahoma': 42, 'salinas': 6, 'edmunds': 2, 'produced': 105, 'bremsstrahlung': 1, 'german': 380, 'radiation': 21, 'halloween': 24, 'recorded': 50, 'phoebe': 9, 'longfellow': 40, 'administrative': 19, 'pomegranate': 5, 'longtime': 41, 'anchor': 11, 'oliver': 64, 'wendell': 6, 'holmes': 43, 'teaspoon': 2, 'cabinda': 1, 'appointed': 49, 'conductor': 35, 'boston': 98, 'descended': 18, 'abyssinian': 1, 'prince': 194, 'scotch': 11, 'whisky': 5, 'singers': 36, 'narrow': 36, 'passage': 26, 'separates': 34, 'canadas': 64, 'ellesmere': 2, 'yankee': 34, 'greatest': 96, 'drawing': 23, 'genesis': 98, 'detroit': 37, 'astronomer': 46, 'documented': 2, 'oneonone': 1, 'occasional': 2, 'misspelling': 3, 'noticable': 1, 'telecast': 3, 'jeopardy': 53, 'israel': 33, 'october': 245, 'bodyguards': 2, 'satwant': 1, 'supersede': 1, 'competition': 27, 'excede': 1, 'expectations': 4, 'appropriately': 194, 'glasgow': 3, 'canadian': 138, 'province': 104, 'international': 140, 'insect': 71, 'gangster': 17, 'reallife': 14, 'committed': 10, 'catagorizing': 1, 'collectibles': 2, 'cohesively': 1, 'border': 81, 'including': 90, 'distinguished': 16, 'marines': 13, 'honors': 22, 'pausanius': 1, 'accomodations': 1, 'monastery': 9, 'rudimentary': 3, 'hrefhttpwwwjarchivecommedia20100208_j_20jpg': 1, 'stands': 109, 'double': 89, 'unusual': 31, 'selfsatisfied': 2, 'burning': 28, 'moscow': 33, 'napoleons': 18, 'hrefhttpwwwjarchivecommedia20100208_dj_10jpg': 1, 'friends': 91, 'neutral': 13, 'reaction': 13, 'living': 134, 'plants': 81, 'underground': 28, 'hrefhttpwwwjarchivecommedia20100208_dj_27bjpg': 1, 'instrument': 184, 'religion': 95, 'corruption': 14, 'office': 103, 'libyas': 5, 'norman': 59, 'addition': 82, 'helping': 22, 'commerce': 15, 'venezuela': 19, 'season': 90, 'breathes': 1, 'heavily': 11, 'introduction': 20, 'oregon': 44, 'entered': 18, 'believed': 107, 'beatles': 81, 'australian': 77, 'include': 346, 'charlottes': 6, 'western': 155, 'camper': 1, 'pitches': 3, 'canada': 62, 'celebrates': 36, 'critic': 27, 'walter': 72, 'examples': 48, 'vietnamese': 15, 'fences': 6, 'frankenstein': 7, 'invisible': 15, 'animated': 64, 'relationship': 29, 'lizzie': 2, 'borden': 7, 'lowlevel': 2, 'fertile': 12, 'marshland': 1, 'vietnams': 8, 'southernmost': 46, 'region': 145, 
'nominated': 48, 'oscars': 15, 'studied': 44, 'convent': 4, 'airheaded': 1, 'blonde': 29, 'holden': 9, 'gloria': 17, 'vanderbilt': 10, 'narrowest': 3, 'hanoverian': 4, 'heritage': 22, 'colonists': 17, 'monarch': 50, 'georgie': 3, 'geordie': 1, 'colorful': 180, 'nicknames': 29, 'agricultural': 22, 'chemist': 35, 'peanut': 22, '120foot': 1, 'january': 274, 'animal': 242, 'courageous': 9, 'englands': 62, 'political': 137, 'satire': 10, 'starred': 126, 'father': 338, 'michael': 195, 'douglas': 46, 'andrew': 85, 'shepherd': 22, 'thisbr': 7, 'fluffy': 10, 'comrades': 2, 'irving': 36, 'berlin': 33, 'selfmade': 4, 'aiwabr': 1, 'allergy': 6, 'feathers': 8, 'prevent': 36, 'playing': 131, 'odette': 4, 'politicians': 13, 'complain': 10, 'appearances': 3, 'unappetizing': 1, 'poultry': 10, 'circuit': 8, 'tennessee': 71, 'practicing': 9, 'colony': 64, 'striking': 13, 'resemblance': 16, 'vermont': 34, 'hyperion': 2, 'booksbr': 1, 'mammoth': 13, 'recordsbr': 1, 'miramax': 1, 'consisting': 22, 'sheeps': 5, 'minced': 5, 'bureau': 38, 'treasury': 29, 'department': 88, 'wwwmoneyfactorycom': 1, 'president': 782, 'impeached': 6, 'official': 287, 'presides': 6, 'qattara': 2, 'points': 37, 'folded': 5, 'filled': 25, 'ingredients': 11, 'mackinac': 2, 'bridge': 98, 'theyre': 346, 'unlucky': 11, 'middle': 289, 'guillaume': 4, 'cleaned': 8, 'bountiful': 1, 'wildlife': 8, 'desert': 87, 'punish': 4, 'vronsky': 2, 'turning': 30, 'daniel': 87, 'yellow': 91, 'cheddar': 5, 'bacteria': 19, 'carbon': 20, 'dioxide': 6, 'bubbles': 9, '8yearold': 2, 'newton': 19, 'cassegrain': 1, 'schmidt': 6, 'maksutov': 1, 'dodger': 20, 'outfielder': 11, 'flexible': 14, 'change': 94, 'goodbye': 33, 'publishing': 19, 'mission': 47, 'statement': 18, 'school': 215, 'located': 172, 'inindianapolis': 1, 'americas': 107, 'livable': 1, 'cities': 83, 'minister': 50, '1968br': 40, 'marching': 15, 'peanuts': 23, 'snoopy': 7, 'fancied': 2, 'flying': 77, 'communist': 37, '1938br': 6, 'yesssssssshe': 1, 'excessive': 5, 'bureaucratic': 2, 'procedure': 36, 'resulting': 6, 'inaction': 1, 'physicist': 52, '1955br': 6, 'watchmaker': 1, 'geographic': 57, 'arctic': 46, 'circle': 42, 'governor': 150, '1963br': 7, 'segregation': 6, 'tomorrow': 9, 'smallest': 127, 'fabric': 52, 'professor': 63, '1967br': 34, 'arundelbr': 1, 'johnny': 100, 'tremain': 2, 'emblem': 21, 'citizen': 21, 'canonized': 7, 'immigrant': 9, 'crooners': 5, 'francis': 57, 'ichabods': 1, 'patronymic': 1, 'hillbilly': 3, 'shakespeare': 135, 'singer': 232, 'foresaw': 2, 'aquatic': 22, 'fernando': 11, 'stoichiometry': 1, 'defined': 87, 'organization': 138, 'employed': 14, 'alexander': 112, 'waverly': 2, 'devoted': 67, 'fittingly': 41, 'sudsbr': 1, 'eberhard': 1, 'anheuser': 1, 'numbers': 49, 'appears': 58, 'position': 56, 'strike': 32, 'pharmaceuticalsbr': 1, 'bristol': 6, 'colonize': 1, 'stretch': 13, 'closely': 30, 'advertisingbr': 1, 'chapter': 133, 'awarded': 25, 'hazard': 11, 'ground': 50, 'enginesbr': 1, 'briggs': 1, 'toronado': 1, 'eliminated': 5, 'brecht': 7, 'weills': 1, 'fashionbr': 1, ...}
repeated_words = pd.DataFrame.from_dict(word_count, orient = 'index', columns = ['times_repeated'])
repeated_words.sort_values('times_repeated', ascending = False).head(30)
times_repeated | |
---|---|
called | 1340 |
country | 1065 |
french | 957 |
famous | 845 |
capital | 796 |
american | 790 |
president | 782 |
largest | 766 |
target_blankherea | 748 |
target_blanksarah | 729 |
target_blankjimmy | 674 |
played | 670 |
british | 654 |
national | 636 |
author | 589 |
island | 568 |
people | 564 |
popular | 557 |
number | 526 |
meaning | 497 |
target_blankkelly | 489 |
founded | 484 |
target_blankthisa | 481 |
english | 462 |
george | 443 |
william | 424 |
company | 420 |
ancient | 419 |
target_blankcheryl | 412 |
italian | 396 |
This approach of focusing on repeated words may not be helpful. Most of the top repeated words aren't proper nouns, and of those that are, most are broad or ambiguous. Note also that the target_blank* tokens are artifacts of HTML links embedded in the question text that survived our normalization, not real vocabulary.
If our focus is earning money in the game of Jeopardy, it will be helpful to know which types of questions are high value. We can figure this out using a chi-squared test.

First, we will need to categorize our questions into two categories: low value and high value. High value will be questions worth more than $800.
# categorization function based on value
def value_categorization(row):
    if row['value_clean'] > 800:
        return 1
    else:
        return 0
# apply function to dataframe and create new column
jeopardy['high_value'] = jeopardy.apply(value_categorization, axis = 1)
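As an aside, the same categorization can be expressed without apply; this vectorized comparison is equivalent:

# vectorized equivalent of the apply above
jeopardy['high_value'] = (jeopardy['value_clean'] > 800).astype(int)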
With our rows now categorized, let's create another function that takes in a word and returns the count of high and low value questions the word occurs in. Our goal here is to create a set of tables, the first of which looks like the one below: for each word, we count the number of high value and low value questions it appears in. Let's build that now.
word list | High Value Count | Low Value Count | |
---|---|---|---|
Word 1 | Word 1 High Count | Word 1 Low Count | |
Word 2 | Word 2 High Count | Word 2 Low Count | |
... | ... | ... | |
Word n | Word n High Count | Word n Low Count |
# count the high and low value questions a given word appears in
# (note: each call iterates over the entire dataset)
def word_value(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['question_clean'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
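A single call looks like this ('galileo' is just an illustrative pick; every call scans all 216,930 rows, so this is slow):

# example lookup for one word; returns a (high_count, low_count) tuple
word_value('galileo')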
Let's now randomly pick 10 words from the terms_used set that we created in a previous step. Then we will apply our word_value function to each of them.
import random

# choose 10 random words (random.sample needs a sequence, so convert the set to a list)
comparison_terms = random.sample(list(terms_used), k = 10)

observed_expected = []
# apply our function to each word
for word in comparison_terms:
    observed_expected.append(word_value(word))
# display the count of high and low value questions for each word
observed_expected
[(0, 1), (0, 1), (0, 1), (16, 58), (0, 2), (1, 0), (3, 4), (1, 0), (0, 1), (0, 2)]
Now that we have the observed counts of high and low value questions for a few terms, we can compute the expected counts and the chi-squared value. The expected counts are calculated by first computing two constants: the total count of high value questions and the total count of low value questions. Then, for each word, we sum its high and low counts to get the word's total occurrence count. Finally, we calculate the total count of all questions, which is the sum of the high value and low value totals. This results in another table that looks like the one below.
word list | High Value Count | Low Value Count | (sum column) |
---|---|---|---|
Word 1 | Word 1 High Count | Word 1 Low Count | Total Count Word 1 |
Word 2 | Word 2 High Count | Word 2 Low Count | Total Count Word 2 |
... | ... | ... | ... |
Word n | Word n High Count | Word n Low Count | Total Count Word n |
(sum row) | Total Count High Value | Total Count Low Value | Total Count All Words |
From here, we calculate the expected values for each word in the high value count and low value count columns. This will look like the below.
word list | Expected High Value Count | Expected Low Value Count |
---|---|---|
Word 1 | (Total Count Word 1 x Total Count High Value) / Total Count All Words | (Total Count Word 1 x Total Count Low Value) / Total Count All Words |
Word 2 | (Total Count Word 2 x Total Count High Value) / Total Count All Words | (Total Count Word 2 x Total Count Low Value) / Total Count All Words |
... | ... | ... |
Word n | (Total Count Word n x Total Count High Value) / Total Count All Words | (Total Count Word n x Total Count Low Value) / Total Count All Words |
Each cell represents the expected count for that word and value combination.
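As a worked example, take a hypothetical word that appears in 100 questions. Using the high and low value totals computed below (53,029 and 163,901 out of 216,930):

# hypothetical word appearing in 100 questions in total
total_word = 100
expected_high = total_word * (53029 / 216930)   # ~24.4
expected_low = total_word * (163901 / 216930)   # ~75.6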
After this, we will compute the two count constants and then calculate the chi-squared statistic for each word.
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
print(high_value_count)
print(low_value_count)
53029
163901
from scipy.stats import chisquare

chi_squared = []
for s in observed_expected:
    high = s[0]
    low = s[1]
    total = high + low
    # the word's share of all questions
    total_prop = total / jeopardy.shape[0]
    # expected counts if the word were spread proportionally across high and low
    expected_high = high_value_count * total_prop
    expected_low = low_value_count * total_prop
    observed = np.array([s[0], s[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared
[Power_divergenceResult(statistic=0.3235428703912728, pvalue=0.5694862483821648),
 Power_divergenceResult(statistic=0.3235428703912728, pvalue=0.5694862483821648),
 Power_divergenceResult(statistic=0.3235428703912728, pvalue=0.5694862483821648),
 Power_divergenceResult(statistic=0.31943281316869954, pvalue=0.5719487029690409),
 Power_divergenceResult(statistic=0.6470857407825455, pvalue=0.4211565342143838),
 Power_divergenceResult(statistic=3.0907805163212583, pvalue=0.07873703216726466),
 Power_divergenceResult(statistic=1.2848157961645266, pvalue=0.25700553604012244),
 Power_divergenceResult(statistic=3.0907805163212583, pvalue=0.07873703216726466),
 Power_divergenceResult(statistic=0.3235428703912728, pvalue=0.5694862483821648),
 Power_divergenceResult(statistic=0.6470857407825455, pvalue=0.4211565342143838)]
Generally we require a p-value of .05 or below to consider a result significant. None of our randomly sampled words meet that threshold, though one comes close (p ≈ 0.079), suggesting that a more targeted selection than random sampling may surface words that do.

Let's narrow our word choices to those that are frequently used.
frequently_used_over_50 = repeated_words.loc[repeated_words['times_repeated'] > 50]
frequently_used_over_50
times_repeated | |
---|---|
winter | 93 |
record | 86 |
companys | 117 |
lowest | 71 |
temperature | 58 |
... | ... |
swedish | 56 |
johnson | 70 |
indians | 55 |
clinton | 54 |
released | 53 |
657 rows × 1 columns
We're left with a manageable list of words with which we can rerun our function.
frequently_used_df = pd.DataFrame(columns=['word','high','low'])

# apply our function to each word
# (word_value scans the full dataset once per word, so this loop is slow)
index = 0
for word in list(frequently_used_over_50.index):
    frequently_used_df.loc[index, 'word'] = word
    high_low_tup = word_value(word)
    frequently_used_df.loc[index, 'high'] = high_low_tup[0]
    frequently_used_df.loc[index, 'low'] = high_low_tup[1]
    index += 1
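Because word_value rescans the entire dataset for every word, this loop makes 657 full passes. A single-pass alternative that produces the same counts, sketched here for reference:

from collections import defaultdict

# one pass over the data: tally high/low counts for every frequent word at once
target_words = set(frequently_used_over_50.index)
high_counts = defaultdict(int)
low_counts = defaultdict(int)
for _, row in jeopardy.iterrows():
    for word in set(row['question_clean'].split()) & target_words:
        if row['high_value'] == 1:
            high_counts[word] += 1
        else:
            low_counts[word] += 1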
frequently_used_df
word | high | low | |
---|---|---|---|
0 | winter | 106 | 266 |
1 | record | 131 | 586 |
2 | companys | 67 | 260 |
3 | lowest | 37 | 126 |
4 | temperature | 40 | 145 |
... | ... | ... | ... |
652 | swedish | 61 | 97 |
653 | johnson | 63 | 190 |
654 | indians | 42 | 155 |
655 | clinton | 49 | 162 |
656 | released | 50 | 231 |
657 rows × 3 columns
frequently_used_df['high'] = frequently_used_df['high'].astype(int)
frequently_used_df['low'] = frequently_used_df['low'].astype(int)
frequently_used_df['low'].dtype
dtype('int32')
frequently_used_df.iloc[:,1]
0      106
1      131
2       67
3       37
4       40
      ...
652     61
653     63
654     42
655     49
656     50
Name: high, Length: 657, dtype: int32
frequently_used_df
word | high | low | |
---|---|---|---|
0 | winter | 106 | 266 |
1 | record | 131 | 586 |
2 | companys | 67 | 260 |
3 | lowest | 37 | 126 |
4 | temperature | 40 | 145 |
... | ... | ... | ... |
652 | swedish | 61 | 97 |
653 | johnson | 63 | 190 |
654 | indians | 42 | 155 |
655 | clinton | 49 | 162 |
656 | released | 50 | 231 |
657 rows × 3 columns
def chi_square(row):
    high = row['high']
    low = row['low']
    total = high + low
    total_proportion = total / jeopardy.shape[0]
    # reuse the high/low question counts computed earlier
    expected_high = high_value_count * total_proportion
    expected_low = low_value_count * total_proportion
    observed = np.array([high, low])
    expected = np.array([expected_high, expected_low])
    # name the statistic chi_sq to avoid shadowing the function name
    chi_sq, p_value = chisquare(observed, expected)
    return expected_high, expected_low, chi_sq, p_value
results = frequently_used_df.apply(chi_square, axis = 1)
results_df = pd.DataFrame(list(results), columns = ['expected_high', 'expected_low', 'chi_square','p_value'])
results_df
expected_high | expected_low | chi_square | p_value | |
---|---|---|---|---|
0 | 90.936191 | 281.063809 | 3.302713 | 0.069166 |
1 | 175.272175 | 541.727825 | 14.800854 | 0.000119 |
2 | 79.935846 | 247.064154 | 2.770678 | 0.096005 |
3 | 39.845697 | 123.154303 | 0.268989 | 0.604011 |
4 | 45.223644 | 139.776356 | 0.798582 | 0.371518 |
... | ... | ... | ... | ... |
652 | 38.623436 | 119.376564 | 17.158286 | 0.000034 |
653 | 61.846388 | 191.153612 | 0.028480 | 0.865985 |
654 | 48.157069 | 148.842931 | 1.041900 | 0.307380 |
655 | 51.579399 | 159.420601 | 0.170726 | 0.679468 |
656 | 68.691048 | 212.308952 | 6.731396 | 0.009473 |
657 rows × 4 columns
frequently_used_df = frequently_used_df.merge(results_df, left_index = True, right_index = True)
frequently_used_df
word | high | low | expected_high | expected_low | chi_square | p_value | |
---|---|---|---|---|---|---|---|
0 | winter | 106 | 266 | 90.936191 | 281.063809 | 3.302713 | 0.069166 |
1 | record | 131 | 586 | 175.272175 | 541.727825 | 14.800854 | 0.000119 |
2 | companys | 67 | 260 | 79.935846 | 247.064154 | 2.770678 | 0.096005 |
3 | lowest | 37 | 126 | 39.845697 | 123.154303 | 0.268989 | 0.604011 |
4 | temperature | 40 | 145 | 45.223644 | 139.776356 | 0.798582 | 0.371518 |
... | ... | ... | ... | ... | ... | ... | ... |
652 | swedish | 61 | 97 | 38.623436 | 119.376564 | 17.158286 | 0.000034 |
653 | johnson | 63 | 190 | 61.846388 | 191.153612 | 0.028480 | 0.865985 |
654 | indians | 42 | 155 | 48.157069 | 148.842931 | 1.041900 | 0.307380 |
655 | clinton | 49 | 162 | 51.579399 | 159.420601 | 0.170726 | 0.679468 |
656 | released | 50 | 231 | 68.691048 | 212.308952 | 6.731396 | 0.009473 |
657 rows × 7 columns
significant_fu = frequently_used_df[(frequently_used_df['p_value'] <= .05)]
significant_fu.sort_values('chi_square', ascending=False).head(20)
word | high | low | expected_high | expected_low | chi_square | p_value | |
---|---|---|---|---|---|---|---|
23 | target_blankthisa | 727 | 889 | 395.034638 | 1220.965362 | 369.222686 | 2.763376e-82 |
323 | target_blankherea | 960 | 1479 | 596.218739 | 1842.781261 | 293.773786 | 7.486874e-66 |
65 | target_blankjimmy | 323 | 351 | 164.760734 | 509.239266 | 201.146655 | 1.173842e-45 |
66 | target_blanksarah | 323 | 406 | 178.205601 | 550.794399 | 155.711321 | 9.789275e-36 |
122 | reports | 488 | 741 | 300.431665 | 928.568335 | 154.992741 | 1.405336e-35 |
78 | target_blankkelly | 229 | 260 | 119.537090 | 369.462910 | 132.668955 | 1.068178e-30 |
67 | target_blankjon | 171 | 184 | 86.780505 | 268.219495 | 108.178537 | 2.456252e-25 |
74 | french | 816 | 1626 | 596.952095 | 1845.047905 | 106.384104 | 6.074304e-25 |
118 | target_blankcheryl | 190 | 222 | 100.714276 | 311.285724 | 104.763746 | 1.376021e-24 |
153 | author | 564 | 1114 | 410.190670 | 1267.809330 | 76.333926 | 2.395333e-18 |
21 | italian | 358 | 630 | 241.518702 | 746.481298 | 74.353198 | 6.531937e-18 |
182 | german | 370 | 684 | 257.652542 | 796.347458 | 64.838070 | 8.131289e-16 |
403 | delivers | 179 | 259 | 107.070032 | 330.929968 | 63.957255 | 1.271482e-15 |
636 | target_blankseena | 84 | 89 | 42.290218 | 130.709782 | 54.446997 | 1.596976e-13 |
26 | adjective | 201 | 346 | 133.715314 | 413.284686 | 44.811484 | 2.169465e-11 |
646 | painter | 104 | 140 | 59.646319 | 184.353681 | 43.652960 | 3.920836e-11 |
75 | meaning | 487 | 1059 | 377.922989 | 1168.077011 | 41.667856 | 1.081729e-10 |
22 | composer | 194 | 338 | 130.048532 | 401.951468 | 41.623021 | 1.106823e-10 |
354 | african | 275 | 542 | 199.717388 | 617.282612 | 37.558781 | 8.869919e-10 |
419 | philosopher | 71 | 93 | 40.090149 | 123.909851 | 31.542359 | 1.951367e-08 |
print("Of our frequently used words,",len(frequently_used_df[(frequently_used_df['expected_high'] < frequently_used_df['expected_low'])]), "have a statistically significant higher use in low value questions.")
print("Of our frequently used words,",len(frequently_used_df[(frequently_used_df['expected_high'] > frequently_used_df['expected_low'])]), "have a statistically significant higher use in high value questions.")
Of our frequently used words, 657 have a statistically significant higher use in low value questions. Of our frequently used words, 0 have a statistically significant higher use in high value questions.
Our strategy of identifying frequently used, disproportionately high value terms for study is not viable. We took the terms that were used more than 50 times and ran a chi-squared test to determine which of those were used disproportionately in either the high or low value category. The expected-count comparison above mostly reflects that low value questions outnumber high value questions roughly three to one, so every word's expected low count exceeds its expected high count. Among the words that passed the significance threshold, the strongest signals are the target_blank* HTML artifacts, and the genuine words that skew toward high value questions (such as 'french', 'author', and 'composer') are too broad to direct study. Thinking about this logically, Jeopardy is likely to assign low values to questions drawing on common knowledge, while high value questions cover material that is niche and less known. In this way the difficulty scales with the dollars at stake.
One strategy could be to target the low value questions; however, because so many topics fall in this bucket, there is no clear starting point.
In conclusion, the most viable strategy is to understand which broad topics are frequently asked and focus on gaining general knowledge there. This, however, is not a good strategy for someone looking to 'game' the game of Jeopardy.