Using Hypothesis Testing to Investigate Patterns in Jeopardy Questions

Zachary Rubens
Feb 19, 2021

Jeopardy is a popular US TV quiz show where participants answer questions to win money. It was hosted for many years by the absolute legend Alex Trebek until his passing in November 2020. I wrote this project on New Year’s Eve, working from the hypothetical assumption that I would like to compete on Jeopardy.

To gain a competitive advantage, I’m going to look at a dataset of Jeopardy questions to see if I can find patterns that might help me win (again, should I decide to compete). We’ll be using a dataset named jeopardy.csv containing 20,000 rows. Here are the columns in the dataset and what they mean (if you’d like to download it yourself, you can follow this link):

  • Show Number — the Jeopardy episode number of the show this question was in.
  • Air Date — the date the episode aired.
  • Round — the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  • Category — the category of the question.
  • Value — the number of dollars answering the question correctly is worth.
  • Question — the text of the question.
  • Answer — the text of the answer.

Normalizing text and value columns

To avoid mistakes in our analysis, we’re going to normalize the text columns (the ‘Question’ and ‘Answer’ columns). This ensures that words differing only in case or punctuation aren’t treated as different words when compared.

Let’s write a function that:

  • Takes in a string.
  • Converts the string to lowercase.
  • Removes all punctuation in the string.
  • Returns the string.
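import re
import pandas as pd

# load the dataset (assumes jeopardy.csv is in the working directory
# and its column names match the list above)
jeopardy = pd.read_csv('jeopardy.csv')

def normalize_text(text): # takes in string
    text = text.lower() # make lower case
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # removes punctuation
    text = re.sub(r'\s+', ' ', text) # collapses repeated whitespace
    return text # returns string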

# apply function to question and answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
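
As a quick sanity check, here is a minimal sketch of what the function does to a single string. The sample clue is made up for illustration and is not taken from the dataset.

```python
import re

def normalize_text(text):
    text = text.lower()                          # make lower case
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)   # drop punctuation
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    return text

# hypothetical clue, only for illustration
sample = "In 1492, Columbus   sailed the 'Ocean Blue'!"
cleaned = normalize_text(sample)
print(cleaned)  # in 1492 columbus sailed the ocean blue
```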

We’re going to complete a couple more steps before getting started. The ‘Value’ column should be numeric to allow us to manipulate it more easily, while the ‘Air Date’ column should be in datetime format, not in string format.

For the Value column, let’s write a function that:

  • Takes in a string.
  • Removes any punctuation in the string.
  • Converts the string to an integer.
  • If the conversion has an error, assigns 0 instead.
  • Returns the integer.
def normalize_values(text):
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # remove punctuation
    try:
        text = int(text)
    except Exception: # if the conversion fails, fall back to 0
        text = 0
    return text

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
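
To see the fallback in action, here is a small sketch that runs the same logic on a few strings. The sample values are assumptions about what the ‘Value’ column might contain (dollar amounts with symbols and commas, plus a non-numeric placeholder), not rows copied from the dataset.

```python
import re

def normalize_values(text):
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # remove punctuation such as $ and ,
    try:
        return int(text)
    except ValueError:  # non-numeric text falls back to 0
        return 0

# hypothetical sample values, only for illustration
samples = ["$200", "$2,000", "None"]
cleaned_values = [normalize_values(s) for s in samples]
print(cleaned_values)  # [200, 2000, 0]
```

One thing to keep in mind with this design: anything that can’t be parsed becomes 0, so it ends up grouped with the low-value questions later on.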

Let’s now quickly change the Air Date column to a datetime column.

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

Is the answer deducible from the question?

Now that our dataset is cleaned up, we’re going to try and answer our first question:

  • How often is the answer deducible from the question?

To do so, we need to see how often words in the answer also occur in the question. Let’s create a function that takes in a row, checks each word of the answer against the question, and returns the fraction of answer words that also appear in the question. The score for each row will be put into a new column called ‘answer_in_question’. Let’s give it a go:

def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the") # the is the most common word
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy['answer_in_question'].mean()

According to our analysis, on average only about 0.059 (roughly 6%) of the words in an answer also appear in its question. Consequently, I don’t think we’ll be able to rely on words in the question showing up in the answer as part of our studying strategy for Jeopardy 2021.
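
To make the metric concrete, here is a minimal sketch that scores one made-up row. The row is a plain dictionary standing in for a DataFrame row, and its text is hypothetical.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")  # ignore the most common word
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# hypothetical row, only for illustration
toy_row = {
    "clean_question": "this city is home to the eiffel tower",
    "clean_answer": "the eiffel tower in paris",
}
score = count_matches(toy_row)
print(score)  # 0.5
```

After dropping "the", the answer has four words, and two of them ("eiffel" and "tower") appear in the question, so the score is 0.5.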

How often are questions repeats of older ones?

Since we only have about 10% of the full Jeopardy question dataset, we can’t answer this question completely, but we can at least work with what we have.

To figure this out, we’ll have to:

  • Sort jeopardy in order of ascending air date.
  • Maintain a set called terms_used that will be empty initially.
  • Iterate through each row of jeopardy.
  • Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
  • If it does, increment a counter.
  • Add each word to terms_used.

Just like the previous function that removed the word ‘the’, we’re removing words shorter than 6 characters so that we can filter out redundant words like ‘the’ and ‘than’. These words don’t tell us a whole lot about the question unless it happens to be “the the the the the than?”

Let’s give it a go.

question_overlap = [] # creating an empty list
terms_used = set() # creating an empty set of terms used

jeopardy = jeopardy.sort_values('Air Date') # sorting by air date

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0

    for word in split_question: # if word already used
        if word in terms_used: # increasing match count
            match_count += 1
    for word in split_question: # adding word to terms_used
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)

    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

According to our results, on average about 70% of the longer words in a question have already appeared in an earlier question. Of course, we’re only working with a sample of all Jeopardy questions, and this measures overlap in single terms rather than whole phrases, so the result isn’t conclusive, but it does give us reason to look into this a bit more.
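
To show how the overlap score behaves, here is a rough sketch of the same loop on three made-up questions, assumed to be already cleaned and sorted by air date. The helper name overlap_scores and its min_length parameter are my own, introduced only for this illustration.

```python
def overlap_scores(questions, min_length=6):
    terms_used = set()
    scores = []
    for question in questions:
        # keep only the longer words, as in the analysis above
        words = [w for w in question.split() if len(w) >= min_length]
        seen_before = sum(1 for w in words if w in terms_used)
        scores.append(seen_before / len(words) if words else 0)
        terms_used.update(words)  # add this question's words afterwards
    return scores

# hypothetical questions, only for illustration
toy_questions = [
    "this president signed the emancipation proclamation",
    "the emancipation proclamation took effect in this year",
    "this planet is known as the red planet",
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second scores 2/3 because two of its three long words were used before, and the third scores 0. Early questions will always score low since the set starts empty, which is worth remembering when reading the overall mean.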

High-value vs low-value questions

Finally, let’s say we wanted to study more high-value questions instead of low-value questions. That would be the perfect way to make more money, right?

Well, we can actually figure out which terms correspond to high-value questions using a chi-squared test. First, we’ll need to split the questions into two categories based on the ‘Value’ column:

  • Low value: any row where ‘Value’ is 800 or less.
  • High value: any row where ‘Value’ is greater than 800.

Then we can loop through each of the terms in terms_used from the previous section and:

  • Find the number of low-value questions the word occurs in.
  • Find the number of high-value questions the word occurs in.
  • Find the percentage of questions the word occurs in.
  • Based on the percentage of questions the word occurs in, find expected counts.
  • Compute the chi-squared value based on the expected counts and the observed counts for high and low-value questions.

Then we can find the words with the biggest differences in usage between high and low-value questions by selecting words with the highest associated chi-squared values. We’ll just do this with a small sample for now, since doing it for all the words would take a while.
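
Before running this on real data, it may help to walk through the arithmetic for a single term by hand. The counts below are hypothetical and chosen only to keep the numbers simple.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
total_questions = 1000
high_value_count = 300
low_value_count = 700

# Suppose a term shows up in 10 high-value and 10 low-value questions.
observed_high = 10
observed_low = 10

total = observed_high + observed_low
total_prop = total / total_questions  # share of all questions with the term

# If the term were unrelated to value, its occurrences would split
# in the same 300:700 ratio as the questions themselves.
expected_high = total_prop * high_value_count  # about 6
expected_low = total_prop * low_value_count    # about 14

chi_squared = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)
print(round(chi_squared, 2))  # 3.81
```

With two categories there is one degree of freedom, where the usual 5% cutoff is about 3.84, so even this lopsided-looking term would fall just short of significance.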

from random import choice

def determine_value(row): # determining if high-value or low-value
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

# creating new column to determine high-value questions
jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

def high_low_count(term): # counting high- and low-value questions containing term
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

terms_used_list = list(terms_used) # making list from terms_used
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(high_low_count(term))

observed_expected

Okay, we’ve found the observed counts for a few terms, so now we’ll compute the expected counts and the chi-squared value.

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]

    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_count, expected_low_count])
    chi_squared.append(chisquare(observed, expected))

chi_squared

It looks like none of the terms had a significant difference in usage between high-value and low-value rows. Also, the frequencies were all below 5, and the chi-squared test is unreliable with counts that small. Further analysis would need to be done on terms with higher frequencies.
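
One possible way to follow up, sketched here rather than actually run on the dataset, is to pick terms by how many questions they appear in instead of choosing them at random. The helper name frequent_terms and its parameters are assumptions of mine, and the toy questions are made up.

```python
from collections import Counter

def frequent_terms(clean_questions, min_length=6, min_questions=5):
    counts = Counter()
    for question in clean_questions:
        # a set ensures each word counts once per question
        counts.update({w for w in question.split() if len(w) >= min_length})
    return [term for term, n in counts.items() if n >= min_questions]

# hypothetical cleaned questions, only for illustration
toy_questions = [
    "the capital of france",
    "the capital of spain",
    "a famous french painter",
    "a famous spanish painter",
    "capital letters begin sentences",
]
common = frequent_terms(toy_questions, min_questions=2)
print(sorted(common))  # ['capital', 'famous', 'painter']
```

A threshold on the total count is only a first filter. The usual rule of thumb applies to the expected counts in each group, so a term would still need enough occurrences for both the high-value and low-value expected counts to reach 5.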

Conclusion

It’s looking like Jeopardy questions are a tough nut to crack. We were able to delve into three specific questions:

  • How often is the answer used in the question?
  • How often are terms recycled in questions?
  • Do certain terms in questions relate to more high-value questions?

We didn’t have much luck with these questions, but we did get some indication that terms are recycled, which would warrant further analysis.
