In the early days of COVID-19 I watched a fair amount of Jeopardy. My wife, Katie, and I like to play along and see if we can get a respectable number of answers right. After watching a few episodes on Netflix I started to wonder if there was a way that we could play Jeopardy on the PC or Nintendo Switch.
The reviews for the Switch version didn’t look very good and the most recent PC version wasn’t available digitally, so I figured why not make my own quiz game? In the end the game I made doesn’t have much in common with Jeopardy. It’s more of a drop-in arena where any number of players attempt to answer the same multiple choice question. Players can play forever and rack up obscenely high scores if they are so inclined. The game itself is very basic and nothing impressive to look at or play, but I feel like the underlying technology that powers it is worth writing about.
If you’re interested, give it a try! It’s definitely no masterpiece, but it’ll quiz you about music from 1950-2005. Most of the time the questions and provided answers will make sense! If you’re stuck on the name entry screen it’s because I made the websocket server a little shoddily and it crashed. That’s not really important for this article though - I want to focus on how I created the question and answer sets rather than the game itself. If you’re interested you can take a look at the source code for the Python server containing the NLP code and the front-end client.
The most important part of most quiz games is the questions - creating a diverse set of questions (and answers, if you’re making something that involves multiple choice) can be a time-consuming task. I figured that Wikipedia would be a great place to get questions from because of the way that articles link and present data in a relatively consistent fashion. Being able to make use of hyperlinked data allows for easy identification of what we can remove from a sentence to make a question. Following the data behind the hyperlinks provides more information about the possible answer which helps us categorize the data.
Take a look at this excerpt from the Wikipedia article about 1984 in music:
This sentence gives us three different answers we can remove to quiz the player about: May 2nd, Lionel Richie, and Motown. These answers can be categorized as being a date, a person (male), and an organization. While it’d be nice to not have to categorize people according to their gender, it makes the possible answers presented to the user jarring at times, especially when the rest of the sentence uses gendered pronouns. As usual, the relationship between gender and language is complex and we must respect that here so that sentences and questions read properly.
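To make the idea concrete, here is a minimal sketch of turning a sentence into a multiple choice question by blanking out one answer and drawing the other choices from the same category. The function names, the placeholder sentence, and the answer pool are all invented for illustration and are not taken from the game’s source code.

```python
import random


def make_question(sentence, answer, blank="_____"):
    """Replace the first occurrence of the answer with a blank."""
    if answer not in sentence:
        raise ValueError(f"{answer!r} does not appear in the sentence")
    return sentence.replace(answer, blank, 1)


def pick_choices(answer, category, pool, count=4, rng=None):
    """Return the correct answer mixed with distractors from the same category."""
    rng = rng or random.Random()
    distractors = [fact for fact, cat in pool if cat == category and fact != answer]
    chosen = rng.sample(distractors, min(count - 1, len(distractors)))
    choices = chosen + [answer]
    rng.shuffle(choices)
    return choices


# Hypothetical (fact, category) pairs; a real pool would come from the scraper.
pool = [
    ("Example Records", "organization"),
    ("Sample Label", "organization"),
    ("Demo Music Group", "organization"),
    ("Placeholder Sounds", "organization"),
    ("January 1", "date"),
    ("Example Band", "band"),
]

sentence = "On January 1, Example Band signed with Example Records."
question = make_question(sentence, "Example Records")
choices = pick_choices("Example Records", "organization", pool, rng=random.Random(0))

print(question)
print(choices)
```

Keeping the distractors in the same category is what stops a date or a band name from showing up as a choice when the blank is clearly a record label.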
Before this project I had never really done much in the way of machine learning. I chose to use Python simply because I’ve heard that it’s a really good language for ML tasks thanks to a robust set of libraries. I have to admit that I feel a little clumsy with Python, but the diverse set of libraries made my experience a pleasurable one.
I started by looking at the Wikipedia API, but quickly determined that it was more than what I needed. Since many Wikipedia pages are a series of lists I chose to simply scrape the HTML. This approach is prone to breaking if Wikipedia updates its markup, but considering how little Wikipedia’s appearance has changed over the last decade I figured this was a safe approach for a side project.
I used BeautifulSoup4 to perform the scraping and mainly focused on lists that are seen in “x year in music” articles. This would give me a pretty wide set of data since every year has several bullet point lists of events that happened that year.
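As a rough sketch of what scraping a bullet list with BeautifulSoup4 can look like, the snippet below pulls the sentence and its hyperlinks out of each list item. The sample HTML is a simplified placeholder, and the extract_facts name and the returned dictionary shape are assumptions of mine; real Wikipedia markup is messier and the actual crawler may be organized differently.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one bullet list from an "x in music" page.
SAMPLE_HTML = """
<ul>
  <li>On <a href="/wiki/Example_Date">January 1</a>,
      <a href="/wiki/Example_Band">Example Band</a> signed with
      <a href="/wiki/Example_Records">Example Records</a>.</li>
  <li>A list item with no links at all.</li>
</ul>
"""


def extract_facts(html):
    """Return one entry per list item that contains at least one link."""
    soup = BeautifulSoup(html, "html.parser")
    facts = []
    for item in soup.find_all("li"):
        links = [(a.get_text(), a["href"]) for a in item.find_all("a", href=True)]
        if not links:
            continue  # nothing to blank out, so no question can be made
        # Collapse the line breaks and indentation left over from the markup.
        sentence = " ".join(item.get_text().split())
        facts.append({"sentence": sentence, "links": links})
    return facts


facts = extract_facts(SAMPLE_HTML)
for fact in facts:
    print(fact["sentence"], fact["links"])
```

Each link text is a candidate answer, and the href is what you would follow to fetch the summary used for classification.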
After doing some searching online I realized I’d need to use natural language processing to categorize the various answers that I’d be collecting. Categorizing the questions themselves was an easy task - I just needed to show what page I scraped it from in order to give the player a hint. However, categorizing the answers so that users would see a series of answers that make sense together was a more challenging task. I ended up using scikit-learn as my basis for a classification model.
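When building a model you need to do some manual work in order to lay out your training data. I scraped a few years’ worth of data from the “x in music” pages and manually placed the answers into the following categories: album, award, band, chart, date, event, genre, location, organization, person-female, person-male, and song.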
Most of the answers on these pages fall into one of the above categories. It took some tweaking to establish what the base categories should be, but I found that these worked well enough. When scraping I only saved the first few sentences from each answer, as overly verbose data seemed to make the models less accurate.
Here’s an example of what an entry for album/good news for people who loves bad news_fact_data.txt looks like:
Good News for People Who Love Bad News is the fourth studio album by American rock band Modest Mouse, released on April 6, 2004 by Epic Records. Jeremiah Green, who played drums on all other Modest Mouse releases, did not perform on this album due to his temporary absence from the band. Good News for People Who Love Bad News was nominated for the Grammy Award for Best Alternative Music Album in 2005.
Once I felt I had the data sufficiently populated to train, I set up scikit-learn to create a model. If you’re looking at my source code this can all be seen in
The first step in model creation is to lemmatize the text. The lemmatization process I followed involves removing any special characters, single letter words, multiple spaces, pluralization, and capitalization. This is done so the machine learning algorithms can more easily work with the text - it’s easier to analyze and classify when language is broken down into its dictionary forms.
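The cleanup steps listed above can be sketched with a few regular expressions. This is only an approximation under my own assumptions: the clean_text name and the optional lemmatize callback are hypothetical, and the dictionary lookup in the usage example is a toy stand-in for a real lemmatizer (NLTK’s WordNetLemmatizer().lemmatize could be passed instead once the WordNet corpus is downloaded).

```python
import re


def clean_text(text, lemmatize=None):
    """Normalize raw text before it is handed to a vectorizer."""
    text = re.sub(r"\W", " ", text)            # special characters become spaces
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop single letter words
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    words = text.lower().split()               # remove capitalization
    if lemmatize is not None:
        # Reduce each word to its dictionary form, e.g. "albums" -> "album".
        words = [lemmatize(word) for word in words]
    return " ".join(words)


plain = clean_text("Good News for People Who Love Bad News is a 2004 album!")

# Toy lemmatizer used only so the example runs without extra downloads.
toy_lemmas = {"albums": "album"}
lemmatized = clean_text("The band's 2 albums", lemmatize=lambda w: toy_lemmas.get(w, w))

print(plain)
print(lemmatized)
```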
There are a few different libraries to help you lemmatize in Python - I chose the WordNetLemmatizer from the natural language toolkit, which provides the
After lemmatization the words are vectorized - simply put, this is the process of converting the words into numeric form. This allows computers to classify and compare naturally written language. I used the TfidfVectorizer from scikit-learn for this and it’s a relatively straightforward process.
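from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# documents is the list of lemmatized text samples from the previous step
vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
# Convert every document into a row of TF-IDF weights
X = vectorizer.fit_transform(documents).toarray()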
min_df, max_df, and stop_words are provided to the TfidfVectorizer constructor in order to make the vectorization work better for our needs.
df is short for document frequency - if a term is seen too much or too little in the text it will not be vectorized. With min_df=7 a term must appear in at least 7 documents, and with max_df=0.8 a term that appears in more than 80% of the documents is ignored.
stop_words is provided values from the natural language toolkit to remove the most common words in the English language - these words are so common they make classification less accurate. Stop words are pretty much noise when it comes to NLP.
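To see what the document frequency limits actually do, here is a small sketch on a made-up four-sentence corpus. The sentences and the min_df=2 setting are chosen only so the effect is visible on tiny data; stop words are left out to keep the example focused on min_df and max_df.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "music" is in every document, several words appear only once.
documents = [
    "music band released rock album",
    "music band toured europe",
    "music singer released jazz album",
    "music label signed singer",
]

# min_df=2   -> keep terms found in at least 2 documents
# max_df=0.8 -> drop terms found in more than 80% of documents
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
X = vectorizer.fit_transform(documents).toarray()

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X.shape)
```

Here “music” is dropped for being too common and words like “europe” or “jazz” are dropped for being too rare, which leaves album, band, released, and singer as the only features.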
Another important part of building a model involves separating the sample data into test data and training data. In order to be able to evaluate how well the model works you must separate the data so that some data can be used to train the model while the leftover data is used to gauge how effective the model is. The output from train.py looks like this:
               precision    recall  f1-score   support

        album       0.88      0.94      0.91        16
        award       1.00      0.40      0.57         5
         band       0.88      0.90      0.89        41
        chart       1.00      1.00      1.00         5
         date       1.00      1.00      1.00        31
        event       0.75      0.60      0.67        10
        genre       1.00      0.67      0.80         6
     location       0.81      0.89      0.85        28
 organization       1.00      1.00      1.00         4
person-female       0.80      0.42      0.55        19
  person-male       0.71      0.87      0.78        46
         song       0.86      0.90      0.88        20

     accuracy                           0.84       231
    macro avg       0.89      0.80      0.83       231
 weighted avg       0.85      0.84      0.84       231
As you can see, the accuracy for this model is OK - not great, but OK. Good enough for my needs at least. When looking at this I normally fixate on the macro avg. Some categories, person-male among them, could definitely use some improvement - there’s a chance that there is some bad training data in the mix. Garbage in, garbage out.
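The split between training and test data described above can be sketched roughly like this. The article doesn’t say which scikit-learn estimator train.py uses, so LogisticRegression is purely my assumption, and the toy sentences and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Invented training data: ten "album" summaries and ten "date" summaries.
texts = [f"studio album number {i} released by a rock band" for i in range(10)]
texts += [f"day {i} of the month in the gregorian calendar year" for i in range(10)]
labels = ["album"] * 10 + ["date"] * 10

# Hold back 20% of the samples so the model is judged on data it never saw.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)

predictions = classifier.predict(X_test)
report = classification_report(test_labels, predictions)
print(report)

# The highest class probability can serve as a per-answer confidence value.
confidences = classifier.predict_proba(X_test).max(axis=1)
print(confidences)
```

The printed report has the same layout as the train.py output shown earlier, just with two categories instead of twelve.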
After that the model is pickled, which is Python’s way of serializing data, so that we can save it to a file and use it later. In other words, the model is ready for use!
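Pickling itself only takes a couple of lines. The sketch below uses a plain dictionary as a stand-in for the trained model and a temporary directory for the file, both of which are assumptions made so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained classifier; any picklable Python object works the same way.
model = {"categories": ["album", "band", "date"], "version": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pickle"

    # Serialize the object to a file...
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)

    # ...and load it back, as a separate script could do later.
    with model_path.open("rb") as handle:
        restored = pickle.load(handle)

print(restored == model)
```

One caveat worth remembering: unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.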
crawler.py can now be used to scrape the provided Wikipedia pages and will do its best to classify answers according to their categories. This means that by running something like
python3 crawler.py "2004 in music" "Events"
we can automatically generate a set of questions based on events in music during 2004! How the model is used can be seen in the get_facts_and_metadata_from_html function. It ends up writing the answers in a way that can be analyzed by the quiz software - it saves the fact (which is the answer), a summary of the fact (which is the first few sentences of the Wikipedia article), which category the answer falls into, and the confidence of the model. The confidence is saved so that the quiz can reject answers that have a low confidence - this keeps out answers that might not make sense.
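Rejecting low-confidence answers on the quiz side can be as simple as a filter over the saved records. In this sketch the Answer class, the 0.6 threshold, and the sample records are all hypothetical; the real file format and cutoff may differ.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    fact: str          # the answer text that gets blanked out
    summary: str       # first few sentences of the linked article
    category: str      # predicted category, e.g. "band"
    confidence: float  # how sure the model was about the category


def usable_answers(answers, min_confidence=0.6):
    """Keep only the answers whose category the model was reasonably sure of."""
    return [answer for answer in answers if answer.confidence >= min_confidence]


# Placeholder records, not real crawler output.
answers = [
    Answer("Example Band", "Example Band is a rock band.", "band", 0.91),
    Answer("Example Thing", "Example Thing is hard to classify.", "event", 0.34),
    Answer("January 1", "January 1 is the first day of the year.", "date", 0.99),
]

kept = usable_answers(answers)
print([answer.fact for answer in kept])
```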
From there the quiz questions and answers are served up via a websocket server that the Vue-based front end listens to. The websocket server also listens for users connecting, disconnecting, and answering questions. It also analyzes the users’ answers to keep score. None of the user data is persisted in any way - if the server dies or a user disconnects their score is lost forever.
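The in-memory score keeping can be pictured with a small class like the one below. It leaves out the websocket layer entirely, and the Scoreboard name, its methods, and the one-point-per-correct-answer rule are assumptions for illustration rather than the server’s actual code.

```python
class Scoreboard:
    """Keeps scores in a dictionary only, so nothing survives a restart."""

    def __init__(self):
        self._scores = {}

    def connect(self, player_id):
        # New players start at zero; reconnecting players are left untouched.
        self._scores.setdefault(player_id, 0)

    def disconnect(self, player_id):
        # Dropping the entry is what makes a disconnected player's score vanish.
        self._scores.pop(player_id, None)

    def answer(self, player_id, choice, correct_choice):
        if player_id not in self._scores:
            raise KeyError(f"unknown player: {player_id}")
        if choice == correct_choice:
            self._scores[player_id] += 1
        return self._scores[player_id]

    def score(self, player_id):
        return self._scores.get(player_id)


board = Scoreboard()
board.connect("katie")
board.answer("katie", "Example Records", "Example Records")
board.answer("katie", "Sample Label", "Example Records")
score_while_connected = board.score("katie")

board.disconnect("katie")
score_after_disconnect = board.score("katie")
print(score_while_connected, score_after_disconnect)
```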
I feel like those are probably the most interesting tidbits about how I used NLP to create a quiz game - feel free to browse around the (sometimes spaghettiesque) code and let me know if you have any questions! I hope that you’ve found this article interesting and informative.