December 16, 2018

Our Story
or How We Got Here

This project was a required part of CS 410 Text Information Systems at the University of Illinois at Urbana-Champaign. CS 410 is a class in the Master's in Computer Science program with a specialization in Data Science. We needed an interesting and possibly even work-related project.

After searching for options, we approached one user group with what we thought would be an interesting project. The investigator there said he would be interested in a tool that scans comment sections on websites for hate-related speech. His group uses this information to investigate people intent on committing hate-related crimes. He informed me that comments posted on a website were recently used to identify someone who was later arrested for a hate crime. This seemed simple enough, but after some discussion it became clear that some words or topics have greater investigative value than others. A light came on: what they were seeking was the ability to rank one search term above others. We had a number of false starts with MongoDB, a scraping tool, and attempts to customize MetaPy. After some research we found a sample code base that we were able to understand and then extend.

One final thought concerns BM25. We made dozens of different adjustments, using logs, division, and crazy math. Eventually we settled on a very simple adjustment to BM25, as it moved documents the best for our needs.
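
To give a feel for what a simple adjustment of this kind might look like, here is a minimal sketch in Python. It is not our actual code: it assumes the adjustment is a per-term multiplier applied to each term's standard Okapi BM25 contribution. The function name bm25_scores, the term_boost dictionary, and the sample comments are all made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, term_boost=None, k1=1.2, b=0.75):
    """Score tokenized documents against a query with Okapi BM25.

    term_boost is a hypothetical dict mapping a term to a multiplier
    (1.0 when absent). It is an assumption made for this sketch, not
    the adjustment used in the project.
    """
    term_boost = term_boost or {}
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF (the "+ 1.0" keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            )
            # The assumed adjustment: scale this term's contribution.
            score += term_boost.get(term, 1.0) * idf * norm
        scores.append(score)
    return scores


# Made-up sample comments, already split into tokens.
docs = [
    "the meeting is on friday".split(),
    "a threat was posted in the comments".split(),
    "the comments mention a meeting".split(),
]
query = ["meeting", "threat"]

plain = bm25_scores(query, docs)
boosted = bm25_scores(query, docs, term_boost={"threat": 3.0})
```

With a multiplier like this, documents containing the boosted term move up the ranking while documents that lack it keep their original score, which is the kind of "rank one term above others" behavior the investigator described.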

Learn more about BM25 and our implementation here
Link to BM25