Using Machine Learning to Filter the News

By: on October 2, 2020

In Oliver Wyman’s recent Hackathon, we chose to create a News Filtering solution. The idea was to let users who are sensitive to certain issues read the news without stumbling upon those issues. This could be useful in many cases, for example, to allow people with health anxiety to read the news during this pandemic. In the longer term we envisaged a Chrome extension where the user states on setup which topics they would like to avoid, and news about those topics is then filtered out of popular news websites when the user visits them. Of course, we couldn’t get all the way there in the 48 hours allowed for the hackathon, but we did get far enough along to find some interesting issues, which we’d like to share.

Due to time limitations, we trained our model using past articles falling under only three classifications: Brexit, football, and coronavirus. When we then used the model to classify articles about other topics, we found that it predicted their classifications in a logical way, assigning each article to the most closely related category it knew. For example, UK politics articles were automatically categorized as Brexit news, basketball and golf were classified as football, and all US news was classified as coronavirus news. This highlights the importance of a large training data set with a variety of categories for achieving reasonable accuracy, which our model does not yet have, but it also shows the inherent power of machine learning algorithms to generalize.
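
To illustrate why a model trained on three categories has to file every article under one of them, here is a minimal sketch. The post does not say which model we used, so the tiny Naive Bayes classifier below is an assumption chosen for brevity, and the training snippets, `train`, and `predict` are hypothetical names and data made up for this example.

```python
import math
from collections import Counter

# Hypothetical, hand-made training data: already-preprocessed keyword strings.
TRAINING = {
    "brexit": ["parliament vote deal eu minist", "eu trade deal border minist"],
    "football": ["player score goal match win", "team win leagu match player"],
    "coronavirus": ["viru infect lockdown death case", "lockdown case hospit infect viru"],
}

def train(training):
    """Count how often each word appears under each label."""
    counts = {label: Counter(" ".join(docs).split()) for label, docs in training.items()}
    vocab = {word for counter in counts.values() for word in counter}
    return counts, vocab

def predict(tokens, counts, vocab):
    """Return the best-scoring known label; there is no 'none of the above'."""
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        # Laplace-smoothed log-likelihood; equal class priors are assumed,
        # and words never seen in training are ignored.
        scores[label] = sum(
            math.log((counter[t] + 1) / (total + len(vocab)))
            for t in tokens
            if t in vocab
        )
    return max(scores, key=scores.get)

counts, vocab = train(TRAINING)

# A golf article shares words such as "player" and "win" with football.
golf_tokens = "player win tournament cours".split()
print(predict(golf_tokens, counts, vocab))
```

Because `predict` can only return one of the labels it was trained on, an article about an unseen topic ends up under whichever known category shares the most vocabulary with it. Adding more categories, or a minimum score below which the article is left unclassified, would be two ways to soften this.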

As well as learning about the importance of the data used, we also learned about the struggles that can be faced when putting together a data set. One of our first steps in the hackathon was to create our training data set, as we knew it was vital for training our AI and completing our project. We looked for a set of categorized news articles we could draw from and found that the Guardian has an API (documented here) which fulfilled our requirements exactly. But then we hit a snag: how could we make sure the news we collected about non-coronavirus topics didn’t mention coronavirus, when coronavirus is such a pervasive topic? In the end, we went with a very simple solution, using only pre-2020 articles for the non-coronavirus topics. However, if the tool were expanded to filter other, less recent topics, this solution would not be possible, which could vastly increase the effort needed to create a good training data set.
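
The pre-2020 cut-off can be expressed as a date filter on the API request. The sketch below only builds the request URL and does not call the API. It assumes the Guardian content API's `search` endpoint with its `section`, `to-date`, `show-fields`, `page-size`, and `api-key` query parameters. The helper name `build_search_url`, the `football` section, and the placeholder key are illustrative assumptions, not the code we wrote during the hackathon.

```python
from urllib.parse import urlencode

GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"

def build_search_url(section, api_key, to_date=None, page_size=50):
    """Build a search URL for one section, optionally capped at a date."""
    params = {
        "section": section,
        "show-fields": "bodyText",  # ask for the article text, not just metadata
        "page-size": page_size,
        "api-key": api_key,
    }
    if to_date is not None:
        # Only return articles published on or before this date (YYYY-MM-DD).
        params["to-date"] = to_date
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"

# Non-coronavirus topic: restrict to articles from before 2020.
url = build_search_url("football", api_key="YOUR_KEY", to_date="2019-12-31")
print(url)
```

For the coronavirus category the `to_date` argument would simply be left out, since those articles are all recent by nature.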

After collecting our news articles, we processed them into a form that a model can work with. We did this using a word tokenizer package (nltk) to tokenize sentences into word snippets. Through tokenization, a sentence like “he is great at playing football” is split into the two-word snippets ‘he is’, ‘is great’, ‘great at’, ‘at playing’, ‘playing football’, which are then used as input features for the model. Note that words like ‘is’ and ‘at’ are general words that do not improve the model as such. Therefore, before tokenizing sentences, we edited them. We removed general words, punctuation marks, formatting, and non-alphanumeric characters. We also made all text lower case and reduced words to their stems, so that different tenses and forms of a word are treated as the same word. For example, the sentence “He is great at playing football!” would be edited into the following form: ‘he great play football’. This processing turned each article into a list of keywords. Here is an example of an article after some processing has been done, which gives an interesting insight into the data the model is actually trained on. The link to the original article is here.

summari biggest develop global coronaviru outbreak josh halliday thu may bst last modifi thu may bst key develop global coronaviru outbreak today includ south korea tighten restrict metropolitan area seoul spike infect restrict lift across countri may outbreak appear brought control howev offici record biggest spike infect nearli two month prompt closur museum park art galleri seoul area two week friday half south korea million peopl live metropolitan area restrict tighten data john hopkin univers show unit state record death covid move past sombr mileston even mani state relax mitig measur stop spread novel coronaviru us record death diseas countri pandem almost three time mani second rank countri britain donald trump remain silent death american covid us mourn mileston presid made comment twitter moment day use platform attack tech compani tri censor day twitter put fact check warn one claim number peopl infect coronaviru exceed million accord data compil john hopkin univers us home peopl diseas around world data show way ahead brazil russia uk spain itali true number infect like much higher howev given vast number unrecord asymptomat case un world food program warn least million peopl could go hungri latin America coronaviru pandem rage new project releas late wednesday estim startl four fold increas sever food insecur third european foreign direct invest project announc either delay cancel outright coronaviru pandem annual survey profession servic group ey found project question alreadi place continu albeit downgrad capac recruit ey said delay percent cancel europ attract survey found ireland face deepest ever recess coronaviru lockdown devast job strain public financ think tank said thursday report ireland econom social research institut predict nation gross domest product gdp declin year like scenario govern plan lift lockdown august economic struggl return pre pandem level owe physic distanc measur esri said sixti one conserv mp defi british pm bori johnson 
call move domin cum crisi senior minist broke rank accus special advis inconsist account behaviour lockdown former chancellor sajid javid also said journey necessari justifi number backbench call cum resign sack grew tori mp weigh criticis two condemn cum govern whip question rais studi publish lancet prompt world health organ halt global trial drug hydroxychloroquin lancet said author investig urgent appar discrep data come amid scientist concern rigor standard fall waysid race understand viru
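
The editing and tokenizing steps described above can be sketched in a few lines. This is a simplified illustration rather than our hackathon code: it uses only the standard library, the stop-word list is a short hypothetical one, and `naive_stem` is a toy stand-in for a real stemmer such as the ones nltk provides.

```python
import re

# Hypothetical stop-word list for illustration; a real pipeline would use a
# much fuller list of general words.
STOP_WORDS = {"is", "at", "the", "a", "an", "of", "to", "in", "and"}

def naive_stem(word):
    """Toy stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop non-alphanumeric characters and stop words, then stem."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

def bigrams(tokens):
    """Pair each keyword with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = preprocess("He is great at playing football!")
print(tokens)           # ['he', 'great', 'play', 'football']
print(bigrams(tokens))  # [('he', 'great'), ('great', 'play'), ('play', 'football')]
```

The toy stemmer is far cruder than the one that produced the article above (it would not turn "summary" into "summari", for instance), but the overall shape of the pipeline is the same: clean, filter, stem, then pair up neighbouring keywords.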

Apart from learning about machine learning, we also learned what it was like to do a Hackathon remotely! There were some obvious disadvantages, such as the extra barrier to communication, but there were also surprising benefits. Since everyone was remote, there was no limit on who could work together; our team included people working in different countries, who never could have worked together if the hackathon had taken place in person. Also, working from home meant there was no need to spend long hours in the office without being able to nap, shower, or cook. The hackathon was a lot less tiring than it could have been, since we could take frequent breaks in our own homes throughout. Overall, it was an experience we’d recommend.

