What is sentiment analysis?
Sentiment analysis, in a nutshell, is used to predict whether a text is negative, neutral, or positive about a certain topic without having to read the full text. With the development of various Natural Language Processing (NLP) libraries, sentiment analysis has become an interesting area of exploration. So far, tweets and product / movie reviews are the most common targets of sentiment analysis. By applying sentiment analysis to product reviews, retailers can get a good sense of how well a product is liked among users without having to read all of their feedback. Similarly, you can apply sentiment analysis to a collection of tweets addressing a target topic in order to get a general idea of the public’s attitude toward that topic.
Sentiment analysis on news articles
For a recent project I had to build a sentiment analysis model to predict sentiment about a ‘target company’ based on news articles about the company. Unfortunately, most of the existing libraries for sentiment analysis focus on analysing shorter, less complex texts, typically tweets or product reviews. Building a sentiment analysis model for news articles is more complicated than building one for tweets or product reviews: while most tweets and product reviews focus on a single topic, a news article may address several different themes, views and opinions alongside the ‘target topic’ we want to focus on. Here is a good example:
An article discussed the problem of mining company workers getting ‘Black Lung Disease’ from their work environment, and how terrible it is. The article contains examples of miners struggling with the disease and details of how many mining companies are denying responsibility. Toward the end, it mentions that the special tax which mining companies have been paying to support those with black lung disease could be reduced due to a new law that President Trump is enforcing. Now, the ‘target company’ I was looking for in this article is one of the mining companies mentioned. At this point, I was quite confused. Technically, I should categorise this article as ‘positive’, as it is positive news for the mining company. However, the article is quite critical and negative, as it talks about people struggling to deal with illness and disease. That was the point at which I realised that building a sentiment analysis model for news articles is not at all straightforward.
Approaches for sentiment analysis
As you might have guessed, machine learning (ML) is one of the most common approaches to tackling sentiment analysis problems. Among the common ML methods, the ‘bag of words’ approach is one of the simplest, yet it performs well. The common process of the ‘bag of words’ approach for sentiment analysis is broadly as follows:
- Preprocess the text. For Python, NLTK is your best buddy here.
- Vectorise the text. N-grams are a popular approach; in my case, as I was focusing on financial sentiment, I created a frequency vector of financial terms.
- Apply normalisation to the vectors. TF-IDF is a common choice.
- Split the data into training and test sets, with the word vectors as input and the sentiment as output.
- Train and test using an appropriate ML model such as a Support Vector Machine (SVM).
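The vectorising and normalising steps above can be sketched in plain Python. This is only a minimal illustration, not the code from my project: the vocabulary `FINANCIAL_TERMS`, the function names and the regex tokeniser are all assumptions made for the example, and the IDF formula is one common smoothed variant.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary of financial terms; a real one would be much larger.
FINANCIAL_TERMS = ["profit", "loss", "growth", "debt", "lawsuit"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def tfidf_vectors(documents, vocabulary):
    """Weight raw term counts by a smoothed inverse document frequency."""
    counts = [term_frequencies(doc, vocabulary) for doc in documents]
    n_docs = len(documents)
    # Number of documents that contain each term at least once.
    doc_freq = [
        sum(1 for row in counts if row[i] > 0) for i in range(len(vocabulary))
    ]
    # Smoothing keeps the weight finite for terms that appear in no document.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return [[tf * weight for tf, weight in zip(row, idf)] for row in counts]


if __name__ == "__main__":
    articles = [
        "Profit rose and profit growth continued.",
        "The lawsuit increased debt and loss.",
    ]
    for vector in tfidf_vectors(articles, FINANCIAL_TERMS):
        print(vector)
```

Each resulting vector, paired with a sentiment label, would then go into the train / test split and be fed to a classifier such as an SVM.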
For my project, the challenge was in the first step, preprocessing. As mentioned earlier, most news articles include more than one topic or point of view, and in order to focus on the sentiment toward the “target topic”, you need to preprocess the article wisely. I applied several heuristics in order to filter and preprocess the text; for instance, if the first mention of the target company is towards the end of the article, it is more likely that the article is irrelevant to my focus.
Another issue involved the creation of a training set. For an ML approach like SVM, I needed a large set of training data with the ‘output’ labelled. In other words, I needed hundreds and thousands of articles to be read and labelled with their sentiment towards my target company.
Nonetheless, this approach is simple and straightforward.
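As a rough sketch of the first-mention heuristic, the snippet below measures how far into the article the company name first appears. The function names and the `threshold` value of 0.5 are assumptions chosen for illustration, not the values I used; a real filter would also have to handle aliases and abbreviations of the company name.

```python
def first_mention_position(article, company):
    """Return the relative position (0.0 to 1.0) of the first mention.

    Returns None when the article is empty or never mentions the company.
    """
    index = article.lower().find(company.lower())
    if not article or index == -1:
        return None
    return index / len(article)


def is_probably_relevant(article, company, threshold=0.5):
    """Treat an article as relevant if the company appears early enough."""
    position = first_mention_position(article, company)
    return position is not None and position <= threshold


if __name__ == "__main__":
    early = "Acme Mining reported its annual results today."
    late = "x " * 50 + "Acme Mining"
    print(is_probably_relevant(early, "Acme Mining"))
    print(is_probably_relevant(late, "Acme Mining"))
```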
Lexicon Based (Rule Based) Method
The lexicon based method, which can be more stable than an ML approach, applies a set of dictionary based rules to the text to identify its sentiment. There are several factors to consider for the lexicon based approach:
- Dictionary: a list of words with an appropriate sentiment value assigned to each one. You also need to consider the part of speech of each word (noun, adjective, verb, adverb, etc.).
- Intensifier: a list of words that can precede a word to ‘intensify’ the magnitude of its sentiment, e.g. in “This company is doing REALLY well this year.”, the word ‘REALLY’ intensifies the word ‘well’.
- Negation word: a list of words that can precede a word to negate its meaning.
- Text level features: for example, if a certain word appears more than ‘n’ times in the text, the value of its sentiment should be modified accordingly.
- Others: weighting of different parts of the text, normalisation, ignoring the sentiment of quotes, etc.
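To make the first three factors concrete, here is a toy scorer that combines a dictionary, intensifiers and negation words. The word lists and their values are invented for the example, and the rule that a modifier must sit directly before the sentiment word is a simplifying assumption; a real lexicon based system needs far larger lists and more careful rules.

```python
import re

# Hypothetical toy lexicons; the values are illustrative only.
SENTIMENT = {"well": 1.0, "good": 1.0, "bad": -1.0, "poorly": -1.0}
INTENSIFIERS = {"really": 1.5, "very": 1.5, "slightly": 0.5}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum the sentiment values of known words, adjusted by their modifiers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in SENTIMENT:
            continue
        value = SENTIMENT[token]
        j = i - 1
        # An intensifier directly before the word scales its magnitude.
        if j >= 0 and tokens[j] in INTENSIFIERS:
            value *= INTENSIFIERS[tokens[j]]
            j -= 1
        # A negation directly before the (possibly intensified) word flips it.
        if j >= 0 and tokens[j] in NEGATIONS:
            value = -value
        total += value
    return total


if __name__ == "__main__":
    print(lexicon_score("This company is doing REALLY well this year."))
    print(lexicon_score("The results were not very good."))
```

Note that this sketch misses a negation that is separated from the sentiment word, as in “is not doing well”, which is exactly the kind of case that extra rules have to cover.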
Although this rule based approach may sound simple, it has numerous aspects to consider very carefully.
As an example, have a look at the following two sentences:
A) The teacher inspired her students to pursue their dreams.
B) This movie was inspired by true events.
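The word ‘inspire’ clearly carries positive sentiment in the first sentence; however, it is neutral in the second sentence. A dictionary that assigns one fixed value to the word would therefore mis-score one of the two sentences.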
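One possible way to handle this pair is a phrase level exception that overrides the dictionary value in a given context. The sketch below is only an illustration: `WORD_SENTIMENT` and `NEUTRAL_PHRASES` are hypothetical names with made-up entries, and treating every “inspired by” as neutral is an assumption that will itself be wrong for some sentences.

```python
import re

# Hypothetical entries: a default value per word plus phrase level exceptions.
WORD_SENTIMENT = {"inspired": 1.0}
NEUTRAL_PHRASES = {("inspired", "by")}


def score_with_context(sentence):
    """Score a sentence, treating listed two-word phrases as neutral."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        value = WORD_SENTIMENT.get(token, 0.0)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        # Override the dictionary value when the word starts a neutral phrase.
        if (token, next_token) in NEUTRAL_PHRASES:
            value = 0.0
        total += value
    return total


if __name__ == "__main__":
    print(score_with_context("The teacher inspired her students to pursue their dreams."))
    print(score_with_context("This movie was inspired by true events."))
```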
As you can see, the Machine Learning and Lexicon Based approaches each have their own advantages and disadvantages. For an ML approach, creating an appropriate training set can be rather tedious work, but once you have a large enough training set, a simple model such as SVM can perform well. For a Lexicon Based Approach there is no need to create a large training set; however, it does require you to build many sophisticated heuristics and rules around the text in order to identify its sentiment accurately.