Ayshwarya Srinivasan, Vivek Sahoo, Anjali Shalimar and Digvijay Kawale
Sarcasm is a pervasive linguistic phenomenon in online documents that express subjective and deeply felt opinions. Detecting sarcasm is of great importance and benefits many NLP applications, such as sentiment analysis, opinion mining and advertising.
Another important application is conversational AI, which at its current stage does not fully emulate human conversation. Conversational AI systems such as 'Siri' on Apple devices, the 'Google Assistant' on Google devices and Amazon's 'Alexa' are good at simple tasks like setting alarms and reminders, but they lack the real feel of human conversation. In particular, they are very weak at detecting sarcasm, which makes it a topic of interest for the further development of such products.
With this motivation, we explore this space through a project on 'Sarcasm Detection in News Headlines'. Sarcastic writing is common in news headlines, which gives us an environment in which to study one specific application; news headline data provides a concise and direct application of our topic of interest.
The news headline data was sourced from Kaggle. The dataset contains headlines collected from two news sources, The Onion and HuffPost. The Onion is known for sarcastic headlines, which makes it a good source of data for this project; the headlines collected from HuffPost are non-sarcastic. Using a dataset with a proper mix of sarcastic and non-sarcastic headlines helps us build models that do not overfit. The Kaggle data is in the format shown below:
| Column Name | Column Description |
| --- | --- |
| is_sarcastic | Flag to classify headlines as sarcastic or non-sarcastic: 1 for sarcastic headlines, 0 for non-sarcastic headlines. |
| headline | Headline of the news article. |
| article_link | Link to the original news article. Useful for collecting supplementary data. |
The data contains a total of 26,709 observations.
Data cleaning for this dataset involves the standard steps of natural language processing. The steps are outlined below, with a sketch of the full pipeline after the list:
Removing Punctuation: All punctuation marks are first removed from both the sarcastic and non-sarcastic headlines. In natural language processing, words carry more importance than punctuation, so removing it keeps the focus on the different words used in the headlines.
Tokenization: Tokenization is the process of breaking a stream of text up into words, phrases, symbols and other meaningful elements called tokens. By tokenizing, we convert natural language into logical entities that programs can process. Words from both sarcastic and non-sarcastic headlines are tokenized using the 'nltk' package in Python.
Removing Stopwords: A stopword is a commonly used word such as 'a', 'an' or 'the' that carries no sentiment and will not contribute to our modeling process. We therefore remove all stopwords from both the sarcastic and non-sarcastic headlines.
Creating single lists of sarcastic and non-sarcastic headline words: After the preprocessing done in the previous steps, we save all the remaining words in one list of sarcastic headline words and one list of non-sarcastic headline words, to help with further analysis.
Lemmatization: A lemma is the base form of a word; for example, studied, studies and studying all have the base form study. We convert all the words in both lists to their lemmas.
Bag of Words: A bag of words is a simplifying representation used in natural language processing, aimed at extracting features from the words. We build one bag of words for sarcastic headlines and one for non-sarcastic headlines.
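As a concrete reference, here is a minimal sketch of this cleaning pipeline using nltk and pandas. The file name and the one-time nltk downloads are assumptions; the column names follow the schema above.

```python
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_headline(headline):
    """Remove punctuation, tokenize, drop stopwords, and lemmatize."""
    headline = headline.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(headline.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

# 'Sarcasm_Headlines_Dataset.json' is the assumed name of the Kaggle file
df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)

# One flat list of words per class, used later for word counts and word clouds
sarcastic_words = [w for h in df[df.is_sarcastic == 1].headline for w in clean_headline(h)]
non_sarcastic_words = [w for h in df[df.is_sarcastic == 0].headline for w in clean_headline(h)]
```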
The data was split into a 70% training set and a 30% testing set: the training set has 18,696 observations and the testing set has 8,013 observations. The seed used for the training/testing split is 15782. All exploratory data analysis and model fitting are done using the training set; the testing set is used to evaluate model performance.
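A minimal sketch of this split, assuming the df DataFrame from the cleaning sketch above and scikit-learn:

```python
from sklearn.model_selection import train_test_split

# 70/30 split with the seed reported above
train_df, test_df = train_test_split(df, test_size=0.30, random_state=15782)
print(len(train_df), len(test_df))  # 18696 8013 on the 26,709-row dataset
```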
Word count comparison: The bar graph below shows the most frequently occurring words in news headlines.
Proportion of sarcastic vs. non-sarcastic: 44% of headlines are sarcastic, while the remaining 56% are non-sarcastic. The figure below shows a pie chart of these proportions.
Word clouds for sarcastic and non-sarcastic news headlines: A word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance. Below are the word clouds for sarcastic and non-sarcastic news headlines; a sketch for generating them follows the figure captions. We can see that man, new, area and report are the most frequently used words in sarcastic headlines, while words like trump, year, new and donald are the most frequent in non-sarcastic headlines.
Fig. 5.2 Proportion of Sarcastic and Non-Sarcastic Headlines
Fig. 5.3 Word Cloud – Sarcastic Headlines
Fig. 5.4 Word Cloud – Non-Sarcastic Headlines
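A minimal sketch of how such word clouds can be generated from the word lists built during cleaning, assuming the wordcloud and matplotlib packages:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One cloud per class, from the flat word lists created during cleaning
for words, title in [(sarcastic_words, 'Sarcastic Headlines'),
                     (non_sarcastic_words, 'Non-Sarcastic Headlines')]:
    cloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words))
    plt.figure()
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
plt.show()
```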
We now move to the modeling part of the project, with the aim of building a tool that can detect sarcasm in sentences. We process the data to meet our model's input requirements, as described below:
Removing punctuation: Most text contains punctuation. In detecting sarcasm, its presence does not necessarily improve model performance, so we strip the data of all punctuation.
Removing digits: We are going to vectorize our data and convert the strings to numbers. Digits in the text would not help identify tone, and pre-existing digits might interfere with the vectorization process, so all numbers are removed as well.
Converting to lower case: Converting the text to lower case helps make the data uniform.
Removing stop words: As in most natural language data, headlines contain stop words in abundance; they generally provide no valuable information during classification, so they are usually removed.
Lemmatization: Lemmatization is the process by which any inflected version of a word is converted to its base word so that all forms of a word are treated the same.
Vectorization and padding: Vectorization is the process by which words are mapped to numeric vectors. For the LSTM model, all inputs must be the same length, so we pad the vectors with zeros to ensure uniformity, as sketched below.
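A minimal sketch of the vectorization and padding step using the Keras tokenizer; the vocabulary size and maximum sequence length are assumed values:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS = 10000  # assumed vocabulary size
MAX_LEN = 25       # assumed maximum sequence length

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_df.headline)

# Map each headline to a sequence of word indices, then zero-pad to a fixed length
X_train = pad_sequences(tokenizer.texts_to_sequences(train_df.headline), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_df.headline), maxlen=MAX_LEN)
y_train, y_test = train_df.is_sarcastic.values, test_df.is_sarcastic.values
```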
We will build a Keras model for this dataset. The steps involved in building the Keras model are outlined below, followed by a sketch of the architecture:
Embedding layer: The Embedding layer is used to create word vectors for incoming words. It sits between the input and the LSTM layer, i.e. the output of the Embedding layer is the input to the LSTM layer.
LSTM Layer: The LSTM transforms the vector sequence into a single vector containing information about the entire sequence.
Intermediate Layer: There is a Dense intermediate layer with 64 neurons and a relu activation function.
Output Layer: The final output we want from this model is whether the headline is sarcastic or not. So, we want to perform classification. The output layer’s activation function is thus sigmoid.
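A sketch of this architecture in Keras, reusing the constants from the vectorization sketch. The embedding dimension and number of LSTM units are assumptions; the 64-neuron relu layer and the sigmoid output follow the description above, and the optimizer choice is revisited during tuning:

```python
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

EMBED_DIM = 100  # assumed embedding dimension
LSTM_UNITS = 64  # assumed number of LSTM units

model = Sequential([
    # Embedding layer: word indices -> dense word vectors
    Embedding(input_dim=MAX_WORDS, output_dim=EMBED_DIM, input_length=MAX_LEN),
    # LSTM layer: vector sequence -> a single vector summarizing the headline
    LSTM(LSTM_UNITS),
    # Intermediate dense layer with 64 neurons and relu activation
    Dense(64, activation='relu'),
    # Output layer: sigmoid for the binary sarcastic / non-sarcastic decision
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```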
We will first build a base model with the following processing:
· Remove punctuation and digits, and convert the text to lowercase
· Remove all stopwords
· Perform lemmatization
· Perform tokenization and pad the resulting sequences (pre-padding)
We observed an accuracy of 84.7% with this base model.
We next build a model very similar to the base model, with one change: after tokenization, we apply post-padding instead of pre-padding.
Example (maximum length 8):
Before padding: [234, 5, 67, 12]
Pre-padding: [0, 0, 0, 0, 234, 5, 67, 12]
Post-padding: [234, 5, 67, 12, 0, 0, 0, 0]
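The same example with the Keras pad_sequences function, where the padding argument switches between the two modes:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seq = [[234, 5, 67, 12]]
print(pad_sequences(seq, maxlen=8, padding='pre'))   # [[  0   0   0   0 234   5  67  12]]
print(pad_sequences(seq, maxlen=8, padding='post'))  # [[234   5  67  12   0   0   0   0]]
```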
Surprisingly, we see a huge drop in accuracy, from 85% to 55%. The reason is that we are building a Long Short-Term Memory model: with pre-padding, the useful content sits at the end of the sequence and is therefore the most recent information the model takes in. It is still in memory when the prediction is made, which results in a better model. We proceed with pre-padded sequences for all future models.
We now build a modification of our base model to examine the effect of lemmatization. Lemmatization is the process by which any inflected version of a word is converted to its base word so that all forms of a word are treated the same. We want to see whether inflection plays a role in how well our model detects sarcasm.
We observed that not performing lemmatization improves the accuracy, but only slightly, from 85% to 85.6%.
We next test the effect of keeping stop words in the model. In general, NLP models remove stop words, but our theory is that stop words might actually help identify sarcasm in a sentence. We want to see whether keeping them improves the efficiency of our model at detecting sarcasm.
The results show that accuracy did not increase much over the base model.
Hyperparameter optimization (or tuning) is the process of choosing a set of optimal hyperparameters for a machine learning algorithm. Data preprocessors, optimizers and ML algorithms all receive parameters that guide their behavior, and achieving optimal performance requires tuning them to the statistical properties, the feature types and the size of the dataset. The most typical hyperparameters in deep learning include the learning rate, the number of hidden layers in a deep neural network, the batch size, dropout, etc. In NLP we also encounter hyperparameters to do with preprocessing and text embedding, such as the type of embedding, the embedding dimension and the number of RNN layers. The hyperparameters we consider are listed below; we use the grid search method to select the best values, as sketched after the list.
· Layer Activation
· Number of Epochs
· Optimizer
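A minimal sketch of the grid search as a plain loop (a scikit-learn wrapper could be used instead). The candidate values other than the epoch counts are illustrative assumptions, and build_model is a hypothetical helper that rebuilds the network sketched earlier:

```python
from itertools import product

from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

def build_model(activation, optimizer):
    """Rebuild the Embedding-LSTM network with the given hyperparameters."""
    m = Sequential([
        Embedding(input_dim=MAX_WORDS, output_dim=EMBED_DIM, input_length=MAX_LEN),
        LSTM(LSTM_UNITS),
        Dense(64, activation=activation),
        Dense(1, activation='sigmoid'),
    ])
    m.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return m

# Evaluate every combination in the grid and record its testing accuracy
results = {}
for activation, optimizer, epochs in product(['relu', 'tanh'], ['adam', 'rmsprop'], [2, 5, 10]):
    m = build_model(activation, optimizer)
    m.fit(X_train, y_train, epochs=epochs, verbose=0)
    results[(activation, optimizer, epochs)] = m.evaluate(X_test, y_test, verbose=0)[1]

best = max(results, key=results.get)
print('best (activation, optimizer, epochs):', best, '-> accuracy:', results[best])
```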
Results from Hyperparameter Tuning:
From the grid search we get the following results:
1. The best optimizer for our model is adam
2. The best activation to use is relu
We also try various epoch values: 2, 5 and 10. The table below shows the results for each.
| Epoch Value | Training Accuracy | Testing Accuracy |
| --- | --- | --- |
| 2 | 91.9% | 85.6% |
| 5 | 98.7% | 82.8% |
| 10 | 99.3% | 83.3% |

Table: Training and Testing Accuracy for Different Epochs
We see that for more than 2 epochs, the training accuracy increases but the testing accuracy goes down. This could be because of overfitting.
Based on the training and testing accuracies, the final model we build is a Keras LSTM model with the following configuration:
Preprocessing: remove stop words, punctuation and digits; convert to lower case
Epochs: 2
Activation: relu
Output Activation: sigmoid
Optimizer: adam
Training accuracy: 91.9%
Testing accuracy: 85.6%
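Pulling these settings together, a minimal sketch of the final training run, reusing the build_model helper and the padded data from the sketches above:

```python
# Final model with the hyperparameters selected above
final_model = build_model(activation='relu', optimizer='adam')
final_model.fit(X_train, y_train, epochs=2)

train_acc = final_model.evaluate(X_train, y_train, verbose=0)[1]
test_acc = final_model.evaluate(X_test, y_test, verbose=0)[1]
print(f'training accuracy: {train_acc:.1%}, testing accuracy: {test_acc:.1%}')
```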