Financial Sentiment Analysis Using Machine Learning Approach

In this project, we demonstrate financial sentiment analysis using machine learning with the TF-IDF and Bag of Words approaches covered in our previous lessons.

What is Financial Sentiment Analysis?

Financial Sentiment Analysis (FSA) is the task of classifying financial text or news as positive, negative or neutral, which gives a direct indication of a bullish or bearish view of the financial market.

Challenges in Financial Sentiment Analysis

Financial sentiment analysis is a challenging task: building machine learning models requires large-scale training data, and labelling financial text is difficult because it requires expert knowledge.

Another major challenge in FSA is the cost of mistakes. Sentiment analysis of movie reviews, product reviews, customer feedback or social media posts mostly involves aggregating fairly straightforward opinions, and a certain amount of misclassification does not make much difference. In financial applications, by contrast, a single mistake may cause huge losses, so we should be very careful in handling exceptional cases.

After reading this article you will be able to classify financial texts into positive, negative or neutral sentiments by training Multinomial Naïve Bayes and Support Vector Machine models, and you will understand the performance difference between the TF-IDF and Bag of Words approaches to text representation.

About Dataset

In this project, we use the Financial PhraseBank dataset, which consists of English financial news headlines about companies listed on the OMX Helsinki exchange. The dataset was first introduced in the research paper “Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts”.

The dataset was labelled by a team of experts in finance, economics and accounting, who classified each sentence as having a positive, negative or neutral effect on the company’s stock price.

Some of the sample sentences from the dataset are shared below:

“Cash Flow from Operations for the most recent quarter also reached a eight year low” – Negative Sentiment

“The cooperation will double The Switch ‘s converter capacity” – Positive Sentiment

“There have not been previous share subscriptions with 2004 stock options” – Neutral Sentiment

From the sample sentences above, it is evident that financial sentiment analysis is a complex task, as the financial domain uses its own vocabulary to express each sentiment.

In this project, we evaluated multiple machine learning models and found that the Multinomial Naïve Bayes and Support Vector Machine algorithms performed better in terms of classification accuracy and recall. So, in this article we will train these two models with both the bag of words and TF-IDF methods of text representation.

Importing Python Libraries

In the first step, we import all the Python libraries required for visualization, data cleaning, model building and evaluation.

# load all necessary libraries
import pandas as pd
pd.set_option('max_colwidth', 100)
import numpy as np
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# libraries for nlp task
import nltk, re, string
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
#machine learning
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import  accuracy_score, f1_score, precision_score,confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split,cross_val_score,KFold
# filtering warnings
import warnings
warnings.filterwarnings('ignore')

Read Dataset

After importing the libraries, we load the dataset using the pandas.read_csv() method.

# load data
df = pd.read_csv("finbank_data.csv")
df.head()

Sample records of the Financial PhraseBank dataset

After reading the data and checking the sample records, we inspect the structure of the dataset by executing the df.info() command in pandas.

As the df.info() output shows, the dataset has 14,780 records, all with non-null entries.

Exploratory Data Analysis (EDA)

In this step, we will analyze the dataset by extracting important information from the financial text.

Firstly, we have to check the distribution of the data.

Target distribution

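We can easily plot the distribution of the target variable, i.e., label in our case, using the seaborn library as shown in the code below.

sns.set_style("dark")
# pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x=df.label)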

Target distribution of the Financial PhraseBank dataset

From the plot, it is visible that the dataset is highly imbalanced: the majority of records have neutral sentiment, whereas the positive and negative sentiments have fewer records. If you want to know more about the imbalanced data problem, you can check this link.

So, next we balance the dataset by downsampling the neutral and positive sentiment records to the level of the negative sentiment records, i.e., 2000, so that the classes are balanced and the model is not dominated by the majority class.

# taking a subset of the dataset by downsampling records to 2000
# .copy() gives independent frames, so later edits do not raise SettingWithCopyWarning
df_pos = df[df.label=='positive'].head(2000).copy()
df_neu = df[df.label=='neutral'].head(2000).copy()
df_neg = df[df.label=='negative'].copy()

# concatenating the datasets
df_final = pd.concat([df_pos, df_neg, df_neu], axis=0)

# checking distribution of target variable in final data
sns.set_style("dark")
sns.countplot(x=df_final.label)

Distribution of target variable in down sampled data

As we can see now we have a balanced distribution of all the classes in the dataset.
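Note that head(2000) keeps the first 2000 rows of each class rather than a random subset. A minimal sketch of random downsampling is shown here, assuming a frame with the same text and label column names; the helper downsample_to_smallest and the toy data are hypothetical and not part of the original project.

```python
import pandas as pd

def downsample_to_smallest(frame, label_col="label", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # concatenate the per-class samples and shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# toy data (hypothetical): 6 neutral, 3 positive, 2 negative rows
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(11)],
    "label": ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 2,
})
balanced = downsample_to_smallest(toy)
print(balanced["label"].value_counts())
```

Fixing random_state keeps the sample reproducible between runs.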

Next we have to shuffle the dataset, as we have concatenated different subsets of the data and the rows are therefore ordered by class. To shuffle, we reindex the data frame using a NumPy random permutation.

df_final = df_final.reindex(np.random.permutation(df_final.index))
df_final.head(10)
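An equivalent way to shuffle, sketched here on a hypothetical toy frame, is DataFrame.sample with frac=1. Passing random_state makes the order reproducible, and reset_index(drop=True) gives the shuffled rows a clean 0..n-1 index.

```python
import pandas as pd

# toy frame (hypothetical) ordered by class, like the concatenated data
frame = pd.DataFrame({
    "text": list("abcdef"),
    "label": ["positive"] * 3 + ["negative"] * 3,
})

# frac=1 samples every row exactly once, i.e. a random permutation of the rows
shuffled = frame.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```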

After shuffling the dataset, we will calculate the length of the text so that we can do univariate analysis based on different sentiments.

# compute the length from df_final itself rather than relying on index alignment with df
df_final['length'] = df_final['text'].apply(len)
df_final.head()

Next we will check the overall distribution of the length of the texts in the dataset.

df_final.length.describe()

As we can see, the longest message in the dataset has 301 characters whereas the shortest has 2. The average length of the messages is 126 characters.

Now let’s plot the histogram of text length for each sentiment.

df_final.hist(column='length', by='label', bins=50,figsize=(10,4))

Histogram showing distribution of the length of messages for all the three sentiments

From the above distribution, we can observe that positive sentiment messages are usually longer, with a dense distribution between lengths 50 and 220, whereas the majority of shorter messages (length <= 50) belong to the negative sentiment.
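The same comparison can be made numerically with a per-class summary. The following is a small sketch on hypothetical toy sentences, assuming the same text and label column names as above.

```python
import pandas as pd

# toy sentences (hypothetical), one row per headline
frame = pd.DataFrame({
    "text": ["Profit rose", "Sales fell sharply", "No change", "Net profit rose again"],
    "label": ["positive", "negative", "neutral", "positive"],
})
frame["length"] = frame["text"].str.len()

# minimum, mean and maximum text length for each sentiment class
summary = frame.groupby("label")["length"].agg(["min", "mean", "max"])
print(summary)
```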

Word cloud

Next, we plot word clouds to understand the class-wise distribution of words in the dataset. Before plotting, we need to remove stop words so that the word clouds show only significant words.

# creating list of stop words
stop = set(stopwords.words('english'))
punctuation = list(string.punctuation)
stop.update(punctuation)
# Removing stop words, which are unnecessary, from the financial text
def remove_stopwords(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)
df_pos['text']=df_pos['text'].apply(remove_stopwords)
df_neg['text']=df_neg['text'].apply(remove_stopwords)
df_neu['text']=df_neu['text'].apply(remove_stopwords)

# plotting Positive sentiment word cloud
plt.figure(figsize = (10,12)) # text with positive sentiment
wc = WordCloud(max_words = 500 , width = 500 , height = 300).generate(" ".join(df_pos.text))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Positive Sentiment')

# plotting Negative sentiment word cloud
plt.figure(figsize = (10,12)) # text with negative sentiment
wc = WordCloud(max_words = 500 , width = 500 , height = 300).generate(" ".join(df_neg.text))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Negative Sentiment')

# plotting Neutral sentiment word cloud
plt.figure(figsize = (10,12)) # text with neutral sentiment
wc = WordCloud(max_words = 500 , width = 500 , height = 300).generate(" ".join(df_neu.text))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Neutral Sentiment')

Word cloud of Positive, Neutral and Negative sentiment data (from clockwise direction)

As we can see from the word clouds above, the positive sentiment contains high-frequency words such as operating profit, EUR mn, net sales, profit rose, increased and net profit, whereas the negative sentiment contains words such as operating profit, net sales, decreased EUR and EUR mn.

Data Cleaning and Data preprocessing

In this step, we will lowercase all the words, remove the stop words, tokenize the text, perform lemmatization, and remove all non-alphabetic characters from the text. The code for the above-mentioned task is shared below:

lemma = WordNetLemmatizer()
# list of English stop words from the nltk library
stop = stopwords.words('english')
def cleanText(txt):
    # lowercasing
    txt = txt.lower()
    # tokenization
    words = nltk.word_tokenize(txt)
    # removing stop words and lemmatizing the remaining words
    text = ' '.join([lemma.lemmatize(word) for word in words if word not in stop])
    # replacing non-alphabetic characters with spaces
    txt = re.sub('[^a-z]', ' ', text)
    return txt
# applying the cleanText function to the text column
df_final['cleaned_text'] = df_final['text'].apply(cleanText)
df_final.head()

From the above output, we can observe that we have cleaned the text and created a separate column named cleaned_text.

Next, we create the feature and target variables for building the machine learning models as shown in the code below, where X denotes the textual feature and y denotes the class label, i.e., the sentiment.

X=df_final.cleaned_text
y = df_final.label

Train Test Split

In this step, we divide the dataset into a training set and a test set in the ratio of 80:20, i.e., 80% for training the machine learning models and 20% for testing them. We also apply stratification, which keeps the proportion of each sentiment the same in both the training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20,stratify=y, random_state=0)
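To see what stratify does, here is a small sketch on hypothetical toy labels with a 50/30/20 class mix. The variable names and data are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy labels (hypothetical): 50 positive, 30 negative, 20 neutral
labels = pd.Series(["positive"] * 50 + ["negative"] * 30 + ["neutral"] * 20)
texts = pd.Series([f"sentence {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=0
)

# both splits keep the 50/30/20 class proportions of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without stratify, the class proportions of the two splits would vary with the random seed.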

Bag of Words vectorization

After the train-test split, we use the bag of words method to vectorize our textual data, which is possible in Python using the CountVectorizer class from sklearn.

count_vectorizer = CountVectorizer()
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
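As a small illustration of what CountVectorizer produces, the sketch below fits it on three hypothetical toy phrases. Each row of the resulting matrix is a document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents (hypothetical)
docs = ["profit rose", "profit fell", "net sales rose"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Word order within a document is lost in this representation, which is why it is called a bag of words.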

TF-IDF Vectorization

Next, we will use TF-IDF vectorizer to vectorize our textual data so that later we can compare the performance of machine learning models based on these two approaches i.e., TF-IDF and Bag of Words model.

# bigrams
tfidf_vectorizer = TfidfVectorizer( max_df=0.8, ngram_range=(1,2))
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

In the above code, we set two parameters: max_df=0.8 and ngram_range=(1,2). The former makes the vectorizer ignore terms that appear in more than 80% of the documents, whereas the latter allows both unigrams and bigrams as features. For learning the parameters in detail check here.
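The effect of these two parameters can be checked on a small example. This is a sketch on hypothetical toy documents; it uses max_df=0.7 rather than 0.8 so that the word "net", which occurs in 4 of the 5 documents, lies clearly above the threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy documents (hypothetical): "net" is in 4 of 5 documents, "sales" in 3 of 5
docs = [
    "net sales rose",
    "net sales fell",
    "net profit rose",
    "net loss widened",
    "sales fell",
]

# drop terms found in more than 70% of documents; keep unigrams and bigrams
vec = TfidfVectorizer(max_df=0.7, ngram_range=(1, 2))
vec.fit(docs)
terms = set(vec.vocabulary_)

print("net" in terms)        # unigram above the document-frequency threshold
print("sales" in terms)      # unigram below the threshold
print("net sales" in terms)  # bigram, filtered on its own document frequency
```

Bigrams are filtered by their own document frequency, so "net sales" can survive even though "net" alone is dropped.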

Next we will build our machine learning models.

Machine Learning Modeling

In this step, we will fit machine learning models i.e., Multinomial Naïve Bayes and Support Vector Machine on both bag of words and TF-IDF vectorized data.

# Multinomial Naive bayes bag of words model
mnb_bow = MultinomialNB()
mnb_bow.fit(count_train, y_train)

# Multinomial Naive bayes tf-idf  model
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(tfidf_train, y_train)

#SVM bag of words model
svm_bow =SVC(probability=True,kernel='linear')
svm_bow.fit(count_train, y_train)

# SVM tf-idf model
svm_tfidf = SVC(probability=True,kernel='linear')
svm_tfidf.fit(tfidf_train, y_train)

After fitting the machine learning models on both the TF-IDF and Bag of Words vectorized data, we will proceed to cross-validation.

Cross validation

In k-fold cross-validation, we split the training data into multiple folds, train the model on all folds but one, and evaluate it on the held-out fold, repeating this so that each fold is used once for evaluation. Cross-validation helps in detecting overfitting.

# 10-folds cross validation
kfold = KFold(n_splits=10)
scoring = 'accuracy'
acc_mnb = cross_val_score(estimator = mnb_bow, X = count_train, y = y_train, cv = kfold,scoring=scoring)
acc_svm = cross_val_score(estimator = svm_bow, X = count_train, y = y_train, cv = kfold,scoring=scoring)
acc_mnb_tfidf = cross_val_score(estimator = mnb_tfidf, X = tfidf_train, y = y_train, cv = kfold,scoring=scoring)
acc_svm_tfidf = cross_val_score(estimator = svm_tfidf, X = tfidf_train, y = y_train, cv = kfold,scoring=scoring)


# compare the average 10-fold cross-validation accuracy
crossdict = {
    'MNB-BoW': acc_mnb.mean(),
    'SVM-BoW': acc_svm.mean(),
    'MNB-tfidf': acc_mnb_tfidf.mean(),
    'SVM-tfidf': acc_svm_tfidf.mean(),
}

cross_df = pd.DataFrame(crossdict.items(), columns=['Model', 'Cross-val accuracy'])
cross_df = cross_df.sort_values(by=['Cross-val accuracy'], ascending=False)
cross_df

From the above results, it is evident that SVM with TF-IDF achieved the highest average 10-fold cross-validation accuracy of 88.46%, whereas SVM with bag of words scored about 2.5 percentage points lower, i.e., 85.95%.
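In the cross-validation above, the vectorizers were fitted on the whole training set beforehand, so the vocabulary and IDF weights also reflect the rows of each validation fold. One way to avoid this, sketched here on hypothetical toy data, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline so that both are refitted inside every fold. The texts, labels and the choice of 3 stratified folds are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# toy data (hypothetical): three sentences per sentiment class
texts = [
    "operating profit rose", "net sales increased", "profit rose sharply",
    "operating profit fell", "net sales decreased", "loss widened sharply",
    "the company is based in helsinki", "the meeting will be held in march",
    "shares are listed in helsinki",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# the vectorizer is fitted only on the training part of each fold
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

StratifiedKFold also keeps the class proportions equal across folds, which plain KFold does not guarantee.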

Model Evaluation

In this step, we evaluate both machine learning models on the test set using several performance metrics: accuracy, precision, sensitivity (recall), F1-score, and the confusion matrix.
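The evaluation code in this section calls plot_confusion_matrix, which is not defined anywhere in the article. A minimal sketch of such a helper is given here, assuming it takes the confusion matrix and the class names as used in the calls; the body, the title and cmap parameters, and the toy matrix are assumptions, not the author's original implementation.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm, classes, title="Confusion matrix", cmap=plt.cm.Blues):
    """Draw a confusion matrix (rows: true labels, columns: predicted labels)."""
    cm = np.asarray(cm)
    fig, ax = plt.subplots(figsize=(5, 4))
    im = ax.imshow(cm, interpolation="nearest", cmap=cmap)
    fig.colorbar(im, ax=ax)
    ticks = np.arange(len(classes))
    ax.set_xticks(ticks)
    ax.set_xticklabels(classes, rotation=45)
    ax.set_yticks(ticks)
    ax.set_yticklabels(classes)
    # write each count into its cell, in white on dark cells for readability
    threshold = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center",
                color="white" if cm[i, j] > threshold else "black")
    ax.set_title(title)
    ax.set_ylabel("True label")
    ax.set_xlabel("Predicted label")
    fig.tight_layout()
    return ax

# toy matrix (hypothetical) with the same class order as the evaluation code
ax = plot_confusion_matrix([[5, 1, 0], [0, 4, 2], [1, 0, 6]],
                           classes=["positive", "negative", "neutral"])
```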

Multinomial Naïve Bayes – Bag of Words Model

pred_mnb_bow = mnb_bow.predict(count_test)
acc = accuracy_score(y_test, pred_mnb_bow)
# macro average: unweighted mean of the per-class scores
# (pos_label only applies to average='binary', so it is not passed here)
prec = precision_score(y_test, pred_mnb_bow, average='macro')
rec = recall_score(y_test, pred_mnb_bow, average='macro')
f1 = f1_score(y_test, pred_mnb_bow, average='macro')
cm = confusion_matrix(y_test, pred_mnb_bow, labels=['positive', 'negative','neutral'])
# plot_confusion_matrix is a custom plotting helper that must be defined before this cell runs
plot_confusion_matrix(cm, classes=['positive', 'negative','neutral'])

model_results =pd.DataFrame([['Multinomial Naive Bayes-BoW',acc, prec,rec,f1]],
               columns = ['Model', 'Accuracy','Precision', 'Sensitivity', 'F1 Score'])
model_results

The model achieved an accuracy of 79.89%, whereas sensitivity and precision are slightly higher than accuracy. The main issue with this model is that it classified 67 positive sentiment cases as negative and 63 as neutral, whereas it did pretty well in detecting negative sentiment cases.

Next, we define a function for evaluating a machine learning model so that we don’t have to write the same evaluation code again and again.

def evaluate_model(model,model_name,test_set,model_results):
    pred = model.predict(test_set)
    acc = accuracy_score(y_test, pred)
    # macro average: unweighted mean of the per-class scores
    prec = precision_score(y_test, pred, average='macro')
    rec = recall_score(y_test, pred, average='macro')
    f1 = f1_score(y_test, pred, average='macro')
    cm = confusion_matrix(y_test, pred, labels=['positive', 'negative','neutral'])
    plot_confusion_matrix(cm, classes=['positive', 'negative','neutral'])

    results = pd.DataFrame([[model_name,acc, prec,rec,f1]],
                   columns = ['Model', 'Accuracy','Precision', 'Sensitivity', 'F1 Score'])
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    model_results = pd.concat([model_results, results], ignore_index=True)
    return model_results
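The evaluation code uses average='macro', which computes the metric separately for each class and then takes the unweighted mean. A small sketch with hypothetical toy predictions makes this concrete for recall.

```python
from sklearn.metrics import recall_score

# toy labels and predictions (hypothetical), two cases per class
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

classes = ["positive", "negative", "neutral"]
# average=None returns one recall value per class, in the order of `labels`
per_class = recall_score(y_true, y_pred, average=None, labels=classes)
macro = recall_score(y_true, y_pred, average="macro")

print(dict(zip(classes, per_class)))
print(macro)  # equals the mean of the per-class recalls
```

Because every class gets equal weight, macro averaging is a reasonable choice for the balanced dataset used here.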

SVM – Bag of Words Model

model_results = evaluate_model(svm_bow,'SVM-BoW',count_test,model_results)
model_results

SVM with bag of words has done a pretty good job in comparison to the Naïve Bayes model, as it brings down the number of misclassifications for positive and neutral sentiment cases, and its sensitivity for negative sentiment is very high, with only 9 misclassified cases out of a total of 369.

Multinomial Naïve Bayes – TF-IDF Model

model_results = evaluate_model(mnb_tfidf,'MNB-TF-IDF',tfidf_test,model_results)
model_results

As we can see from the above result, the Naïve Bayes model with TF-IDF performed much better than the bag of words model and attained an accuracy of 86.05%, which is about 6 percentage points higher than the bag of words model.

At last, we will check the performance of SVM on TF-IDF vectorized data.

SVM – TF-IDF Model

model_results = evaluate_model(svm_tfidf,'SVM-TF-IDF',tfidf_test,model_results)
model_results

From the above results, it is evident that SVM with TF-IDF performed best among all the models, with an accuracy of 90.76% and a sensitivity of more than 90% for each of the three classes.

Next, we will use our best model, i.e., SVM with TF-IDF, to make predictions on sample sentences.

Sample Prediction

To make predictions on new textual data, we first have to preprocess the text and then transform it into vectorized form using the fitted TF-IDF vectorizer.

def predict_text(lst_text,model):
    df_test = pd.DataFrame(lst_text, columns = ['test_sent'])
    # apply the same preprocessing that was used for the training data
    df_test["test_sent_cleaned"] = df_test["test_sent"].apply(cleanText)
    # transform the cleaned text (not the raw input) into TF-IDF vectorized format
    tfidf_bigram = tfidf_vectorizer.transform(df_test["test_sent_cleaned"])
    # model prediction
    prediction = model.predict(tfidf_bigram)
    # saving the prediction in the dataframe by creating a prediction column
    df_test['prediction'] = prediction
    # subset the dataframe to include only the original sentences and the model prediction
    df_test = df_test[['test_sent','prediction']]
    return df_test

sentences = [
    "Operating profit declined by 27 % to EUR 579.8 mn from EUR 457.2 mn in 2006",
    "This new partnership agreement achieved a significant milestone for both parties",
    "For around 3 years business remains same",
]

predict_text(sentences,svm_tfidf)

As we can see from the above output, the model has correctly predicted the negative, positive and neutral sentiment of the sample sentences.

Conclusion

In this project, we built machine learning models for predicting the sentiment of financial text. We used the Financial PhraseBank dataset and compared the bag of words and TF-IDF vectorization approaches. In the end, we found that SVM with the TF-IDF vectorizer outperformed the other models and attained more than 90% accuracy, precision and recall. We then used the best model for sample prediction, where it predicted all three sentiments correctly.