A Beginner’s Guide to Machine Learning Text Classification Using Python

Machine Learning Text Classification using Python


Text classification is a core branch of natural language processing in which pieces of text are sorted into specified categories. It could be separating negative product reviews from positive ones, classifying positive/negative/neutral sentiment, or sorting documents into different categories, e.g., economy, sports, food, etc.

In simple terms, whenever you have textual data and want it to be associated with a specific category, you should look for text classification techniques.

Text Classification Techniques

There are mainly three techniques used for such a task:

1. Manual Approach

In this approach, a human needs to read the textual data and then separate it into different categories based on their command of the language in question. An unnecessarily repetitive and long procedure I would avoid at all costs.

2. Rule-based Text Classification

In this approach, textual data is identified based on rules and keywords specific to each language. These rules/keywords are then used to sort the documents into different categories. A major flaw of this approach is the limited scope of rules and keywords in the face of the vast intricacies of any language. Any document that falls outside the scope of the rules will obviously be misclassified. Hence, not the approach I would bank my money on.

3. Machine Learning Text Classification

In this approach, you provide labeled training data to a machine learning model, which learns the patterns in the underlying data using mathematical algorithms. Once the model has been trained, you can feed it unseen textual data and have it predict the relevant category.

Out of the above approaches, machine learning based text classification is preferred and has become the industry norm for various natural language processing (NLP) tasks.

In this tutorial, you will learn how to classify textual tweets into disaster and non-disaster categories. We will first transform the text into mathematical features using Tf-Idf vectorization. Using these features, we will train a machine learning model on the labeled data and finally evaluate the accuracy of the trained model on the testing data.

This tutorial will feature the following tools/libraries to classify tweet data using 4 machine learning algorithms namely Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Gradient Boost classifiers:

  • Python
  • Jupyter Notebook
  • scikit-learn
  • pandas
  • seaborn
  • matplotlib
  • nltk

Let’s first make sense of the disaster tweet data…

Article Notebook

Code: Text Classification for Twitter Disaster dataset using Machine Learning and Tf-Idf Vectorization

Prepared By: Awais Naeem

Copyrights: www.embedded-robotics.com

Disclaimer: This code can be distributed with the proper mention of the owner’s copyrights

Notebook Link: https://github.com/embedded-robotics/datascience/blob/master/nlp_text_classification_tfidf_disaster_tweets/text_classification_disaster_tweets.ipynb

Disaster Tweet Dataset

This dataset contains the texts of tweets that could be associated with either a real disaster or a non-disastrous situation. It will be used to train a machine learning model for a binary classification problem: predicting which tweets are about real disasters and which ones are not.

Dataset Link: https://www.kaggle.com/competitions/nlp-getting-started/data

Input Features

  • id: a unique identifier for each tweet
  • text: the text of the tweet
  • location: the location the tweet was sent from
  • keyword: a particular keyword from the tweet

Output Features

target: binary – real disaster (1) or not (0)

Data Reading and Exploratory Analysis

To follow along with this tutorial, set up your Python environment with all the necessary libraries, i.e., pandas, NumPy, seaborn, matplotlib, nltk, sklearn. Let’s first import all the libraries:

import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import re
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

To use `nltk`, you will need to download some of the resources required for text preprocessing. You can download them at this stage, or later when a command fails because a resource is missing:

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("omw-1.4")

Download the disaster tweet dataset into the project folder. We will only focus on the train.csv file because it contains labels for the data. Read the dataset into a pandas data frame using pd.read_csv():

data = pd.read_csv("train.csv")
data.head()

Output:

id	keyword	location	text	target
0	1	NaN	NaN	Our Deeds are the Reason of this #earthquake M...	1
1	4	NaN	NaN	Forest fire near La Ronge Sask. Canada	1
2	5	NaN	NaN	All residents asked to 'shelter in place' are ...	1
3	6	NaN	NaN	13,000 people receive #wildfires evacuation or...	1
4	7	NaN	NaN	Just got sent this photo from Ruby #Alaska as ...	1

As our main focus is to classify the tweets based on the textual data and not on any other features, we will drop the columns id, keyword, and location:

data = data.drop(['id', 'keyword', 'location'], axis=1)

Looking at the textual data of the tweets, we can gather some useful insights which could differentiate the non-disaster tweets from the disaster ones.

Let’s first calculate the average number of words in both classes of tweets:

data['word_count'] = data['text'].apply(lambda x: len(str(x).split(' ')))
avg_words_disaster_tweets = data[data['target'] == 1]['word_count'].mean()
avg_words_non_disaster_tweets = data[data['target'] == 0]['word_count'].mean()
print('Average Words in Disaster Tweets:', avg_words_disaster_tweets)
print('Average Words in Non-Disaster Tweets:', avg_words_non_disaster_tweets)

Output:

Average Words in Disaster Tweets: 15.20116172424335
Average Words in Non-Disaster Tweets: 14.723859972362966

Similarly, we can calculate the average number of characters:

data['char_count'] = data['text'].apply(lambda x: len(str(x)))
avg_char_disaster_tweets = data[data['target'] == 1]['char_count'].mean()
avg_char_non_disaster_tweets = data[data['target'] == 0]['char_count'].mean()
print('Average Characters in Disaster Tweets:', avg_char_disaster_tweets)
print('Average Characters in Non-Disaster Tweets:', avg_char_non_disaster_tweets)

Output:

Average Characters in Disaster Tweets: 108.11342097217977
Average Characters in Non-Disaster Tweets: 95.70681713496084

Comparing the averages above, disaster tweets tend to be longer than non-disaster tweets: noticeably so in characters (about 108 vs. 96) and slightly so in words (about 15.2 vs. 14.7).

Now, we can also examine the word distribution of the tweets using a seaborn histogram plot:

plt.figure(figsize=(15,10))
sns.histplot(data=data, x='word_count', kde=True)
plt.title('Word Length in Tweets')
plt.xlabel('Word Length')
plt.ylabel('Tweets Count')
plt.show()
Words Distribution Histogram in Tweets Dataset

Looking at the word distribution of the tweets, we can infer that almost 99.5% of the tweets have a word count between 2 and 30 words. Hence, a tweet containing more than 30 words can be safely padded or truncated to 30 words to standardize the numerical features extracted from the text. I will get back to this later if needed.
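If you want to double-check that figure yourself, a quick one-liner on the word_count column computes the share of tweets falling in that range (a minimal sketch, assuming the data frame from above):

# Fraction of tweets whose word count falls between 2 and 30 (inclusive)
in_range = data['word_count'].between(2, 30).mean()
print(f"Tweets with 2-30 words: {in_range:.2%}")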

For now, let’s see whether our dataset is balanced across the two labeled classes:

data['target'].value_counts()

Output:

0    4342
1    3271
Name: target, dtype: int64
We can also visualize the class counts with a seaborn count plot:

plt.figure(figsize=(10,6))
sns.countplot(data=data, x='target')
plt.xlabel('Target Labels (Non-Disaster: 0, Disaster: 1)')
plt.ylabel('Tweet Count')
plt.show()
Disaster Tweet Count for Machine Learning Text Classification

The dataset does not seem perfectly balanced at this point, but it is balanced enough not to affect the model training significantly. For now, I will leave it as it is and advance to the textual pre-processing, numerical feature extraction, and model training phases.

Data Pre-processing

The disaster tweet data is primarily in textual form, and no machine learning model can understand it directly, since the underlying mathematical algorithms only process numerical features. So, we need to convert each textual tweet into a numerical feature.

Before we go from text to mathematical features, we need to perform the following cleaning operations on the text data for robust model training:

  1. Remove leading and trailing spaces from each tweet using strip()
  2. Convert all the tweets into lowercase using lower()
  3. Remove URLs from the text using a regular expression substitution re.sub()
  4. Remove numbers from the text using a regular expression substitution re.sub()
  5. Remove punctuation and special characters from the text using isalpha()
  6. Remove stop words from the text, e.g., the, which, where, and, is, etc., using nltk.corpus.stopwords()

All of the above text-cleaning operations are performed in a custom Python function, which is then applied to each tweet using the apply() method of the pandas data frame:

def clean_text(text):
    # Remove pre and post spaces
    text = str(text).strip()
    
    # Lower the text case
    text = str(text).lower()
    
    # Remove URL from text
    text = re.sub(r"\s+http\S+", r"", text)

    # Remove Numbers from text
    text = re.sub(r"\d+", r"", text)
    
    # Tokenize the text into individual words, with punctuation and special characters treated as separate tokens
    word_tokens = word_tokenize(text)
    
    # Remove punctuation and special characters
    word_tokens = [word for word in word_tokens if word.isalpha()]
    
    # Generate the set of English stop words
    stop_words = set(stopwords.words('english'))

    # Remove stopwords
    word_tokens = [word for word in word_tokens if word not in stop_words]
    
    return word_tokens

data['text'] = data['text'].apply(clean_text)

To remove the stop words, a list of English stop words is generated using nltk.corpus.stopwords(). The text is tokenized into separate words, and only those words which do not appear in the stop word list are kept in the final tokenized list.

After performing text cleaning, we have a tokenized list for each tweet. The final step is to lemmatize each token in the text using WordNetLemmatizer(), which converts each token based on its context within the text, e.g., Caring -> Care, Stripes (Verb) -> Strip (Verb), Stripes (Noun) -> Stripe (Noun), Walking -> Walk, Running -> Run, etc.

WordNetLemmatizer() needs the wordnet tags (wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV) for each word to perform the lemmatization.

We can use nltk.pos_tag() to generate the tag for each token and then write a wrapper function to convert the nltk tags into wordnet tags, i.e., get_wordnet_pos(tag):

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Now, we can define a function to lemmatize each word in the tokenized text and then return all the lemmatized tokens joined into a single string:

def lemmatize(word_list):
    wl = WordNetLemmatizer()
    # Tag each token with its part of speech
    word_pos_tags = pos_tag(word_list)
    lemmatized_list = []
    for tag in word_pos_tags:
        # Lemmatize each token using its mapped wordnet POS tag
        lemmatized_word = wl.lemmatize(tag[0], get_wordnet_pos(tag[1]))
        lemmatized_list.append(lemmatized_word)
    return " ".join(lemmatized_list)

Finally, we can lemmatize each tweet in our dataset using the apply() method of the pandas data frame:

data['text'] = data['text'].apply(lemmatize)

Now that we are done with the essential text cleaning of the tweet data, we need to convert the cleaned text into numerical features. But wait…

We first need to split our dataset into training data (80%) and testing data (20%) so that we can train our machine learning models on the training data and then evaluate the performance of the trained models on the test data:

data_X = data['text'].tolist()
data_y = data['target'].tolist()
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=0, shuffle=True)
print("X_train:", len(X_train))
print("X_test:", len(X_test))
print("y_train:", len(y_train))
print("y_test:", len(y_test))

Output:

X_train: 6090
X_test: 1523
y_train: 6090
y_test: 1523
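Since the classes are slightly imbalanced, you may also want to pass the stratify argument to train_test_split so that both splits preserve the class proportions of the full dataset; a minimal variation of the call above:

# Stratified split: both splits keep the same 0/1 ratio as data_y
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=0.2, random_state=0, shuffle=True, stratify=data_y)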

Bag-of-Words Vectorization

To convert text into numerical features, Bag-of-Words is the simplest approach, in which a vocabulary of tokens is extracted from the entire dataset.

A sparse matrix is then formed such that each unique token of the vocabulary represents an individual column of the matrix, whereas each row represents an individual document (tweet).

For each document, the total count of each unique token in the document serves as the numerical feature. In Python, such a transformation from text to token counts can be achieved using CountVectorizer() from sklearn:

count_vectorizer = CountVectorizer(analyzer = 'word', min_df = 10, max_df = 0.9)
X_train = count_vectorizer.fit_transform(X_train)
X_test = count_vectorizer.transform(X_test)

Here we have specified analyzer = 'word', which means that each token will be a word. We could also specify tokens to be single characters or n-grams within a specified min-max range.

min_df = 10 specifies that a token will only be considered if it appears in a minimum of 10 documents (tweets) from the dataset.

max_df = 0.9 specifies that a token should not be considered if it appears in more than 90% of the total documents (tweets). Such a limitation helps eliminate tokens that occur in nearly every document, since they do not play a significant role in the classification task.
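To get a feel for what these settings actually keep, you can inspect the learned vocabulary and the shape of the resulting sparse matrix (a quick check; get_feature_names_out() is available in scikit-learn 1.0+, older versions use get_feature_names()):

# Number of unique tokens kept after the min_df/max_df filtering
print('Vocabulary size:', len(count_vectorizer.get_feature_names_out()))
# Rows = tweets, columns = vocabulary tokens
print('Training matrix shape:', X_train.shape)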

This approach does convert textual data to numerical features, but it gives more importance to long documents during model training because they have a higher count for each token.

To tackle this issue, we can resort to Tf-Idf vectorization…

Tf-Idf Vectorization

In this approach, we calculate two terms for each token, namely Term Frequency (Tf) and Inverse Document Frequency (Idf):

Term Frequency (Tf) = No. of occurrences of a token in the document / Total tokens in the document

Inverse Document Frequency (Idf) = log(Total no. of documents in the corpus / No. of documents that contain the token)

Tf-Idf of each token is then calculated by multiplying Tf and Idf scores:

Tf-Idf = Tf * Idf

This Tf-Idf score ensures that the importance of a token is high when it occurs a lot in a given document and rarely in other documents.
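As a quick hand-worked example of these formulas (hypothetical numbers; note that sklearn’s implementation adds smoothing and normalization on top of this basic form), suppose the token “fire” occurs 2 times in a 10-token tweet and appears in 5 out of 100 tweets in the corpus:

Tf = 2 / 10 = 0.2

Idf = log(100 / 5) = log(20) ≈ 3.0

Tf-Idf = 0.2 * 3.0 ≈ 0.6

A token that appeared in nearly every tweet would instead get an Idf close to log(1) = 0, pushing its Tf-Idf score toward zero.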

In Python, we can find Tf-Idf scores for each token using TfidfTransformer() after we have calculated the counts for each token using CountVectorizer():

tfidf_transformer = TfidfTransformer()
X_train = tfidf_transformer.fit_transform(X_train)
X_test = tfidf_transformer.transform(X_test)

You can either use CountVectorizer() and TfidfTransformer() to calculate Tf-Idf scores, or you can directly use TfidfVectorizer(), since it combines both steps in a single method. I have used the split approach only so you can see the transformation of text into numerical features in a more intuitive way.
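For reference, the one-step equivalent would look like this (a sketch; here train_texts and test_texts stand for the raw cleaned tweet lists coming out of train_test_split, before any count vectorization):

from sklearn.feature_extraction.text import TfidfVectorizer

# Combines token counting and Tf-Idf weighting in a single step
tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=10, max_df=0.9)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts)
X_test_tfidf = tfidf_vectorizer.transform(test_texts)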

Now that we have successfully converted textual tweets into numerical features, let’s build our first binary-classification training model using Logistic Regression (LR) and evaluate how efficiently it can separate the disaster tweets from non-disaster tweets…

Logistic Regression for Text Classification

To train the Logistic Regression model on the disaster tweet data, we use LogisticRegression() from sklearn. First, the model is fitted on the training data with default parameters:

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

The model training may take some time depending on your computing resources. Once the model is trained, we can evaluate the performance of the trained model on the testing dataset:

lr_y_pred = lr_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, lr_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, lr_y_pred))
print('Classification Report:\n', classification_report(y_test, lr_y_pred))

Output:

Accuracy Score: 0.7905449770190414
Confusion Matrix:
 [[782 104]
 [215 422]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.88      0.83       886
           1       0.80      0.66      0.73       637

    accuracy                           0.79      1523
   macro avg       0.79      0.77      0.78      1523
weighted avg       0.79      0.79      0.79      1523

So, we get an overall accuracy of around 79% using default parameters. One point to note is the recall value for the disaster label: it is a bit on the low side, and one reason for this is the slight imbalance in the dataset.

We can now balance the dataset and train our model again to see if the recall improves for the disaster tweets.

But, guess what? I am not going to do it; instead, you will be doing it for practice. Don’t forget to share the results with other readers.
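If you want a starting point for that exercise, one simple option is to downsample the majority class before the train/test split (a minimal sketch, assuming the cleaned data frame from above):

# Downsample non-disaster tweets (majority class) to match the disaster count
disaster = data[data['target'] == 1]
non_disaster = data[data['target'] == 0].sample(n=len(disaster), random_state=0)
# Concatenate and shuffle the balanced data frame
data_balanced = pd.concat([disaster, non_disaster]).sample(frac=1, random_state=0)
print(data_balanced['target'].value_counts())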

Let’s now train the model with a Support Vector Machine (SVM)…

SVM for Text Classification

To train the model using SVM, you can use the support vector classifier (SVC) from sklearn and specify the learning parameters, i.e., C, gamma, kernel, etc.

In my case, I am using the 'rbf' kernel and C=1 to train a binary classification model:

svm_model = SVC(C=1, kernel='rbf')
svm_model.fit(X_train, y_train)

Once the training phase is complete, evaluate the metrics of the trained model using the test data which we kept aside earlier:

svm_y_pred = svm_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))

Output:

Accuracy Score: 0.7931713722915299
Confusion Matrix:
 [[794  92]
 [223 414]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.90      0.83       886
           1       0.82      0.65      0.72       637

    accuracy                           0.79      1523
   macro avg       0.80      0.77      0.78      1523
weighted avg       0.80      0.79      0.79      1523

Our accuracy is slightly improved compared to the logistic regression model. I bet we can improve it further using a different combination of the SVM model hyperparameters, so I am leaving this task to you for further exploration.
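To give you a head start on that exploration, GridSearchCV from sklearn can sweep a small hyperparameter grid with cross-validation (a sketch; the grid values here are arbitrary starting points, not tuned choices):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1], 'kernel': ['rbf', 'linear']}
# 5-fold cross-validation on the training data, scored by F1
grid = GridSearchCV(SVC(), param_grid, scoring='f1', cv=5)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV F1 score:', grid.best_score_)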

Let’s now train a Random Forest Classifier model on the same dataset and see if we can further improve the accuracy or evaluation metrics…

Random Forest for Text Classification

A Random Forest classifier is essentially many decision tree classifiers combined, with the majority vote deciding the class. You can train a Random Forest classifier using the scikit-learn library in Python:

rf_model = RandomForestClassifier(n_estimators=1000)
rf_model.fit(X_train, y_train)

‘n_estimators’ specifies the number of decision trees used to train the Random Forest classifier.

Let’s observe the accuracy of the Random Forest model on the test data set:

rf_y_pred = rf_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))

Output:

Accuracy Score: 0.7728168089297439
Confusion Matrix:
 [[746 140]
 [206 431]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.84      0.81       886
           1       0.75      0.68      0.71       637

    accuracy                           0.77      1523
   macro avg       0.77      0.76      0.76      1523
weighted avg       0.77      0.77      0.77      1523

We get an accuracy of 77%, which is lower than those of Logistic Regression and Support Vector Machines. This typically happens when the hyperparameters of the classifier are not tuned properly.

I could go on and tune the hyperparameters for this model, but that would go beyond the scope of this short tutorial. So, give it a try and let me know if you are able to improve the accuracy by optimizing the hyperparameters of the Random Forest classifier.
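As a starting point, RandomizedSearchCV samples a fixed number of hyperparameter combinations, which is cheaper than a full grid search for a model this slow to train (the ranges below are illustrative guesses, not tuned values):

from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [None, 20, 50],
    'min_samples_split': [2, 5, 10],
}
# Try 10 random combinations with 3-fold cross-validation, scored by F1
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist,
                            n_iter=10, scoring='f1', cv=3, random_state=0)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)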

Let’s move towards Gradient Boosting Classifier…

Gradient Boosting for Text Classification

A Gradient Boosting classifier is a sequential ensemble of decision trees, where each tree is trained to correct the errors of the previous ones.

To train a model using Gradient Boosting Classifier, we can use sklearn.ensemble.GradientBoostingClassifier() and specify different values of the hyperparameters.

I have specified ‘learning_rate’, which controls how strongly each tree contributes to the ensemble, and ‘n_estimators’, which specifies the total number of decision trees used during training.

gb_model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1)
gb_model.fit(X_train, y_train)

Once the model has been trained, you can evaluate its accuracy and different metrics as follows:

gb_y_pred = gb_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))

Output:

Accuracy Score: 0.7669074195666448
Confusion Matrix:
 [[789  97]
 [258 379]]
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.89      0.82       886
           1       0.80      0.59      0.68       637

    accuracy                           0.77      1523
   macro avg       0.77      0.74      0.75      1523
weighted avg       0.77      0.77      0.76      1523

Again, the accuracy in this case is lower than that of the previously trained models. And again, you have the opportunity to prove yourself…
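Before wrapping up, it is handy to put the four models side by side (a small summary sketch, reusing the predictions computed above):

# Accuracy and F1 score (for the disaster class) of each trained model
models = {
    'Logistic Regression': lr_y_pred,
    'SVM': svm_y_pred,
    'Random Forest': rf_y_pred,
    'Gradient Boosting': gb_y_pred,
}
summary = pd.DataFrame({
    name: {'accuracy': accuracy_score(y_test, pred), 'f1': f1_score(y_test, pred)}
    for name, pred in models.items()
}).T
print(summary)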

Conclusion

In this tutorial, text classification models were built to distinguish disaster tweets from non-disaster ones. First, we cleaned the tweets by removing extra spaces, URLs, punctuation, special characters, numbers, and stop words. The tokenized tweets were then lemmatized, and numerical features were extracted using Tf-Idf vectorization.

Finally, binary classification models were trained using Logistic Regression, Support Vector Machines, Random Forest, and Gradient Boosting classifiers. Evaluation of these machine learning models indicates that almost 80% of the tweets are correctly classified as either disaster or non-disaster.
