How to train CNN for multi-label text classification

In natural language processing, text classification has been revolutionized by advances in deep learning, especially recurrent neural networks (RNNs) and convolutional neural networks (CNNs). In fact, ‘CNN for text classification’ is the new sexy for NLP practitioners in industry, research, and academia because these models efficiently encode the semantics and context of language into mathematical vectors.

Previously, bag-of-words and TF-IDF vectorization were mostly employed to convert textual documents into mathematical vectors. These encoded vectors were then used in various machine learning algorithms for binary or multi-label text classification tasks. This approach was discussed in great detail in one of my previous blog posts:

The bag-of-words approach has an inherent flaw of limited vocabulary and text encoding without considering the semantics/context of different words in a text document. These limitations proved to be a major setback for the NLP tasks which required diverse vocabulary and contextual encoding.
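The limitation described above can be illustrated with a minimal sketch. The vocabulary and sentences are made up for illustration, and `bag_of_words` is a hypothetical helper, not part of any library:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Toy vocabulary (an assumption for illustration only)
vocabulary = ["dog", "bites", "man", "the"]

vec_a = bag_of_words("the dog bites the man", vocabulary)
vec_b = bag_of_words("the man bites the dog", vocabulary)
vec_c = bag_of_words("the puppy bites the man", vocabulary)

print(vec_a)  # word order is lost: same vector as vec_b
print(vec_b)
print(vec_c)  # "puppy" is outside the vocabulary and simply disappears
```

The first two sentences mean opposite things but receive identical vectors, and the out-of-vocabulary word in the third sentence leaves no trace at all.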

To address such limitations, word embeddings pre-trained on a large text corpus are preferred as the input to CNN-based models, since they encode the semantics of each word. One such encoding is the GloVe word embedding, which will be used in this tutorial to train a CNN for multi-label text classification on the 20 newsgroups dataset.

This tutorial will feature the following tools/libraries to classify the news into one of the 20 groups:

  • Python
  • Jupyter Notebook
  • scikit-learn
  • pandas
  • seaborn
  • matplotlib
  • nltk
  • tensorflow
  • keras

Let’s first make sense of the 20 newsgroups dataset…

20 Newsgroups Dataset

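This dataset represents a collection of around 18000 documents from 20 different newsgroups. It is a de facto standard for training and evaluating the performance of machine/deep learning algorithms for multi-label text classification. Strictly speaking, each document belongs to exactly one newsgroup, so the task is multi-class classification; this tutorial uses the term multi-label in that sense.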

This dataset is available in many forms all over the internet. Some resources list the individual files for each newsgroup, whereas some list the individual folders for each newsgroup with all files for the corresponding category in that folder. Whatever the case, we get the following 20 groups of the news for multi-label text classification:

  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

In this tutorial, we will directly load the data using the scikit-learn library. You can, however, download the 20 newsgroups dataset using Kaggle:

Dataset Link: https://www.kaggle.com/datasets/crawford/20-newsgroups

Article Notebook

Code: Text Classification for 20 Newsgroups dataset using CNN and GloVe Embeddings

Prepared By: Awais Naeem

Copyrights: www.embedded-robotics.com

Disclaimer: This code can be distributed with the proper mention of the owner’s copyrights

Notebook Link: https://github.com/embedded-robotics/datascience/blob/master/nlp_text_classification_pretrained_glove_embeddings/text_classification_glove_embeddings.ipynb

Data Reading and Exploratory Analysis

To follow along with this tutorial, set up your Python environment with all the necessary libraries, i.e., pandas, numpy, seaborn, matplotlib, nltk, sklearn, TensorFlow, and Keras. Let’s first import all the libraries:

import os
import re
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import fetch_20newsgroups
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

To use `nltk`, you will need to download some of the resources required for text preprocessing:

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download('omw-1.4')

The next step is to download the training and test instances of the 20 newsgroups dataset using sklearn.datasets.fetch_20newsgroups():

newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)

Now you can verify the names of different newsgroups in the downloaded training data:

print(newsgroup_train.target_names)

Output:

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

For further processing of the data, let’s make a pandas data frame having columns for article and the corresponding category label for both training and testing data:

df_train = pd.DataFrame({'article': newsgroup_train.data, 'label': newsgroup_train.target})
df_train.head()

Output:

	article	                label
0	From: lerxst@wam.umd.edu (where's my thing)\nS...	7
1	From: guykuo@carson.u.washington.edu (Guy Kuo)...	4
2	From: twillis@ec.ecn.purdue.edu (Thomas E Will...	4
3	From: jgreen@amber (Joe Green)\nSubject: Re: W...	1
4	From: jcm@head-cfa.harvard.edu (Jonathan McDow...	14

df_test = pd.DataFrame({'article': newsgroup_test.data, 'label': newsgroup_test.target})
df_test.head()

Output:

	article	label
0	From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. ...	7
1	From: Rick Miller <rick@ee.uwm.edu>\nSubject: ...	5
2	From: mathew <mathew@mantis.co.uk>\nSubject: R...	0
3	From: bakken@cs.arizona.edu (Dave Bakken)\nSub...	17
4	From: livesey@solntze.wpd.sgi.com (Jon Livesey...	19

Before applying any further processing to the raw data, let’s do exploratory analysis to gather interesting insights.

For text-oriented data, the most common first insight is the distribution of word counts across the complete dataset:

df_train['word_count'] = df_train['article'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(10,8))
sns.histplot(data=df_train, x='word_count')
plt.title('Word Count of Articles in Train Data')
plt.xlabel('Word Count')
plt.ylabel('Article Count')
plt.show()
Word Distribution of News Articles in Raw Train Data

From the above plot, it seems that most of the news documents have a word count below 1000. Let’s find out the percentage of the documents containing fewer than 1000 words:

train_articles = (sum(df_train['word_count'] < 1000)/df_train.shape[0])*100
print('Percentage of Training Articles having less than 1000 Words:{:.2f}%'.format(train_articles))

Output:

Percentage of Training Articles having less than 1000 Words:96.80%

With almost 97% of the documents containing fewer than 1000 words, we can truncate all the documents to 1000 words without losing much information.

Let’s find out if the 1000-word truncate boundary also holds true for the test data:

df_test['word_count'] = df_test['article'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(10,8))
sns.histplot(data=df_test, x='word_count')
plt.title('Word Count of Articles in Test Data')
plt.xlabel('Word Count')
plt.ylabel('Article Count')
plt.show()
Word Distribution of News Articles in Raw Test Data

test_articles = (sum(df_test['word_count'] < 1000)/df_test.shape[0])*100
print('Percentage of Test Articles having less than 1000 Words:{:.2f}%'.format(test_articles))

Output:

Percentage of Test Articles having less than 1000 Words:97.09%

So, we can safely assume the 1000-word truncate window for all the documents in the dataset.
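To compare several candidate truncation lengths at once, the percentage calculation above can be wrapped in a small helper. This is only a sketch: `coverage` is a hypothetical function name and the word counts below are made-up numbers, not values from the dataset.

```python
def coverage(word_counts, limit):
    """Percentage of documents with fewer than `limit` words."""
    below = sum(1 for count in word_counts if count < limit)
    return 100.0 * below / len(word_counts)

# Hypothetical word counts, for illustration only
sample_counts = [120, 250, 310, 980, 1500, 45, 2200, 640]

for limit in (300, 500, 1000):
    print(f"limit={limit}: {coverage(sample_counts, limit):.1f}% of documents")
```

With the real data you could pass the data frame column instead, e.g. coverage(df_train['word_count'], 1000).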

Next, we need to check whether the 20 newsgroups are evenly represented in the training data:

plt.figure(figsize=(15,10))
sns.countplot(data=df_train, x='label')
plt.title('Train Data Topics Distribution')
plt.xlabel('Article Topics')
plt.ylabel('Topic Count')
plt.show()
Topic Distribution in Training Data for multi-label text classification

Our training data seems fairly balanced and is good to go for further pre-processing without any upsampling or downsampling.

Let’s now head towards the data cleaning and pre-processing…

Data Cleaning and Pre-processing

The 20 newsgroups dataset is in textual form, and we need to convert it into a mathematical form so that it can be fed into a deep learning or machine learning algorithm. For this tutorial, we will achieve this feat using GloVe word embeddings.

However, before we go on to convert each word into its respective GloVe embedding, we need to perform the following cleaning operations on the text data to get rid of unnecessary textual data:

  1. Remove leading and trailing spaces from each article using strip()
  2. Convert each article to lowercase using lower()
  3. Substitute the newline character (\n) with a space using re.sub()
  4. Remove punctuation and special characters from each individual word using isalnum()
  5. Remove English stop words (the, which, where, and, is, how, who) from the text using nltk.corpus.stopwords.words('english')
  6. Remove any word having fewer than 3 characters, since such a word rarely contributes solid information about the news category
  7. Lemmatize each token in the text using WordNetLemmatizer(), which reduces each token to its base form according to its part of speech, e.g., Caring -> Care, Stripes (Verb) -> Strip (Verb), Stripes (Noun) -> Stripe (Noun), Walking -> Walk, Running -> Run

To remove the stop words, a list of English stop words is generated using nltk.corpus.stopwords.words('english'). The text data is then tokenized into separate words, and only those words that are not in the stop-word list are kept in the final tokenized list.

WordNetLemmatizer() needs the wordnet tags (wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV) for each word to perform the lemmatization. We can use nltk.pos_tag() to generate the tag for each token and then write a wrapper function to convert the nltk tags into wordnet tags i.e., get_wordnet_pos (tag):

def get_wordnet_pos (tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
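The same prefix-based mapping can also be written as a dictionary lookup. The sketch below avoids importing nltk by using the single-letter strings that nltk assigns to wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV; the function name `wordnet_pos_from_prefix` is hypothetical.

```python
# String values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV in nltk
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> wordnet part of speech
PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def wordnet_pos_from_prefix(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a wordnet tag, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(tag[:1], NOUN)

for tag in ("JJ", "VBD", "NNS", "RB", "IN"):
    print(tag, "->", wordnet_pos_from_prefix(tag))
```

Tags that do not start with J, V, N or R (such as IN for prepositions) fall back to the noun tag, which mirrors the else branch of get_wordnet_pos().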

We have also defined a function to lemmatize each word in a tokenized text data and then return all the lemmatized tokens combined into a string:

def lemmatize (word_list):
    wl = WordNetLemmatizer()
    word_pos_tags = pos_tag(word_list)
    lemmatized_list = []
    for tag in word_pos_tags:
        lemmatize_word = wl.lemmatize(tag[0],get_wordnet_pos(tag[1]))
        lemmatized_list.append(lemmatize_word)
    return " ".join(lemmatized_list)

All of the above text-cleaning operations are performed in a custom Python function which is then applied to each article using the apply() method of the pandas data frame:

def clean_text (text):
    # Remove Pre and Post Spaces
    text = str(text).strip()
    
    # Lower case the entire text
    text = str(text).lower()

    # Substitute New Line Characters with spaces 
    text = re.sub(r"\n", r" ", text)
        
    # Tokenize the sentence
    word_tokens = word_tokenize(text)
    
    # Remove the punctuation and  special characters from each individual word
    cleaned_text = []
    for word in word_tokens:
        cleaned_text.append("".join([char for char in word if char.isalnum()]))
    
    # Specify the stop words list
    stop_words = stopwords.words('english')
    
    # Remove the stopwords and words containing fewer than 3 characters
    text_tokens = [word for word in cleaned_text if (len(word) > 2) and (word not in stop_words)]
    
    #Lemmatize each word in the word list
    text = lemmatize (text_tokens)
    
    return text
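To see what steps 1 to 6 do without installing any nltk resources, here is a dependency-free sketch. It is a simplification and not the function used in this tutorial: it splits on whitespace instead of calling word_tokenize(), skips lemmatization, and uses a tiny made-up stop-word set in place of the nltk list.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch
STOP_WORDS = {"the", "is", "this", "and", "was", "what", "from"}

def clean_text_simple(text):
    # Steps 1-2: strip surrounding whitespace and lowercase
    text = str(text).strip().lower()
    # Step 3: replace newline characters with spaces
    text = re.sub(r"\n", " ", text)
    # Simplified tokenization: split on whitespace
    tokens = text.split()
    # Step 4: keep only alphanumeric characters inside each token
    tokens = ["".join(ch for ch in tok if ch.isalnum()) for tok in tokens]
    # Steps 5-6: drop stop words and tokens shorter than 3 characters
    tokens = [tok for tok in tokens if len(tok) > 2 and tok not in STOP_WORDS]
    return " ".join(tokens)

sample = "  Subject: WHAT car is this!?\nIt was a 2-door sports car.  "
print(clean_text_simple(sample))  # subject car 2door sports car
```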

Let’s check how well our text-cleaning procedure works by applying the function to the first news article in the training data:

df_train['article'][0]

Output:

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
clean_text (df_train['article'][0])

Output:

'lerxst wamumdedu thing subject car nntppostinghost rac3wamumdedu organization university maryland college park line wonder anyone could enlighten car saw day 2door sport car look late 60 early 70 call bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car make history whatever info funky look car please email thanks bring neighborhood lerxst'

Hmmm… Nice cleaning!

Let’s apply the cleaning procedure to both the training and test data frames using apply() method:

df_train['article'] = df_train['article'].apply(lambda x: clean_text(x))
df_test['article'] = df_test['article'].apply(lambda x: clean_text(x))

Since the cleaning has stripped off unnecessary words from the raw text, we need to calculate the word distribution again on this cleaned dataset:

df_train['word_count'] = df_train['article'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(10,8))
sns.histplot(data=df_train, x='word_count')
plt.title('Word Count of Articles in Train Data after data cleaning')
plt.xlabel('Word Count')
plt.ylabel('Article Count')
plt.show()
Word Distribution of News Articles in Cleaned Train Data

train_articles = (sum(df_train['word_count'] < 300)/df_train.shape[0])*100
print('Percentage of Training Articles having less than 300 Words:{:.2f}%'.format(train_articles))

Output:

Percentage of Training Articles having less than 300 Words:92.05%

So, the 1000-word truncate boundary for the news articles in the raw data set has come down to a 300-word boundary in the cleaned dataset. NICE, isn’t it?

Let’s see if the 300-word boundary also holds true for the test data set:

df_test['word_count'] = df_test['article'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(10,8))
sns.histplot(data=df_test, x='word_count')
plt.title('Word Count of Articles in Test Data after data cleaning')
plt.xlabel('Word Count')
plt.ylabel('Article Count')
plt.show()
Word Distribution of News Articles in Cleaned Test Data

test_articles = (sum(df_test['word_count'] < 300)/df_test.shape[0])*100
print('Percentage of Test Articles having less than 300 Words:{:.2f}%'.format(test_articles))

Output:

Percentage of Test Articles having less than 300 Words:92.37%

So, we can now safely truncate our news articles to 300 words, which will help us save computing resources and time during model training.

Let’s extract the training and testing articles/labels into separate pandas Series before we go on and convert the articles into numerical features using GloVe word embeddings:

X_train = df_train['article']
y_train = df_train['label']
X_test = df_test['article']
y_test = df_test['label']
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Output:

X_train: (11314,)
X_test: (7532,)
y_train: (11314,)
y_test: (7532,)

GloVe Word Embeddings

Until now, I have used the three-word phrase ‘GloVe Word Embeddings’ many times without explaining what it actually means. So, let’s unravel this concept…

In simple terms, GloVe Word Embeddings are the pretrained representation of each word into a mathematical vector using a large corpus of text. These mathematical vectors are then used to replace each word in an article or textual document.

Essentially, GloVe embeddings are used to convert textual information into a mathematical form which is then inputted into a deep learning network for further fine-tuning in the context of a specific application.

During the training, GloVe ensures that similar words are clustered together and possess close-by vector embeddings, whereas different words are placed at a greater distance from each other with far-away embedding vectors.
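"Close-by" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real GloVe vectors have 50 to 300 dimensions and different values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT real GloVe values
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_car_truck = cosine_similarity(toy_vectors["car"], toy_vectors["truck"])
sim_car_banana = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])

print(f"car vs truck:  {sim_car_truck:.3f}")   # close to 1: similar words
print(f"car vs banana: {sim_car_banana:.3f}")  # close to 0: unrelated words
```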

The most important catch here is that GloVe does not rely only on the local context of each word but also incorporates global word co-occurrence counts from the whole corpus to obtain the word vectors. And although GloVe is trained on a large corpus of text, the pre-trained vectors only cover the vocabulary present in that training corpus, so a word outside that vocabulary has no pre-trained vector.

If you need to look into more technical details about GloVe, you can refer to the following article:

Pre-trained GloVe embeddings can be downloaded from the official page of the Stanford University NLP group. They are created by training on different text corpora and are available with different numbers of tokens as well as different vector dimensions for each token.

For this tutorial, I will use the one trained on Wikipedia/Gigaword Corpus having 6B tokens with 100 dimensions for each token. You can try out the other trained embeddings as well because the procedure to use GloVe embeddings in a deep neural network will stay the same.
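The pre-trained GloVe files are plain text, with one word per line followed by its vector components separated by spaces. A minimal parsing sketch is shown below; `load_glove` is a hypothetical helper, and the two sample lines are made-up 3-dimensional vectors standing in for the real 100-dimensional file.

```python
import io

def load_glove(lines):
    """Parse GloVe-style text lines into a {word: [float, ...]} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings

# In-memory stand-in for the real file (made-up values)
sample_file = io.StringIO(
    "car 0.1 -0.2 0.3\n"
    "engine 0.0 0.5 -0.1\n"
)

embeddings = load_glove(sample_file)
print(len(embeddings), "words loaded")
print(embeddings["car"])
```

With the downloaded embeddings, the same function could be called on an open file handle, assuming the unzipped file is named glove.6B.100d.txt: open it with encoding="utf-8" and pass the handle to load_glove().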

Now that you have a rough idea about GloVe word embeddings, let’s first learn how we can use the Keras embedding layer in a CNN for text classification. Later, we will learn how to use a pre-trained word embedding matrix in the embedding layer for multi-label text classification.

However, before all that, we need to tokenize our training/testing data, convert each article into text sequences and then pad all the text sequences to have a constant length of 300.

Word Tokenization for Multi-Label Text Classification

In this tutorial, we will use the Keras tokenizer from the text preprocessing library i.e., keras.preprocessing.text.Tokenizer(). Essentially, the Keras tokenizer processes all the individual tokens in the dataset articles and then assigns an ID to each unique token in the form of a dictionary.

During initialization, we need to define num_words which specifies the number of most frequent words to be used in the tokenizer vocabulary:

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(X_train)

Once the tokenizer has processed the training data, we can see the indexed vocabulary of the training data generated by the tokenizer as follows:

print(tokenizer.index_word)

Output:

{1: 'line',
 2: 'subject',
 3: 'organization',
 4: 'would',
 5: 'one',
 6: 'write',
 7: 'use',
 8: 'get',
 9: 'say',
 10: 'article',
 11: 'know',
 12: 'people',
 13: 'like',
 14: 'make',
 15: 'think',
 16: 'university',
 17: 'time',
 18: 'nntppostinghost',
 19: 'max',
 20: 'well',
 21: 'good',
 22: 'also',
 23: 'see',
 24: 'new',
 25: 'work',
...
 997: 'soldier',
 998: 'ability',
 999: 'commit',
 1000: 'ken',
 ...}
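The indexing idea can be approximated in a few lines of plain Python. This is a rough sketch of the behaviour, not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical names, and the real Tokenizer also applies its own filtering and lowercasing.

```python
from collections import Counter

def build_word_index(texts):
    """Assign IDs starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words=None):
    """Replace each known word by its ID; unknown or too-rare words are dropped."""
    sequence = []
    for word in text.split():
        idx = word_index.get(word)
        if idx is None:
            continue  # word never seen during fitting
        if num_words is not None and idx >= num_words:
            continue  # outside the most frequent (num_words - 1) words
        sequence.append(idx)
    return sequence

toy_texts = ["car engine car", "engine car wheel"]
word_index = build_word_index(toy_texts)

print(word_index)
print(to_sequence("wheel car bumper engine", word_index))
print(to_sequence("wheel car bumper engine", word_index, num_words=3))
```

In the toy example, "bumper" is dropped because it was never seen while fitting, and with num_words=3 the rarest word "wheel" is dropped as well.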

Let’s now see the total vocabulary size for our training data:

vocab_size = len(tokenizer.index_word) + 1
print('Vocab Size:', vocab_size)

Output:

Vocab Size: 148442

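So, the tokenizer has indexed 148441 unique tokens in our training data, each of which has a unique ID in the tokenizer dictionary; the extra 1 in vocab_size accounts for index 0, which is reserved for padding. Note that because num_words=100000 was set, texts_to_sequences() only keeps the 99999 most frequent tokens and drops the rest.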

Now we can convert each article in our training/testing data such that each tokenized word is replaced by its numerical index as represented in tokenizer.index_word:

X_train_token = tokenizer.texts_to_sequences(X_train)
X_test_token = tokenizer.texts_to_sequences(X_test)

Let’s see how the first two articles in the training data are converted to tokenizer sequences:

print("First Instance Text:\n")
print(X_train[0])
print("\nFirst Instance Total Words:", len(str(X_train[0]).split()))

Output:

First Instance Text:
lerxst wamumdedu thing subject car nntppostinghost rac3wamumdedu organization university maryland college park line wonder anyone could enlighten car saw day 2door sport car look late 60 early 70 call bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car make history whatever info funky look car please email thanks bring neighborhood lerxst

First Instance Total Words: 62

print("First Instance Text Sequence:\n")
print(X_train_token[0])
print("\nFirst Instance Text Sequence Length:", len(X_train_token[0]))

Output:

First Instance Text Sequence:
[26912, 4593, 36, 2, 97, 18, 18457, 3, 16, 2162, 441, 980, 1, 419, 58, 27, 5485, 97, 525, 76, 18458, 1043, 97, 40, 498, 9326, 531, 7189, 53, 26913, 899, 71, 248, 926, 640, 5283, 1127, 584, 414, 11, 58, 41656, 374, 106, 645, 1922, 29, 1952, 97, 14, 429, 618, 239, 18459, 40, 97, 68, 78, 90, 420, 4079, 26912]

First Instance Text Sequence Length: 62

You can see that the first article had a total of 62 tokenized words which are converted into their corresponding dictionary index within the sequence generated by Keras Tokenizer.

For further understanding, let’s see the text and the corresponding sequence generated by Keras Tokenizer for the second article:

print("Second Instance Text:\n")
print(X_train[1])
print("\nSecond Instance Total Words:", len(str(X_train[1]).split()))

Output:

Second Instance Text:
guykuo carsonuwashingtonedu guy kuo subject clock poll final call summary final call clock report keywords acceleration clock upgrade articleid shelley1qvfo9innc3s organization university washington line nntppostinghost carsonuwashingtonedu fair number brave soul upgrade clock oscillator share experience poll please send brief message detail experience procedure top speed attain cpu rat speed add card adapter heat sink hour usage per day floppy disk functionality 800 floppy especially request summarize next two day please add network knowledge base do clock upgrade answer poll thanks guy kuo guykuo uwashingtonedu

Second Instance Total Words: 84

print("Second Instance Text Sequence:\n")
print(X_train_token[1])
print("\nSecond Instance Text Sequence Length:", len(X_train_token[1]))

Output:

Second Instance Text Sequence:
[10702, 3064, 335, 7862, 2, 1006, 3096, 654, 53, 442, 654, 53, 1006, 237, 241, 3574, 1006, 988, 403, 62852, 3, 16, 536, 1, 18, 3064, 1260, 65, 1333, 1334, 988, 1006, 5984, 707, 326, 3096, 68, 123, 2079, 179, 544, 326, 1822, 464, 271, 7863, 1245, 2219, 271, 320, 122, 1839, 1619, 4196, 778, 2321, 546, 76, 911, 223, 3957, 1632, 911, 529, 494, 3922, 202, 48, 76, 68, 320, 351, 736, 172, 119, 1006, 988, 194, 3096, 90, 335, 7862, 10702, 4438]

Second Instance Text Sequence Length: 84

The second article having 84 tokenized words is converted into a sequence of length 84, with each tokenized word represented by its relevant index in the tokenizer dictionary. So far, so good…

You may have noticed that articles having different word lengths are being converted into sequences having different lengths e.g., article 1 having 62 words was converted into a sequence of 62 indices, whereas article 2 having 84 words was encoded into a sequence having 84 relevant indices.

While it may seem simple to human understanding, deep learning networks need to compose the sequence of each article into a unified matrix having the same dimensions for each encoded sequence. To achieve such uniformity, we need to pad or truncate the encoded sequences into fixed-length vectors irrespective of the word length in each article.

If you remember correctly, we identified that almost 92% of articles in the cleaned data contain fewer than 300 words. So now we will post-pad these sequences to a fixed length of 300.

Post-padding ensures that articles having fewer than 300 tokens are padded with 0’s at the end, whereas articles having more than 300 tokens are truncated. Note that pad_sequences() truncates from the start of a sequence by default (truncating='pre'), so the code below keeps the last 300 tokens of a long article; pass truncating='post' if you would rather keep the first 300. Let’s see how this post-padding looks for article 1 and article 2:

sequence_len = 300
X_train_token = pad_sequences(X_train_token, padding='post', maxlen=sequence_len)
X_test_token = pad_sequences(X_test_token, padding='post', maxlen=sequence_len)
print("First Instance Text Sequence:\n")
print(X_train_token[0])
print("\nFirst Instance Text Sequence Length:", len(X_train_token[0]))

Output:

First Instance Text Sequence:
[26912  4593    36     2    97    18 18457     3    16  2162   441   980
     1   419    58    27  5485    97   525    76 18458  1043    97    40
   498  9326   531  7189    53 26913   899    71   248   926   640  5283
  1127   584   414    11    58 41656   374   106   645  1922    29  1952
    97    14   429   618   239 18459    40    97    68    78    90   420
  4079 26912     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
...
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]

First Instance Text Sequence Length: 300

print("Second Instance Text Sequence:\n")
print(X_train_token[1])
print("\nSecond Instance Text Sequence Length:", len(X_train_token[1]))

Output:

Second Instance Text Sequence:
[10702  3064   335  7862     2  1006  3096   654    53   442   654    53
  1006   237   241  3574  1006   988   403 62852     3    16   536     1
    18  3064  1260    65  1333  1334   988  1006  5984   707   326  3096
    68   123  2079   179   544   326  1822   464   271  7863  1245  2219
   271   320   122  1839  1619  4196   778  2321   546    76   911   223
  3957  1632   911   529   494  3922   202    48    76    68   320   351
   736   172   119  1006   988   194  3096    90   335  7862 10702  4438
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
...
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]

Second Instance Text Sequence Length: 300
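The padding and truncation behaviour seen above can be sketched for a single sequence in plain Python. `pad_or_truncate` is a hypothetical helper that mimics the padding and truncating options of pad_sequences(), both of which default to 'pre' in Keras.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly `maxlen` items, padding or truncating as needed."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' appends it
    return pad + seq if padding == "pre" else seq + pad

short_seq = [1, 2, 3]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_seq, 5, padding="post"))
print(pad_or_truncate(long_seq, 4, padding="post"))
print(pad_or_truncate(long_seq, 4, padding="post", truncating="post"))
```

The second call shows that padding='post' alone does not change where a long sequence is cut: the first items are dropped unless truncating='post' is also passed.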

Now that we have the padded sequences ready to be fed into a convolutional neural network (CNN), let’s explore how we can use CNN for text classification…

CNN for Text Classification

To use a CNN for text classification, we will use the Sequential() model in TensorFlow Keras and add the layers to this model one by one.

In this section, we will train our own word embeddings on the news training data via the Keras Embedding() layer. Once we know how to train custom word embeddings in a deep learning network, we can then incorporate the pre-trained GloVe embeddings into the network.

For custom word embeddings, we will need to specify the vocabulary size of the training dataset, the sequence length of each padded text sequence, and the embedding dimensions of each token/word in the Embedding() layer.

For this tutorial, we will stick to 100 embedding dimensions for each token and a sequence length of a maximum of 300 tokens:

embedding_dim = 100
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=sequence_len))

Now, we need to define a 1-dimensional convolutional layer with a specific number of filters and kernel size, followed by a 1-dimensional global max-pooling layer:

model.add(layers.Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(layers.GlobalMaxPool1D())
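To make these two layers less of a black box, here is a pure-Python sketch of what a single convolution filter followed by global max pooling computes. The toy sequence and kernel values are made up, and `conv1d_valid` is a hypothetical helper; the real layer learns 128 such filters over 100-dimensional embeddings.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over the sequence (no padding) and apply relu.

    sequence: list of T embedding vectors, each of dimension D
    kernel:   list of k weight vectors, each of dimension D
    Returns a feature map of length T - k + 1.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            vector, weights = sequence[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(vector, weights))
        outputs.append(max(0.0, total))  # relu activation
    return outputs

# Toy input: T=4 tokens with D=2 embedding dimensions (made-up values)
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Toy filter with kernel_size k=2 (made-up weights)
toy_kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_valid(toy_sequence, toy_kernel)
pooled = max(feature_map)  # global max pooling keeps the strongest response

print(feature_map)  # length 4 - 2 + 1 = 3
print(pooled)
```

The same length rule applies to the model in this tutorial: a sequence of 300 tokens and a kernel size of 5 give 300 - 5 + 1 = 296 positions per filter, and global max pooling reduces each of the 128 feature maps to a single number.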

To complete the model, we will add fully connected hidden layers with ‘relu’ activation and an output layer with ‘softmax’ activation having the same number of neurons as there are news categories (20):

model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(20, activation='softmax'))
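The ‘softmax’ output and the ‘sparse_categorical_crossentropy’ loss used when compiling the model can be sketched for a single example as follows. The logits are made-up numbers, and both functions are simplified stand-ins for the Keras versions, which operate on batches.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_label, probabilities):
    """Loss for one example whose label is an integer class index."""
    return -math.log(probabilities[true_label])

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for 3 classes
probs = softmax(toy_logits)
loss = sparse_categorical_crossentropy(0, probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))

# A model that guesses uniformly over 20 classes has a loss of ln(20)
uniform_loss = sparse_categorical_crossentropy(3, softmax([0.0] * 20))
print(round(uniform_loss, 3))
```

The "sparse" variant is the appropriate choice here because y_train holds integer class indices (0 to 19) rather than one-hot vectors. The uniform-guess value of roughly 3.0 is also a useful reference point when reading the loss of the first training epoch.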

Now we can compile the model by defining the optimization and loss calculation algorithms:

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 300, 100)          14844200  
_________________________________________________________________
conv1d (Conv1D)              (None, 296, 128)          64128     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 20)                660       
=================================================================
Total params: 14,919,324
Trainable params: 14,919,324
Non-trainable params: 0
_________________________________________________________________
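
The parameter counts in the summary can be reproduced by hand. The short check below is only a sketch: vocab_size is not stated in this section, so the value used here is inferred from the embedding row of the summary (14,844,200 / 100), and dense_params is a helper name introduced for the example.

```python
vocab_size = 148442       # inferred from the summary: 14,844,200 / 100
embedding_dim = 100
sequence_len = 300
filters, kernel_size = 128, 5


def dense_params(n_in, n_out):
    # One weight per input/output pair plus one bias per output neuron
    return n_in * n_out + n_out


embedding_params = vocab_size * embedding_dim
# Each filter spans kernel_size tokens x embedding_dim values, plus one bias
conv_params = kernel_size * embedding_dim * filters + filters
# Without padding, the filter fits into sequence_len - kernel_size + 1 positions
conv_output_len = sequence_len - kernel_size + 1
dense_counts = [dense_params(128, 64), dense_params(64, 32), dense_params(32, 20)]
total = embedding_params + conv_params + sum(dense_counts)

print(embedding_params, conv_params, conv_output_len, dense_counts, total)
```

Almost all of the parameters sit in the embedding layer, which is why the choice of embeddings matters so much for this model.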

Now we can fit the CNN model for text classification using the training dataset, whereas the validation dataset will be used to calculate the validation loss/accuracy during training:

history = model.fit(X_train_token, y_train, epochs=20, validation_data=(X_test_token, y_test), batch_size=128)

Because of limited computing resources, I have restricted the training to only 20 epochs, where I achieved almost 100% training accuracy and about 76% validation accuracy:

metrics_df = pd.DataFrame(history.history)
print(metrics_df)

Output:

        loss  accuracy  val_loss  val_accuracy
0   2.846578  0.202051  2.360817      0.324482
1   1.140805  0.709829  0.982819      0.720127
2   0.203573  0.958900  0.894774      0.763144
3   0.036105  0.996641  0.933649      0.763808
4   0.014403  0.998763  0.991637      0.759161
5   0.009420  0.998851  1.032481      0.757833
6   0.009345  0.998586  1.036683      0.763808
7   0.005411  0.998939  1.054557      0.762878
8   0.004285  0.999205  1.074672      0.764604
9   0.004339  0.999293  1.101174      0.760489
10  0.002901  0.999381  1.152618      0.755311
11  0.003112  0.999381  1.139402      0.763409
12  0.003092  0.999381  1.154414      0.759692
13  0.002861  0.999381  1.221977      0.753585
14  0.001516  0.999646  1.186094      0.758364
15  0.002195  0.999470  1.233693      0.756240
16  0.001180  0.999646  1.204741      0.760754
17  0.001107  0.999823  1.271995      0.753850
18  0.001149  0.999646  1.247182      0.760489
19  0.000599  0.999823  1.223479      0.764737
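
The table shows that the validation loss reaches its minimum early and then climbs while the training loss keeps falling, which is a typical sign of overfitting. A small sketch for locating that epoch is given below; the val_loss values are copied from the table above, and with the DataFrame at hand metrics_df['val_loss'].idxmin() should give the same index.

```python
# Validation loss per epoch, copied from the training history above
val_loss = [
    2.360817, 0.982819, 0.894774, 0.933649, 0.991637,
    1.032481, 1.036683, 1.054557, 1.074672, 1.101174,
    1.152618, 1.139402, 1.154414, 1.221977, 1.186094,
    1.233693, 1.204741, 1.271995, 1.247182, 1.223479,
]

# Index of the epoch with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"lowest validation loss {val_loss[best_epoch]} at epoch {best_epoch}")
print(f"validation loss after the last epoch: {val_loss[-1]}")
```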

We can now plot the loss and accuracy for the training and validation datasets using matplotlib to get an idea of how the training progressed:

plt.figure(figsize=(10,5))
plt.plot(metrics_df.index, metrics_df.loss)
plt.plot(metrics_df.index, metrics_df.val_loss)
plt.title('Newsgroup 20 Dataset Neural Network Training with Corpus Word Embeddings')
plt.xlabel('Epochs')
plt.ylabel('Categorical Crossentropy Loss')
plt.legend(['Training Loss', 'Validation Loss'])
plt.show()
Training and Validation Loss during training CNN for text classification
plt.figure(figsize=(10,5))
plt.plot(metrics_df.index, metrics_df.accuracy)
plt.plot(metrics_df.index, metrics_df.val_accuracy)
plt.title('Newsgroup 20 Dataset Neural Network Training with Corpus Word Embeddings')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(['Training Accuracy', 'Validation Accuracy'])
plt.show()
Training and Validation Accuracy during training CNN for text classification

Multi-label Text Classification using GloVe Embeddings

To use pre-trained GloVe embeddings for multi-label text classification, we first need to create an embedding matrix from the downloaded GloVe embeddings and then use that matrix as pre-trained weights in the deep neural network.

To create the embedding matrix, we need to look at the entire vocabulary of the training dataset created by the Keras tokenizer, i.e., tokenizer.word_index. For each word in the training data vocabulary, we will check whether the word exists in the pre-trained GloVe embeddings or not.

If the word exists, we will place its GloVe embedding at the corresponding index of the embedding matrix; otherwise, that row simply stays a zero vector.

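import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Index 0 is reserved by the Keras tokenizer for padding, hence the +1
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # GloVe files are UTF-8 text: one word followed by its vector on each line
    with open(filepath, encoding='utf-8') as file:
        for line in file:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    # Words that are missing from GloVe keep their all-zero rows
    return embedding_matrix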

embedding_dim = 100
embedding_matrix = create_embedding_matrix(glove_embedding_filepath, tokenizer.word_index, embedding_dim)

The resulting embedding matrix has one row for each index in the training data vocabulary. Here we have used 100 dimensions to represent each index as a mathematical vector.
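
Since words that are missing from GloVe end up as zero vectors, it can be useful to check how much of the vocabulary is actually covered. The helper below is a minimal sketch of such a check: glove_coverage is a name I made up, and the toy file and toy vocabulary only stand in for the real GloVe file and tokenizer.word_index.

```python
import os
import tempfile


def glove_coverage(filepath, word_index):
    """Return the fraction of vocabulary words that have a vector in the GloVe file."""
    found = set()
    with open(filepath, encoding="utf-8") as file:
        for line in file:
            word = line.split(" ", 1)[0]  # the word is the first field on each line
            if word in word_index:
                found.add(word)
    return len(found) / len(word_index) if word_index else 0.0


# Toy GloVe-style file with made-up vectors, standing in for the real download
toy_lines = ["the 0.1 0.2 0.3\n", "news 0.4 0.5 0.6\n", "space 0.7 0.8 0.9\n"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.writelines(toy_lines)
    toy_path = tmp.name

# Toy vocabulary in the same shape as tokenizer.word_index (word -> index)
toy_word_index = {"the": 1, "news": 2, "hockey": 3, "xyzzy": 4}
coverage = glove_coverage(toy_path, toy_word_index)
os.remove(toy_path)
print(f"{coverage:.0%} of the vocabulary has a pre-trained vector")
```

A low coverage value would mean that many rows of the embedding matrix start out as zeros and have to be learned from scratch during training.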

Once we have the embedding matrix, we can simply use the previously defined deep learning model, albeit with two changes in the Embedding Layer.

Firstly, we need to initialize the weights of the Embedding layer with the embedding matrix. Secondly, we need to set trainable=True so that the embedding weights are fine-tuned for the news categories in the training data:

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=sequence_len, weights = [embedding_matrix], trainable = True))
model.add(layers.Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(20, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Output:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 300, 100)          14844200  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 296, 128)          64128     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_4 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_5 (Dense)              (None, 20)                660       
=================================================================
Total params: 14,919,324
Trainable params: 14,919,324
Non-trainable params: 0
_________________________________________________________________
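
Because the Embedding layer is created with trainable=True, all of its weights are counted as trainable in the summary. The small calculation below sketches how that split would change if the layer were frozen with trainable=False instead; the layer sizes are taken from the summary above, and split_counts is a helper name introduced for illustration.

```python
embedding_params = 14_844_200                # embedding_1 row of the summary
other_params = 64_128 + 8_256 + 2_080 + 660  # Conv1D plus the three Dense layers


def split_counts(embedding_trainable):
    # Returns (trainable, non-trainable) parameter counts for the whole model
    if embedding_trainable:
        return other_params + embedding_params, 0
    return other_params, embedding_params


for flag in (True, False):
    trainable, frozen = split_counts(flag)
    print(f"trainable={flag}: {trainable:,} trainable, {frozen:,} non-trainable")
```

Freezing the layer would leave only the convolutional and dense weights to be updated, whereas the setting used in this tutorial lets training adjust the GloVe vectors as well.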

Now, we can start the model training using model.fit(), which will take considerable time and computing resources.

For this tutorial, I limited the training to only 20 epochs because of limited computing resources and still achieved a training accuracy of around 100% and a validation accuracy of around 73% at the end of the 20th epoch.

history1 = model.fit(X_train_token, y_train, epochs=20, validation_data=(X_test_token, y_test), batch_size=128)
metrics_df = pd.DataFrame(history1.history)
print(metrics_df)

Output:

        loss  accuracy  val_loss  val_accuracy
0   2.485635  0.277709  1.823494      0.487122
1   1.134279  0.675800  1.181438      0.618826
2   0.565645  0.847888  1.024888      0.686803
3   0.279554  0.941135  0.966511      0.706054
4   0.126243  0.982500  1.027518      0.707515
5   0.054279  0.996111  1.031550      0.716543
6   0.026952  0.998851  1.059474      0.719995
7   0.017124  0.999028  1.087930      0.722385
8   0.012494  0.999205  1.104241      0.723447
9   0.009684  0.999293  1.149722      0.724774
10  0.008914  0.999028  1.131223      0.728226
11  0.006960  0.999205  1.196285      0.726766
12  0.007568  0.999293  1.232153      0.723181
13  0.006152  0.999470  1.228833      0.725836
14  0.006196  0.999381  1.236231      0.726102
15  0.006036  0.999381  1.227614      0.730749
16  0.007545  0.999205  1.273850      0.716277
17  0.004864  0.999470  1.381622      0.710170
18  0.006044  0.999293  1.287276      0.725571
19  0.005425  0.999558  1.249408      0.730350

To plot the loss and accuracy for both training/validation sets, we can make use of matplotlib and the metrics_df:

plt.figure(figsize=(10,5))
plt.plot(metrics_df.index, metrics_df.loss)
plt.plot(metrics_df.index, metrics_df.val_loss)
plt.title('Newsgroup 20 Dataset Neural Network Training with GLOVE Word Embeddings')
plt.xlabel('Epochs')
plt.ylabel('Categorical Crossentropy Loss')
plt.legend(['Training Loss', 'Validation Loss'])
plt.show()
Training and Validation Loss during training of GloVe embeddings for multi-label text classification
plt.figure(figsize=(10,5))
plt.plot(metrics_df.index, metrics_df.accuracy)
plt.plot(metrics_df.index, metrics_df.val_accuracy)
plt.title('Newsgroup 20 Dataset Neural Network Training with GLOVE Word Embeddings')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(['Training Accuracy', 'Validation Accuracy'])
plt.show()
Training and Validation Accuracy during training of GloVe embeddings for multi-label text classification

Conclusion

In this tutorial, we have employed CNN for text classification of the 20 newsgroups dataset and trained a multi-label text classification model to classify the news articles into one of the 20 news categories. Moreover, we have used pre-trained GloVe embeddings within the CNN model to introduce the concept of static word embeddings for multi-label text classification.

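During evaluation, article categories were predicted with an accuracy of around 100% on the training data and 73% to 76% on the validation data (about 76% with embeddings learned from the corpus and about 73% with the GloVe embeddings). Since the validation loss stops improving after the first few epochs in both runs, further gains are more likely to come from using a GloVe embedding with more tokens and/or dimensions than from simply training for more epochs.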
