Topic Modeling with ML Techniques


Topic modeling is a technique for discovering the themes that exist in large sets of data. It is an unsupervised learning technique in which the model tries to infer the presence of underlying topics without ground-truth labels. It is useful in a wide range of industries, including healthcare, finance, and marketing, where there is a lot of text-based data to analyze. Using topic modeling, organizations can quickly gain insights into the topics that matter most to their business, which can help them make better decisions and improve their products and services.

This article was published as a part of the Data Science Blogathon.

Project Description

Topic modeling is valuable for numerous industries, including but not limited to finance, healthcare, and marketing. It is especially useful for industries that deal with huge amounts of unstructured text data, such as customer reviews, social media posts, or medical records, because it greatly reduces the time and labor needed to analyze that text by hand.

For example, in the healthcare industry, topic modeling can identify common themes or patterns in patient records that can help improve patient outcomes, identify risk factors, and guide clinical decision-making. In finance, topic modeling can analyze news articles, financial reports, and other text data to identify trends, market sentiment, and potential investment opportunities.

In the marketing industry, topic modeling can analyze customer feedback, social media posts, and other text data to identify customer needs and preferences and to develop targeted marketing campaigns. This can help companies improve customer satisfaction, increase sales, and gain a competitive edge in the market.

In general, topic modeling can help extract insights from large amounts of text data quickly and efficiently. By identifying key topics or themes, organizations can make informed decisions, improve their products and services, and gain a competitive advantage in their respective industries.

Problem Statement

The aim is to perform topic modeling on the "A Million Headlines" news dataset, a collection of over one million news article headlines published by the ABC.

Using LDA (Latent Dirichlet Allocation), this project aims to identify the main topics and themes in the news headlines dataset. LDA is a probabilistic generative model that assumes that each document is a mixture of several topics and that each topic is a distribution over words. Like any technique, LDA has advantages as well as disadvantages, and the project explores how well it is suited to analyzing the news headlines dataset.
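To make the "mixture of topics" assumption concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The two topics, their word probabilities, and the Dirichlet parameter are made up for illustration and are not taken from the dataset.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Draw a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]            # pick a topic
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])   # pick a word from that topic
    return theta, words

# Hypothetical two-topic model over a tiny made-up vocabulary
topic_word = [
    {"police": 0.5, "court": 0.3, "charge": 0.2},
    {"rain": 0.4, "flood": 0.4, "drought": 0.2},
]
rng = random.Random(0)
theta, headline = generate_document(topic_word, alpha=[0.5, 0.5], n_words=6, rng=rng)
print(theta, headline)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.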

By identifying the main themes in the news headlines dataset, the project aims to provide insights into the types of news stories that the ABC covers. Journalists, editors, and media organizations can use this information to better understand their audience and to tailor their news coverage to the needs and interests of their readers.

Dataset Description

The dataset contains a large collection of news headlines published over a period of nineteen years, between February 19, 2003, and December 31, 2021. The data is sourced from the Australian Broadcasting Corporation (ABC), a reputable news organization in Australia. The dataset is provided in CSV format and contains two columns: "publish_date" and "headline_text".

The "publish_date" column gives the date when the news article was published, in YYYYMMDD format. The "headline_text" column contains the text of the headline, written in lowercase ASCII English.
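As a small illustration of the YYYYMMDD layout, the sketch below parses two made-up rows with the standard library; the rows are placeholders, not actual records from the dataset. With pandas, the equivalent explicit parse would be `pd.to_datetime(column, format="%Y%m%d")`.

```python
import csv
import io
from datetime import datetime

# Two made-up rows in the same layout as the dataset (publish_date, headline_text)
sample_csv = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,example headline one\n"
    "20211231,example headline two\n"
)

rows = []
for record in csv.DictReader(sample_csv):
    # %Y%m%d matches an 8-digit date such as 20030219
    published = datetime.strptime(record["publish_date"], "%Y%m%d").date()
    rows.append((published, record["headline_text"]))

span_days = (rows[-1][0] - rows[0][0]).days
print(rows[0][0].isoformat(), rows[-1][0].isoformat(), span_days)
```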

Project Plan

The project steps for applying topic modeling to the news headlines dataset are as follows:

1. Exploratory Data Analysis: The first step is analyzing the data to understand the distribution of headlines over time, the frequency of different words and phrases, and other patterns in the data. You can also visualize the data using charts and graphs to gain further insights.

2. Data Pre-processing: The next step is cleaning and preprocessing the text to remove stop words, punctuation, and so on. It also involves tokenization, stemming, and lemmatization to standardize the text data and make it suitable for analysis.

3. Topic Modeling: The core of the project is applying techniques such as LDA to identify the main topics and themes in the news headlines dataset. This requires selecting appropriate parameters for the topic modeling algorithm, for example, the number of topics, the size of the vocabulary, and the similarity measure.

4. Topic Interpretation: After identifying the main topics, the next step is interpreting the topics and assigning human-readable labels to them. This involves analyzing the top words and phrases associated with each topic and identifying the main themes and trends.

5. Evaluation: The final step involves evaluating the performance of the topic modeling algorithm using metrics such as coherence score and perplexity, identifying the limitations and challenges of the topic modeling approach, and proposing possible solutions.

Steps for the Project

First, import the necessary libraries.

import numpy as np
import pandas as pd
from IPython.display import display
from tqdm import tqdm
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import scipy.stats as stats

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
from wordcloud import WordCloud, STOPWORDS

from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook

%matplotlib inline

Load the CSV data into a DataFrame, parsing the dates into a usable format.

path = "/content/drive/MyDrive/topic_modeling/abcnews-date-text.csv"  # path of your dataset
df = pd.read_csv(path, parse_dates=[0], infer_datetime_format=True)

reindexed_data = df['headline_text']
reindexed_data.index = df['publish_date']

Take a glimpse of the loaded data through the first five rows.

df.head()


There are two columns, publish_date and headline_text, as mentioned above in the dataset description.

df.info()  # general description of the data

We can see that there are 1,244,184 rows in the dataset with no null values.

For convenience and feasibility, the LDA model is later fitted on a sample of 100,000 rows of the data.

Exploratory Data Analysis

Start by visualizing the top 15 words in the data, excluding stopwords.

Exploratory Knowledge Evaluation

Beginning with visualizing the highest 15 phrases within the information with out together with stopwords.

def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    returns a tuple of the top n words in a sample and their
    accompanying counts, given a CountVectorizer object and text sample
    '''
    vectorized_headlines = count_vectorizer.fit_transform(text_data.values)
    vectorized_total = np.sum(vectorized_headlines, axis=0)
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)
    word_values = np.flip(np.sort(vectorized_total)[0,:],1)
    word_vectors = np.zeros((n_top_words, vectorized_headlines.shape[1]))
    for i in range(n_top_words):
        word_vectors[i,word_indices[0,i]] = 1

    words = [word[0].encode('ascii').decode('utf-8') for
             word in count_vectorizer.inverse_transform(word_vectors)]
    return (words, word_values[0,:n_top_words].tolist()[0])

# CountVectorizer turns each headline into a vector of word counts (bag of words)
count_vectorizer = CountVectorizer(max_df=0.8, min_df=2, stop_words="english")
words, word_values = get_top_n_words(n_top_words=15,
                                     count_vectorizer=count_vectorizer,
                                     text_data=reindexed_data)

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(words)), word_values);
ax.set_xticks(range(len(words)));
ax.set_xticklabels(words, rotation='vertical');
ax.set_title('Top words in headlines dataset (excluding stop words)');
ax.set_ylabel('Number of occurrences');
[Figure: top words in the headlines dataset]
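The counting idea behind this chart can be sketched without scikit-learn. The version below is a simplified stand-in, not the CountVectorizer implementation: it uses a tiny made-up stop-word list and ignores the max_df and min_df filters, and the three headlines are invented for illustration.

```python
from collections import Counter

# A tiny illustrative stop-word list; the real English list is much longer
STOP_WORDS = {"the", "a", "to", "in", "of", "for", "on", "and", "over"}

def top_n_words(headlines, n):
    """Return the n most frequent non-stop-word tokens with their counts."""
    counts = Counter()
    for headline in headlines:
        counts.update(tok for tok in headline.lower().split() if tok not in STOP_WORDS)
    return counts.most_common(n)

# Made-up headlines for illustration
headlines = [
    "police charge man over fire",
    "man faces court over fire",
    "council plans new water scheme",
]
top = top_n_words(headlines, 3)
print(top)
```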

Next, perform part-of-speech tagging on the headlines.

import nltk
# TextBlob's tagger relies on NLTK corpora; if they are missing,
# run `python -m textblob.download_corpora` once.

tagged_headlines = [TextBlob(reindexed_data.iloc[i]).pos_tags for i in range(reindexed_data.shape[0])]
tagged_headlines[10]  # inspect the tagged headline at index 10
tagged_headlines_df = pd.DataFrame({'tags':tagged_headlines})

word_counts = []
pos_counts = {}

for headline in tagged_headlines_df['tags']:
    word_counts.append(len(headline))  # number of tagged words in this headline
    for tag in headline:
        if tag[1] in pos_counts:
            pos_counts[tag[1]] += 1
        else:
            pos_counts[tag[1]] = 1
print('Total number of words: ', np.sum(word_counts))
print('Mean number of words per headline: ', np.mean(word_counts))


Total number of words: 8166553

Mean number of words per headline: 6.563782366595294

Check whether the distribution of headline lengths is approximately normal.

y = stats.norm.pdf(np.linspace(0,14,50), np.mean(word_counts), np.std(word_counts))

fig, ax = plt.subplots(figsize=(8,4))
ax.hist(word_counts, bins=range(1,14), density=True);
ax.plot(np.linspace(0,14,50), y, 'r--', linewidth=1);
ax.set_title('Headline word lengths');
ax.set_xlabel('Number of words');
[Figure: histogram of headline word lengths with a fitted normal curve]
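Overlaying a normal curve is a visual check. A numeric complement, sketched here with the standard library only, is to compute the skewness and excess kurtosis of the word counts, both of which are close to zero for normally distributed data. The list of word counts below is made up and symmetric by construction; it is not the project's data.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3.0
    return skew, kurt

# Made-up headline lengths, symmetric around 7 words
demo_word_counts = [4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10]
skew, kurt = skewness_and_excess_kurtosis(demo_word_counts)
print(round(skew, 3), round(kurt, 3))
```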

Visualize the proportions of the five most frequent parts of speech.

# importing libraries
import matplotlib.pyplot as plt
import seaborn as sns
# declaring data
pos_sorted_types = sorted(pos_counts, key=pos_counts.__getitem__, reverse=True)
pos_sorted_counts = sorted(pos_counts.values(), reverse=True)
top_five = pos_sorted_types[:5]
data = pos_sorted_counts[:5]
# declaring exploding pie
explode = [0, 0.1, 0, 0, 0]
# define Seaborn color palette to use
palette_color = sns.color_palette('dark')
# plotting data on chart (autopct prints each slice's percentage)
plt.pie(data, labels=top_five, colors=palette_color, explode=explode,
        autopct='%.0f%%')
# displaying chart
plt.show()
[Figure: pie chart of the five most frequent part-of-speech tags]

Here we can see that about 50% of the words in the headlines are nouns, which sounds reasonable.


Data Pre-processing

First, sample 100,000 headlines and convert the sentences to lists of words.

import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes and drops punctuation;
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

text_sample = reindexed_data.sample(n=100000, random_state=0).values
data = text_sample.tolist()
data_words = list(sent_to_words(data))


Build the bigram and trigram models.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# a higher threshold yields fewer phrases
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
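To see what min_count and threshold control, the sketch below scores adjacent word pairs with a simplified version of the default formula given in the gensim Phrases documentation, (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). It is a standard-library illustration under stated assumptions, not gensim's implementation: the vocabulary size here counts single words only, and the documents and the threshold of 1.0 are made up for the toy corpus.

```python
from collections import Counter

def score_bigrams(token_lists, min_count):
    """Score adjacent word pairs with (count(ab) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(tok for toks in token_lists for tok in toks)
    bigrams = Counter(pair for toks in token_lists for pair in zip(toks, toks[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Made-up tokenized headlines
docs = [
    ["gold", "coast", "council", "meets"],
    ["gold", "coast", "storm", "warning"],
    ["council", "issues", "storm", "warning"],
    ["gold", "coast", "beach", "closed"],
]
scores = score_bigrams(docs, min_count=1)
threshold = 1.0  # illustrative value for this toy corpus
phrases = sorted(pair for pair, s in scores.items() if s > threshold)
print(phrases)
```

Pairs that occur together often relative to how often each word occurs on its own score highest, so raising the threshold keeps only the strongest collocations.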

In this step we remove stopwords, form bigrams and trigrams, and lemmatize the text.

import nltk
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess

# nltk.download('stopwords')  # run once if the stop-word corpus is not installed
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc
                          if token.pos_ in allowed_postags])
    return texts_out

# !python -m spacy download en_core_web_sm
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(text_sample)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams,
                                allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
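The dictionary and bag-of-words corpus built above can be pictured with a small standard-library sketch. The helper names build_dictionary and doc2bow are illustrative; the real gensim Dictionary may assign ids in a different order, and the two token lists are made up.

```python
def build_dictionary(texts):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = {}
    for tok in tokens:
        if tok in token2id:
            counts[token2id[tok]] = counts.get(token2id[tok], 0) + 1
    return sorted(counts.items())

# Made-up lemmatized headlines
demo_texts = [["police", "charge", "man"], ["man", "charge", "man"]]
token2id = build_dictionary(demo_texts)
demo_corpus = [doc2bow(tokens, token2id) for tokens in demo_texts]
print(token2id)
print(demo_corpus)
```

Each document becomes a sparse list of (word id, count) pairs, which is the only input the LDA model sees: word order is discarded.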

Topic Modeling

Apply the LDA model, assuming 15 themes in the whole dataset.

num_topics = 15

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
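For intuition about what fitting LDA involves, here is a minimal collapsed Gibbs sampler written with the standard library. This is a different inference technique from the online variational Bayes that gensim uses, so it is a teaching sketch rather than a description of LdaMulticore. The toy corpus, the hyperparameters alpha and beta, and the iteration count are all assumptions chosen for illustration.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                               # n(k)
    assignments = []
    # Give every token a random initial topic
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus over a made-up vocabulary of 6 word ids
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The returned count tables play the same role as a fitted model: normalizing a row of doc_topic gives a document's topic mixture, and normalizing a row of topic_word gives a topic's word distribution.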

Topic Interpretation

from pprint import pprint

# Print the keywords in the 15 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

  '0.046*"new" + 0.034*"fire" + 0.020*"year" + 0.018*"ban" + 0.016*"open" + '
  '0.014*"set" + 0.011*"consider" + 0.009*"security" + 0.009*"name" + '
  '0.021*"urge" + 0.020*"attack" + 0.016*"government" + 0.014*"lead" + '
  '0.014*"driver" + 0.013*"public" + 0.011*"want" + 0.010*"rise" + '
  '0.010*"student" + 0.010*"funding"'),
  '0.019*"day" + 0.015*"flood" + 0.013*"go" + 0.013*"work" + 0.011*"fine" + '
  '0.010*"launch" + 0.009*"union" + 0.009*"final" + 0.007*"run" + '
  '0.023*"australian" + 0.023*"crash" + 0.016*"health" + 0.016*"arrest" + '
  '0.013*"fight" + 0.013*"community" + 0.013*"job" + 0.013*"indigenous" + '
  '0.012*"victim" + 0.012*"support"'),
  '0.024*"face" + 0.022*"nsw" + 0.018*"council" + 0.018*"seek" + 0.017*"talk" '
  '+ 0.016*"home" + 0.012*"price" + 0.011*"bushfire" + 0.010*"high" + '
  '0.068*"police" + 0.019*"car" + 0.015*"accuse" + 0.014*"change" + '
  '0.013*"road" + 0.010*"strike" + 0.008*"safety" + 0.008*"federal" + '
  '0.008*"keep" + 0.007*"problem"'),
  '0.042*"call" + 0.029*"win" + 0.015*"first" + 0.013*"show" + 0.013*"time" + '
  '0.012*"trial" + 0.012*"cut" + 0.009*"review" + 0.009*"top" + 0.009*"look"'),
  '0.027*"take" + 0.021*"make" + 0.014*"farmer" + 0.014*"probe" + '
  '0.011*"target" + 0.011*"rule" + 0.008*"season" + 0.008*"drought" + '
  '0.007*"confirm" + 0.006*"point"'),
  '0.047*"say" + 0.026*"water" + 0.021*"report" + 0.020*"fear" + 0.015*"test" '
  '+ 0.015*"power" + 0.014*"hold" + 0.013*"continue" + 0.013*"search" + '
  '0.024*"warn" + 0.020*"worker" + 0.014*"end" + 0.011*"industry" + '
  '0.011*"business" + 0.009*"speak" + 0.008*"stop" + 0.008*"regional" + '
  '0.007*"turn" + 0.007*"park"'),
  '0.050*"man" + 0.035*"charge" + 0.017*"jail" + 0.016*"murder" + '
  '0.016*"woman" + 0.016*"miss" + 0.016*"get" + 0.014*"claim" + 0.014*"school" '
  '+ 0.011*"leave"'),
  '0.024*"find" + 0.015*"push" + 0.015*"drug" + 0.014*"govt" + 0.010*"labor" + '
  '0.008*"state" + 0.008*"investigate" + 0.008*"threaten" + 0.008*"mp" + '
  '0.028*"court" + 0.026*"interview" + 0.025*"kill" + 0.021*"death" + '
  '0.017*"die" + 0.015*"national" + 0.014*"hospital" + 0.010*"pay" + '
  '0.009*"announce" + 0.008*"rail"'),
  '0.020*"help" + 0.017*"boost" + 0.016*"child" + 0.016*"hit" + 0.016*"group" '
  '+ 0.013*"case" + 0.011*"fund" + 0.011*"market" + 0.011*"appeal" + '
  '0.036*"plan" + 0.021*"back" + 0.015*"service" + 0.012*"concern" + '
  '0.012*"move" + 0.011*"centre" + 0.010*"inquiry" + 0.010*"budget" + '
  '0.010*"law" + 0.009*"remain"')]
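When assigning labels, it can help to turn a printed topic string back into (word, weight) pairs. The sketch below does this with a regular expression on part of one topic from the output above; parse_topic is an illustrative helper, not a gensim function. If the fitted model is still in memory, gensim's `lda_model.show_topic(topic_id)` returns such pairs directly.

```python
import re

def parse_topic(topic_string):
    """Split a gensim-style topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First four terms of one topic from the printed output
topic = '0.050*"man" + 0.035*"charge" + 0.017*"jail" + 0.016*"murder"'
terms = parse_topic(topic)
label_hint = ", ".join(word for word, _ in terms[:3])
print(label_hint)
```

A reader might label this topic "crime and courts" on the strength of those top terms; the label itself is a human judgment, not something the model produces.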


Evaluation

1. Calculating the coherence score, which measures how semantically similar the top words in a topic are. With the c_v measure used here, scores typically fall between 0 and 1, and higher is better.

from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized,
                                     dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)


Coherence Score: 0.38355488160129025
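The c_v measure is fairly involved, so as an illustration of the underlying idea the sketch below implements the simpler UMass coherence instead: it averages log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words, where D counts the documents containing the given words. This is a different measure from the one used above, and the documents and word lists are made up.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i: w_j ranks higher
        w_j, w_i = top_words[j], top_words[i]
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Made-up lemmatized headlines
documents = [
    ["police", "charge", "man"],
    ["man", "charge", "murder"],
    ["flood", "water", "rain"],
    ["rain", "water", "warning"],
]
coherent = umass_coherence(["man", "charge"], documents)
mixed = umass_coherence(["man", "water"], documents)
print(coherent, mixed)
```

Words that co-occur in the same headlines produce a higher score than words drawn from unrelated headlines, which is the behaviour any coherence measure is meant to capture.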

2. Calculating the perplexity score, which measures how well the model's probability distribution predicts a sample (a lower value indicates a better model).

perplexity = lda_model.log_perplexity(corpus)
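Note that gensim's log_perplexity returns a per-word likelihood bound on a log scale (usually a negative number) rather than the perplexity itself; gensim reports the perplexity estimate as 2 raised to the negative of that bound. The sketch below shows that conversion and the general definition of perplexity. The bound of -8.0 and the uniform word probabilities are hypothetical values, not results from this project.

```python
import math

def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

def perplexity(word_probabilities):
    """Perplexity as the exponential of the average negative log-probability per word."""
    n = len(word_probabilities)
    return math.exp(-sum(math.log(p) for p in word_probabilities) / n)

example_bound = -8.0  # hypothetical value
print(perplexity_from_bound(example_bound))

# A model that spreads its probability evenly over 4 words is as uncertain as a 4-way guess
uniform = perplexity([0.25] * 8)
print(uniform)
```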




We can see that the coherence score is fairly low, but the model still identifies relevant themes reasonably well, and the score can likely be improved with hyperparameter tuning. The perplexity is also low, which the author attributes to the approximately normal distribution of the data seen in the exploratory data analysis section.


Conclusion

Topic modeling is an unsupervised learning technique for identifying themes in large sets of data. It is useful in various domains such as healthcare, finance, and marketing, where there is a huge amount of text-based data to analyze. In this project, we applied topic modeling to a dataset called "A Million Headlines", consisting of over one million news article headlines published by the ABC. The aim was to use the Latent Dirichlet Allocation (LDA) algorithm, a probabilistic generative model, to identify the main topics in the dataset.

The project plan involves several steps: exploratory data analysis to understand the data distribution; preprocessing the text by removing stop words, punctuation, and so on; and applying techniques like tokenization, stemming, and lemmatization. The core of the project is topic modeling, using LDA to identify the primary topics and themes within the news headlines. We analyze the associated words and phrases to interpret the topics and assign human-readable labels to them. The evaluation of the topic model uses metrics such as coherence score and perplexity, while also taking the limitations of the approach into account.

Key Takeaways

  • Topic modeling is an effective way of discovering broad themes in data with machine learning (ML), without labels.
  • It has a wide range of applications, from healthcare to recommender systems.
  • LDA is one effective way of implementing topic modeling.
  • Coherence score and perplexity are useful evaluation metrics for checking the performance of ML-based topic models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
