NLP (Natural Language Processing) can help us understand large quantities of text data. Instead of going through an enormous number of documents by hand and reading them manually, we can use these techniques to speed up our understanding and get to the main messages quickly. In this blogpost, we dive into the possibility of using pandas data frames and NLP tools in Python to get an idea of what people wrote about when doing research on gender equality in Afghanistan using Elicit. These insights could help us understand what worked and what didn't work to advance gender equality over the last decades in a country that is considered one of the most difficult places to be a woman or girl (World Economic Forum, 2023).
- Gain proficiency in text analysis for text in CSV files.
- Acquire knowledge of how to do natural language processing in Python.
- Develop skills in effective data visualization for communication.
- Gain insights into how research on gender equality in Afghanistan evolved over time.
This article was published as a part of the Data Science Blogathon.
Using Elicit for Literature Reviews
To generate the underlying data, I use Elicit, an AI-powered tool for literature reviews (Elicit). I ask the tool to generate a list of papers related to the question: Why did gender equality fail in Afghanistan? I then download the resulting list of papers (I consider a random number of more than 150 papers) in CSV format. What does this data look like? Let's take a look!
Analyzing CSV Data from Elicit in Python
We'll first read in the CSV file as a pandas dataframe:
import pandas as pd

#Identify path and csv file
file_path = "./elicit.csv"

#Read in CSV file
df = pd.read_csv(file_path)

#Shape of CSV
df.shape
#Output: (168, 15)

#Show first rows of dataframe
df.head()
The df.head() command shows us the first rows of the resulting pandas dataframe. The dataframe consists of 15 columns and 168 rows, which we learn from the df.shape command. Let's first explore in which year most of these studies were published. To do so, we can take advantage of the column reporting each paper's year of publication. There are several tools to generate figures in Python, but let's rely on the seaborn and matplotlib libraries here. To analyze in which years papers were mostly published, we can take advantage of a so-called countplot, and also customize the axis labels and axis ticks to make it look nice:
Analyzing the Temporal Distribution of Published Papers
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Set figure size
plt.figure(figsize=(10,5))

#Produce a countplot
chart = sns.countplot(x=df["Year"], color="blue")

#Set labels
chart.set_xlabel('Year')
chart.set_ylabel('Number of published papers')

#Change size of xticks
# get label text
_, xlabels = plt.xticks()
# set the x-labels
chart.set_xticklabels(xlabels, size=5)

plt.show()
The data shows that the number of papers increased over time, probably also due to greater data availability and better possibilities to do research in Afghanistan after the Taliban ceded power in 2001.
Analyzing the Content of Papers
Number of Words Written
While this gives us a first insight into the research conducted on gender equality in Afghanistan, we are mostly interested in what researchers actually wrote about. To get an idea of the content of these papers, we can take advantage of the abstract, which Elicit kindly included for us in the CSV file the tool generated. To do this, we can follow standard procedures for text analysis, such as the one outlined by Jan Kirenz in one of his blogposts. We start by simply counting the number of words in each abstract using a lambda function:
#Split text of abstracts into a list of words and calculate the length of the list
df["Number of Words"] = df["Abstract"].apply(lambda n: len(n.split()))

#Print first rows
print(df[["Abstract", "Number of Words"]].head())

#Output:
#                                            Abstract  Number of Words
# 0  As a traditional society, Afghanistan has alwa...             122
# 1  The Afghanistan gender inequality index shows ...             203
# 2  Cultural and religious practices are critical ...             142
# 3  ABSTRACT Gender equity is often a neglected issu...           193
# 4  The collapse of the Taliban regime in the latt...             357

#Describe the column with the number of words
df["Number of Words"].describe()

#Output:
# count     168.000000
# mean      213.654762
# std       178.254746
# min        15.000000
# 25%       126.000000
# 50%       168.000000
# 75%       230.000000
# max      1541.000000
Great. Most of the abstracts seem to be rich in words. They have on average 213.7 words. The shortest abstract consists of only 15 words, however, while the longest has 1,541 words.
What Do Researchers Write About?
Now that we know that most abstracts are rich in information, let's ask what they mostly write about. We can do so by creating a frequency distribution for each word. However, we are not interested in certain words, such as stopwords. Consequently, we need to do some text processing:
# First, transform all to lower case
df['Abstract_lower'] = df['Abstract'].astype(str).str.lower()
df.head(3)

# Let's tokenize the column
from nltk.tokenize import RegexpTokenizer
regexp = RegexpTokenizer('\w+')
df['text_token'] = df['Abstract_lower'].apply(regexp.tokenize)

#Show the first rows of the new dataset
df.head(3)

# Remove stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Make a list of English stopwords
stopwords = nltk.corpus.stopwords.words("english")

# Extend the list with your own custom stopwords
my_stopwords = ['https']
stopwords.extend(my_stopwords)

# Remove stopwords with a lambda function
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stopwords])

#Show the first rows of the dataframe
df.head(3)

# Remove short words (words with two letters or fewer)
df['text_string'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))

#Show the first rows of the dataframe
df[['Abstract_lower', 'text_token', 'text_string']].head()
What we do here is first transform all words into lower case and then tokenize them via natural language processing tools. Word tokenization is an important step in natural language processing and means splitting text down into individual words (tokens). We use the RegexpTokenizer and tokenize the text of our abstracts based on alphanumeric characters (denoted via '\w+'). We store the resulting tokens in the column text_token. We then remove stopwords from this list of tokens by using the stopword dictionary of the Python NLTK (Natural Language Toolkit) library, and delete words with two letters or fewer. This type of text processing helps us focus our analysis on more meaningful words.
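As a minimal illustration of these steps on a single sentence, here is a sketch that uses Python's built-in re module in place of NLTK's RegexpTokenizer (the tiny stopword list below is hand-picked for illustration only; the real NLTK English stopword list is far longer):

```python
import re

# Hand-picked mini stopword list, for illustration only
stopwords = {"the", "in", "of", "a", "is", "and", "on"}

sentence = "The role of Education in Afghanistan is a key topic."

# Lowercase, then tokenize on alphanumeric runs,
# which is what RegexpTokenizer('\w+') does
tokens = re.findall(r"\w+", sentence.lower())

# Remove stopwords and words with two letters or fewer
cleaned = [t for t in tokens if t not in stopwords and len(t) > 2]

print(cleaned)  # ['role', 'education', 'afghanistan', 'key', 'topic']
```

Note that punctuation disappears automatically because '\w+' only matches runs of letters, digits, and underscores.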
Generate a Word Cloud
To visually analyze the resulting list of words, we join the processed text into a single string and then generate a word cloud:
from wordcloud import WordCloud

# Create a single string of all words
all_words = " ".join([word for word in df['text_string']])

# Word Cloud
wordcloud = WordCloud(width=600, height=400, random_state=2, max_font_size=100).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');
The word cloud shows that the words mentioned most often are those that also form part of our search query: afghanistan, gender, gender equality. However, some close substitutes also appear among the most-mentioned words: women and men. These words per se are not very informative, but some others are: within the research on gender equality in Afghanistan, researchers seem to be very concerned with education, human rights, society, and the state. Surprisingly, Pakistan also forms part of the list. This could mean that the results generated for the search query are imprecise and also include research on gender equality in Pakistan, although we didn't ask for it. Alternatively, it could mean that the gender equality of Afghan women is also an important research topic in Pakistan, perhaps because many Afghans have settled in Pakistan due to the difficult situation in their home country.
Analyzing the Sentiment of Authors
Ideally, research would be neutral and free of emotions or opinions. However, it's within our human nature to have opinions and sentiments. To investigate to what extent researchers reflect their own sentiments in what they write, we can do a sentiment analysis. Sentiment analyses are methods to determine whether a piece of text is positive, neutral, or negative. In our example, we will use the VADER sentiment analysis tool. VADER stands for Valence Aware Dictionary and sEntiment Reasoner and is a lexicon- and rule-based sentiment analysis tool.
The VADER sentiment analysis tool works with a pre-built sentiment lexicon that consists of a large number of words with associated sentiments. It also considers grammatical rules to detect the sentiment polarity (positive, neutral, or negative) of short texts. The tool produces a sentiment score (also called the compound score) based on the sentiment of each word and the grammatical rules in the text. This score ranges from -1 to 1. Values above zero are positive and values below zero are negative. Since the tool relies on a pre-built sentiment lexicon, it doesn't require complex machine learning models or extensive training.
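To make the mechanics concrete, here is a heavily simplified, hypothetical lexicon-based scorer. The four-word lexicon and its weights are invented for illustration; only the normalization step (dividing the summed valence s by sqrt(s² + alpha), with alpha = 15) mirrors the formula VADER itself uses to map the sum into (-1, 1). Real VADER additionally applies rules for negation, intensifiers, punctuation, and capitalization, which this sketch omits:

```python
import math

# Invented mini-lexicon: word -> valence score (illustration only)
LEXICON = {"progress": 1.8, "rights": 1.2, "failed": -1.9, "violence": -2.9}

def compound_score(text, alpha=15):
    """Sum the valence of known words, then normalize into (-1, 1)."""
    s = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    return s / math.sqrt(s * s + alpha)

print(round(compound_score("progress on women rights"), 3))    # positive
print(round(compound_score("violence failed the reforms"), 3)) # negative
```

Because of the square-root normalization, the score approaches but never reaches -1 or 1, no matter how many charged words the text contains.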
# Download the required lexicon containing sentiment scores for words
nltk.download('vader_lexicon')

# Initialize the sentiment analyzer object
from nltk.sentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Polarity score method - assign results to polarity column
df['polarity'] = df['text_string'].apply(lambda x: analyzer.polarity_scores(x))
df.tail(3)

# Change data structure - concat original dataset with new columns
df = pd.concat([df, df['polarity'].apply(pd.Series)], axis=1)

#Show structure of new columns
df.head(3)

#Calculate mean value of compound score
df.compound.mean()
#Output: 0.20964702380952382
The code above generates a polarity score, here denoted as the compound score, that ranges from -1 to 1 for each abstract. The mean value is above zero, so most of the research has a positive connotation. How did this change over time? We can simply plot the sentiments by year:
# Lineplot
g = sns.lineplot(x='Year', y='compound', data=df)

#Adjust labels and title
g.set(title='Sentiment of Abstract')
g.set(xlabel='Year')
g.set(ylabel='Sentiment')

#Add a grey line to indicate zero (the neutral score) dividing positive and negative scores
g.axhline(0, ls='--', c='grey')
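By default, sns.lineplot aggregates multiple observations per x-value by their mean (and draws a confidence band around it), so the line it traces corresponds to a pandas groupby-mean. A sketch with invented toy data (the years and compound values below are made up for illustration):

```python
import pandas as pd

# Toy data: several papers per year with invented compound scores
toy = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002, 2002],
    "compound": [-0.2, 0.4, 0.1, 0.3, 0.2],
})

# The per-year mean is what the lineplot traces
yearly_mean = toy.groupby("Year")["compound"].mean()
print(yearly_mean)
# 2001    0.1
# 2002    0.2
```

This also explains why the plotted line can look smoother than the underlying abstracts: each point already averages all papers published in that year.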
Interesting. Most of the research was positive from 2003 onwards. Before that, sentiments fluctuated more strongly and were more negative on average, probably due to the difficult situation of women in Afghanistan.
Natural Language Processing can help us generate useful insights from large amounts of text. What we learned here from nearly 170 papers is that education and human rights were very important topics in the research papers gathered by Elicit, and that researchers started to write more positively about gender equality in Afghanistan from 2003 onwards, shortly after the Taliban ceded power in 2001.
- We can use Natural Language Processing tools to gain quick insights into the main topics studied in a certain research field.
- Word clouds are great visualization tools for understanding the most commonly used words in a text.
- Sentiment analysis shows that research might not be as neutral as expected.
I hope you found this article informative. Feel free to reach out to me on LinkedIn. Let's connect and work towards leveraging data for good!
Frequently Asked Questions
A. Elicit is an online platform designed to assist researchers in locating papers and research data using AI. By simply posing a research question, Elicit leverages its vast database of 175 million articles to uncover relevant answers. Moreover, it provides the functionality to use Elicit to analyze your own papers. Additionally, Elicit boasts a user-friendly interface, ensuring easy navigation and accessibility.
A. Natural Language Processing (NLP) is a specialized branch within the field of artificial intelligence (AI). Its primary objective is to enable machines to comprehend and analyze human language, allowing them to automate various repetitive tasks. Some common applications of NLP include machine translation, summarization, ticket classification, and spellchecking.
A. There are several approaches to calculating a sentiment score, but the most widely used method involves a dictionary of words categorized as negative, neutral, or positive. The text is then examined for the presence of negative and positive words, allowing for an estimation of the overall sentiment it conveys.
A. The compound score is derived by adding up the valence scores of the individual words found in the lexicon, adjusting them according to the applicable rules, and then normalizing the sum to range between -1 (extremely negative) and +1 (extremely positive). This metric is particularly useful when seeking a single, one-dimensional measure of sentiment.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.