Building an ML Model to Distinguish Whether Text Is Written by a Human or ChatGPT


Have you ever wondered how to tell whether a text was written by a human or an AI? Do you think you could tell the difference? ‘Can you tell me about the history of the Kohinoor (Koh-i-Noor) Diamond?’ Try to guess who generated this question: a human or an AI. With the advances in natural language generation, it’s becoming harder and harder to tell the difference. In this article, we’ll build a machine-learning model that predicts whether a human or ChatGPT generated a text. Sounds interesting, right? Let’s get started.


This article was published as a part of the Data Science Blogathon.



ChatGPT is a conversational AI or machine learning system. It listens, learns, and challenges. This system takes natural language text as input and provides responses. It’s just like talking with a person more intelligent than you who can answer almost all of your queries. Did you ever imagine that we would get something like this in the future? ChatGPT was developed by OpenAI and launched in November 2022. It uses an iterative process that considers the context of user input, draws on previously viewed conversations, and uses its collective understanding of those conversations to produce a new response. It uses human feedback to minimize harmful, untruthful, or biased answers.


Now let’s see some examples to understand how ChatGPT responds.

User: What do you think of pineapple on pizza?

ChatGPT: I think pineapple on pizza is an interesting combination of flavors. Some people may like it and some may not. It depends on personal taste.

User: I like cats more than dogs.

ChatGPT: I see. Why do you like cats more than dogs?

User: I think climate change is a serious problem.

ChatGPT: I agree with you. Climate change is a serious problem that affects everyone on this planet.

Let’s start building our model.

Problem Statement: We’ll build a machine-learning model that can classify text as human-written or ChatGPT-generated. It should be able to handle multiple types of text such as questions, essays, stories, jokes, code, etc.

Dataset Description

  • Here is the link to the dataset that I have used in this project: Dataset
  • This dataset contains 4 columns: text, paraphrases, category, and source.
  • These texts were taken from different sources.
  • Each of these texts was paraphrased to produce 5 more texts.
  • It has 419197 rows and 4 columns.

Let’s start by importing the basic libraries needed for machine learning, such as numpy and pandas.

import numpy as np
import pandas as pd

Use pandas to load the dataset and generate a data frame.


#(419197, 4)
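
The loading code itself does not appear above. Here is a minimal sketch of that step, assuming the CSV follows the four-column layout described earlier. The file name in the comment is hypothetical, and a two-row in-memory sample stands in for the real file so the snippet runs on its own:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV, with the four columns described above.
sample_csv = io.StringIO(
    'text,paraphrases,category,source\n'
    '"What is TF-IDF?","[\'Explain TF-IDF.\', \'Define TF-IDF.\']",question,demo\n'
    '"Cats are great.","[\'Cats are wonderful.\']",sentence,demo\n'
)

# With the real dataset this would be something like
# pd.read_csv("chatgpt_paraphrases.csv")  (hypothetical file name).
df = pd.read_csv(sample_csv)

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names
```

On the full dataset, df.shape is what reports the (419197, 4) shown above.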

Let’s view one text and its corresponding paraphrases to understand the dataset. In the output, we can see the text is a question asking for the story of the Kohinoor diamond. The same sentence was paraphrased in 5 different ways by ChatGPT.


Create a dictionary that maps each text to its category, that is, whether the text was generated by a human or by ChatGPT. The label depends on whether the entry is an original text or a paraphrase: an original text is human-generated, while a paraphrase is ChatGPT-generated. If you view this category dictionary, you’ll see each text mapped to either ‘human’ or ‘chatgpt’.

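category = {}
for i in range(len(df)):
    # The paraphrases column is a stringified list: strip the brackets, then split it.
    chatgpt=df.iloc[i]["paraphrases"][1:-1].split(', ')
    for j in chatgpt[:1]:
        category[j[1:-1]]='chatgpt'  # loop body reconstructed from the description above
    category[df.iloc[i]["text"]]='human'  # reconstructed: the original text is human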
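
Splitting the stringified list on ', ' breaks whenever a paraphrase itself contains a comma followed by a space. One possible alternative, assuming the paraphrases column holds a Python-style list literal (as the slicing above suggests), is to parse it with ast.literal_eval. The row below is made up for illustration:

```python
import ast

# Made-up row in the same shape as the dataset: an original text plus a
# stringified list of paraphrases.
row = {
    "text": "Can you tell me about the history of the Kohinoor diamond?",
    "paraphrases": "['What is the history of the Kohinoor diamond?', "
                   "'Tell me, briefly, the story of the Kohinoor diamond.']",
}

category = {}
paraphrases = ast.literal_eval(row["paraphrases"])  # parsed into a list of str
for p in paraphrases[:1]:        # keep only the first paraphrase, as above
    category[p] = "chatgpt"
category[row["text"]] = "human"  # the original text is the human example

print(category)
```

Note that the second paraphrase keeps its internal commas intact, which a plain split(', ') would not.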

Convert this category dictionary into a data frame using pandas. Create two columns, text and category, where category has two unique values: human and ChatGPT. Shuffle all the rows in the data frame so that the two classes are mixed and their order does not bias training. We’ll take the first 20000 rows after shuffling to keep things simple. Let’s view the data frame created.
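
The code for this step is not shown, so here is one way it could look. This is a sketch on a toy dictionary; the column names, the random_state value, and the small head() size are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-in for the `category` dictionary built earlier.
category = {
    "Tell me about the Kohinoor diamond.": "human",
    "Could you share the Kohinoor diamond's history?": "chatgpt",
    "I like cats more than dogs.": "human",
    "Cats appeal to me more than dogs do.": "chatgpt",
}

# One row per dictionary entry: the key becomes `text`, the value `category`.
df = pd.DataFrame(list(category.items()), columns=["text", "category"])

# sample(frac=1) returns every row in random order; random_state (an assumed
# value) makes the shuffle repeatable, and reset_index drops the old ordering.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# The article keeps the first 20000 rows; 3 is used here only because the
# toy frame is tiny.
df = df.head(3)
print(df)
```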


View the number of texts in each category. There are 10340 texts generated by humans and 9660 texts generated by ChatGPT.


Take two variables, X and y. X holds the text column of the data frame and y holds the category column. These are the input and output for the model: we’ll give X as input and the model will predict y as output.
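
Counting the classes and picking out the input and output columns can be sketched as follows. The frame below is a toy example, and the column names text and category are assumed to match the data frame built earlier:

```python
import pandas as pd

# Toy frame with the two columns described above.
df = pd.DataFrame({
    "text": ["a human question", "a paraphrase", "another human text"],
    "category": ["human", "chatgpt", "human"],
})

# Number of rows per class.
counts = df["category"].value_counts()
print(counts)

# X is the model input (raw text), y is the label to predict.
X = df["text"]
y = df["category"]
```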


Split the dataset into train and test datasets

Next, split the whole dataset using train_test_split into X_train, X_test, y_train, and y_test for further processing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
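
The call above draws a different random split on every run. As an optional variation (not what the article used), passing random_state makes the split repeatable and stratify=y keeps the class proportions the same in both parts. A small sketch on made-up data:

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples, five per class.
X = ["text %d" % i for i in range(10)]
y = ["human"] * 5 + ["chatgpt"] * 5

# random_state=42 is an arbitrary, assumed seed; stratify=y preserves the
# 50/50 class balance in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```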

Let’s vectorize the texts using a TF-IDF vectorizer. TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. In information retrieval and text mining, it is frequently used as a weighting factor. For this, import TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF from train only
X_test_tfidf = vectorizer.transform(X_test)  # reuse the fitted vocabulary on test
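
To see what the vectorizer is doing, here is a small sketch on made-up sentences. It shows that the vocabulary comes only from the training texts, that unseen test words are ignored, and that a word appearing in every document gets a lower IDF weight than a rare one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration.
train_texts = ["the cat sat", "the dog barked", "the cat purred"]
test_texts = ["the cat barked at a zebra"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_test_tfidf = vectorizer.transform(test_texts)        # reuses it unchanged

print(sorted(vectorizer.vocabulary_))          # words seen during fit
print(X_train_tfidf.shape, X_test_tfidf.shape) # same number of columns

# IDF weight per word: "the" occurs in every training text, "dog" in one.
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
idf_dog = vectorizer.idf_[vectorizer.vocabulary_["dog"]]
print(idf_the < idf_dog)
```

Fitting only on the training texts keeps information from the test set out of the features, which is why the test set goes through transform rather than fit_transform.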

Selecting the Best Classifier

Instead of picking one specific classifier and building it, let’s take a set of classifiers and calculate their accuracy score and f1 score to find the best classifier among them. Here we used logistic regression, support vector classifier, multinomial naive Bayes, decision tree classifier, KNN classifier, random forest, extra trees, AdaBoost, bagging classifier, and gradient boosting classifier.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
lg = LogisticRegression(penalty='l1',solver="liblinear")
sv = SVC(kernel="sigmoid",gamma=1.0)
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
knn = KNeighborsClassifier()
rfc = RandomForestClassifier(n_estimators=50,random_state=2)
etc = ExtraTreesClassifier(n_estimators=50,random_state=2)
abc = AdaBoostClassifier(n_estimators=50,random_state=2)
bg = BaggingClassifier(n_estimators=50,random_state=2)
gbc = GradientBoostingClassifier(n_estimators=50,random_state=2)

Now calculate the accuracy score and f1 score of these classifiers.

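from sklearn import metrics

def prediction(model,X_train,X_test,y_train,y_test):
    model.fit(X_train,y_train)  # the model must be trained before it can predict
    pr = model.predict(X_test)
    acc_score = metrics.accuracy_score(y_test,pr)
    f1= metrics.f1_score(y_test,pr,average="binary", pos_label="chatgpt")
    return acc_score,f1

acc_score = {}
f1_score = {}
# Only the 'ETC' key survives in the source; the other keys are assumed names.
clfs= {
    'LR':lg, 'SVM':sv, 'MNB':mnb, 'DTC':dtc, 'KNN':knn,
    'RFC':rfc, 'ETC':etc, 'ABC':abc, 'BG':bg, 'GBC':gbc,
}
for name,clf in clfs.items():
    acc_score[name],f1_score[name]= prediction(clf,X_train_tfidf,X_test_tfidf,y_train,y_test)
    #View these scores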

If we compare these scores, we got the highest for the ExtraTrees Classifier (ETC). So we’ll use this extra trees classifier to train the model and make predictions. For that, train the model using the fit method.

Extra Trees Classifier: An extra trees classifier is an ensemble learning method that uses multiple randomized decision trees to improve predictive accuracy and control over-fitting. It is also called an Extremely Randomized Trees Classifier. It differs from classic decision trees in the way the trees are built. Ordinary decision trees have drawbacks such as overfitting and high variance, so different ensemble learning methods were introduced to avoid them. The extra trees classifier is one of them. Instead of using bootstrap samples, the Extra Trees Classifier by default uses the whole original dataset for each tree, but with a random sampling of features for each split. Also, instead of searching for the optimal split point for each feature, it randomly selects a split point from a uniform distribution within the feature’s range. This makes it a powerful machine learning technique that can handle complex classification problems, often with good accuracy and reduced overfitting.
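
The two differences described above can be checked directly in scikit-learn. This is a small sketch on a synthetic dataset (not the article's data) that compares an extra trees ensemble with a random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

et = ExtraTreesClassifier(n_estimators=50, random_state=2).fit(X, y)
rf = RandomForestClassifier(n_estimators=50, random_state=2).fit(X, y)

# Difference 1: by default extra trees train each tree on the whole dataset,
# while a random forest draws a bootstrap sample per tree.
print(et.bootstrap, rf.bootstrap)

# Difference 2: the individual trees pick split thresholds at random
# ("random") instead of searching for the best one ("best").
print(et.estimators_[0].splitter, rf.estimators_[0].splitter)
```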

etc.fit(X_train_tfidf,y_train)

Confusion Matrix

Predict on the test dataset and get a confusion matrix. A confusion matrix is a table that summarizes the performance of the algorithm. For our binary classification problem, it gives four values: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). We will also plot this confusion matrix. For that, import matplotlib and seaborn for visualizations.

from sklearn.metrics import confusion_matrix
y_pred = etc.predict(X_test_tfidf)
cm = confusion_matrix(y_test, y_pred)

import seaborn as sn
import matplotlib.pyplot as plt
# Named cm_df so it does not shadow the imported confusion_matrix function.
cm_df = pd.DataFrame(cm, index = ["ChatGPT","Human"],
                  columns = ["ChatGPT","Human"])
plt.figure(figsize = (20,14))
sn.heatmap(cm_df, annot=True,cmap="YlGnBu", fmt="g")
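
To make the four values concrete, here is a small sketch on made-up labels. Rows of the matrix are the true classes and columns are the predicted classes; passing labels explicitly fixes their order, and treating "chatgpt" as the positive class is an assumption that matches the pos_label used earlier:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for five texts.
y_true = ["chatgpt", "chatgpt", "human", "human", "human"]
y_hat = ["chatgpt", "human", "human", "human", "chatgpt"]

labels = ["chatgpt", "human"]  # row and column order of the matrix
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# With "chatgpt" as the positive class:
tp, fn = cm[0]  # true chatgpt: predicted chatgpt / predicted human
fp, tn = cm[1]  # true human:   predicted chatgpt / predicted human
print(tp, fn, fp, tn)
```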

Now let’s see the results predicted by the model. In the previous step, we predicted and stored the results in y_pred. This y_pred is an array, so convert it into a data frame using pandas, rename the column from 0 to ‘class predicted’, and view it.

y_preddf = pd.DataFrame(y_pred)  # wrap the prediction array in a data frame
y_preddf.rename(columns={0:'class predicted'},inplace=True)


To compare actual results and predicted results, let’s join the data frames and view the joined data frame. Here I viewed rows 20 to 30. In the results, you can see almost all are predicted correctly while some are incorrectly predicted. For the 23rd text, our model predicted that it was generated by a human but, in fact, it was generated by ChatGPT.

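x_testdf = pd.DataFrame(X_test)  # assumed: test texts wrapped in a data frame
y_testdf = pd.DataFrame(y_test)  # assumed: true labels wrapped in a data frame
# Give all three frames a shared positional id so they can be joined on it.
x_testdf['id'] = range(1, len(x_testdf) + 1)
y_testdf['id'] = range(1, len(y_testdf) + 1)
y_preddf['id'] = range(1, len(y_preddf) + 1)
join1=y_testdf.merge(x_testdf, how = 'inner' ,indicator=False)
join_df=join1.merge(y_preddf, how = 'inner' ,indicator=False)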
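
A shorter route to the same comparison, shown here only as an alternative sketch on made-up data, is to build one data frame from the three sequences directly. The column names follow the ones used above, and the extra correct column is an addition for illustration:

```python
import pandas as pd

# Made-up test texts, true labels and predictions.
X_test = ["first test text", "second test text", "third test text"]
y_test = ["human", "chatgpt", "human"]
y_pred = ["human", "human", "human"]

# Plain lists pair up by position. If these were pandas Series with a
# shuffled index, wrap each one in list(...) first to get the same behavior.
compare_df = pd.DataFrame({
    "text": X_test,
    "category": y_test,
    "class predicted": y_pred,
})

# Flag the rows where the prediction matches the true label.
compare_df["correct"] = compare_df["category"] == compare_df["class predicted"]
print(compare_df)
print(compare_df["correct"].sum(), "of", len(compare_df), "correct")
```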

Finding Accuracy

These incorrectly classified texts are the false positives and false negatives. Let’s see the accuracy of our model. We got an accuracy of 78.7%. This could likely be improved by training on more rows and by tuning the classifier, for example the number of trees; an extra trees classifier is not trained in epochs. We used only 20000 rows and 50 trees for our experiment.
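
The accuracy code is not shown, so here is a sketch of how the score can be computed with scikit-learn. The labels are made up for illustration and do not reproduce the 78.7% figure:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels for five texts.
y_true = ["chatgpt", "chatgpt", "human", "human", "human"]
y_hat = ["chatgpt", "human", "human", "human", "chatgpt"]

# Accuracy is the share of correct predictions: 3 of 5 here.
acc = accuracy_score(y_true, y_hat)

# F1 for the "chatgpt" class combines its precision (1/2) and recall (1/2).
f1 = f1_score(y_true, y_hat, average="binary", pos_label="chatgpt")

print(acc, f1)
```

In the article's setting, the same calls would take y_test and y_pred as arguments.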



Now it’s time to test our model with our own texts instead of texts from the dataset. I have given four different texts. The first two texts were predicted as human, even though I mentioned ChatGPT in the second example. This gives a sense of how our model is predicting. In the third example, I used a phrase like ‘a step by step guide’, and in the last example I provided a text recommending some websites, which resembles the way ChatGPT writes; our model predicted both of them as ChatGPT.

input_text=['Hello!! This is Amrutha']
vect_input=vectorizer.transform(input_text)  # reuse the fitted TF-IDF vectorizer
etc.predict(vect_input)

#array(['human'], dtype=object)

input_text=['Hello!! This is chatgpt']
vect_input=vectorizer.transform(input_text)
etc.predict(vect_input)

#array(['human'], dtype=object)

input_text=['Can you please provide a step by step guide for writing articles on analytics vidhya']
vect_input=vectorizer.transform(input_text)
etc.predict(vect_input)

#array(['chatgpt'], dtype=object)

input_text=['These are the websites for watching movies that I can recommend you']
vect_input=vectorizer.transform(input_text)
etc.predict(vect_input)

#array(['chatgpt'], dtype=object)
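
Vectorizing and predicting by hand for every new text is easy to get wrong. One way to bundle the two steps, sketched here on a handful of made-up training texts, is a scikit-learn pipeline. The classify helper and the training sentences are hypothetical and exist only for this example:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data; the real model is trained on the 20000-row sample.
train_texts = [
    "hey, what's up with the match today",
    "lol i forgot my keys again",
    "Here is a step by step guide to get started.",
    "I can recommend the following websites for you.",
]
train_labels = ["human", "human", "chatgpt", "chatgpt"]

# The pipeline applies TF-IDF and then the classifier, in that order,
# both when fitting and when predicting.
detector = make_pipeline(
    TfidfVectorizer(),
    ExtraTreesClassifier(n_estimators=50, random_state=2),
)
detector.fit(train_texts, train_labels)

def classify(texts):
    """Return the predicted label for each raw text (hypothetical helper)."""
    return list(detector.predict(texts))

print(classify(["Hello!! This is Amrutha"]))
```

With so little training data the predictions themselves mean nothing; the point is only that raw strings go in and labels come out.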


Nowadays, the word we hear most often is ChatGPT. It is being used everywhere, by students, engineers, writers, teachers, and others, because of its potential for producing great answers to their queries. These ChatGPT-generated texts sound just like humans, and we humans can’t tell the difference just by reading them. So, to distinguish the texts generated by ChatGPT, we have built a model that tells whether a text is ‘human’ or ‘ChatGPT’.

  • ChatGPT is an AI language model that can generate human-like text on various topics and tasks.
  • We have used a dataset with a good number of both human and ChatGPT-generated texts to train our model.
  • We have chosen the best classifier among those compared, the extra trees classifier, to train our model.
  • We need to be careful about the ethical and social implications of ChatGPT and the possibility of false positives and false negatives in our machine learning model.

Hope you found this article useful. Connect with me on LinkedIn.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
