
How to Build a PDF Chatbot Without Langchain?

Introduction

Since the launch of ChatGPT, the pace of progress in the AI space shows no signs of slowing down; new tools and technologies are developed every day. Sure, it's a great thing for businesses and the AI space in general, but as a programmer, do you need to learn all of them to build something? Well, the answer is no. A rather pragmatic approach would be to learn the things you actually need. There are a lot of tools and technologies that promise to make things easier, and to some extent they do. But at times, we don't need them at all. Using large frameworks for simple use cases only ends up making your code a bloated mess. So, in this article, we are going to explore this by building a CLI PDF chatbot without Langchain, and understand why we don't always need AI frameworks.

Studying Aims

  • Why you don't need AI frameworks like Langchain and Llama Index.
  • When you do need frameworks.
  • Learn about vector databases and indexing.
  • Build a CLI Q&A chatbot from scratch in Python.

This article was published as a part of the Data Science Blogathon.

Can You Do Without Langchain?

Over recent months, frameworks such as Langchain and Llama Index have experienced a remarkable surge in popularity, primarily because of how conveniently they let developers build LLM apps. But for a lot of use cases, these frameworks can be overkill. It's like bringing a bazooka to a gunfight.

They ship with things that may not be required for your project. Python is already notorious for being bloated. On top of that, adding dependencies you hardly need will only make your environment messier. One such use case is document querying. If your project doesn't involve an AI agent or other similarly complicated machinery, you can ditch Langchain and build the workflow from scratch, reducing unnecessary bloat. Besides this, frameworks like Langchain and Llama Index are under rapid development; any code refactoring on their end might break your build.

When Do You Need Langchain?

If you have a higher-order need, such as building an agent to automate complicated software, or a project that would take long engineering hours to build from scratch, it makes sense to use prebuilt solutions. Never reinvent the wheel, unless you need a better wheel. There are numerous other cases where using ready-made solutions with minor tweaks makes absolute sense.

Building a Q&A Chatbot

One of the most sought-after use cases of LLMs has been document question answering. And after OpenAI made their ChatGPT endpoints public, it has become much easier to build an interactive conversational bot over any text data source. In this article, we will build an LLM Q&A CLI app from scratch. So, how do we approach the problem? Before building it, let's understand what we need to do.

A typical workflow will involve:

  • Processing the provided PDF file to extract text.
  • We also need to be careful about the context window of the LLM, so we need to split the text into chunks.
  • To query relevant chunks of text, we need embeddings of those text chunks. For this, we need an embedding model. For this project, we will use the Hugging Face MiniLM-L6-v2 model; you can go with any model you like, such as OpenAI, Cohere, or Google PaLM.
  • For storing and retrieving embeddings, we will use a vector database such as Chroma. There are many different vector databases you can opt for, such as Qdrant, Weaviate, Milvus, and many more.
  • When a user sends a query, it gets converted to embeddings by the same model, and the chunks with meanings similar to the query are fetched.
  • The fetched chunks are concatenated with the query at the end and fed to the LLM via an API.
  • The answer from the model is returned to the user.

All of this requires a user-facing interface. For this article, we will build a simple command-line interface with Python's argparse.
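The bulleted workflow above can be sketched end to end in a few lines of plain Python. This is a toy illustration only: simple word overlap stands in for real embeddings, and all function names here are ours, not from any library.

```python
def chunk_text(text: str, size: int = 8) -> list:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(chunk: str, query: str) -> int:
    """Word-overlap score, standing in for embedding similarity."""
    return len(set(chunk.lower().split()) & set(query.lower().split()))

def retrieve(chunks: list, query: str, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

doc = "Paris is the capital of France. Berlin is the capital of Germany."
question = "What is the capital of France?"
chunks = chunk_text(doc)
best = retrieve(chunks, question, k=1)[0]
# The retrieved context plus the question is what gets fed to the LLM
prompt = f"{best} ques: {question}"
```

In the real app below, `chunk_text` becomes a sentence-aware chunker, `score`/`retrieve` are replaced by ChromaDB's embedding search, and the prompt is sent to the OpenAI API.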

Here’s a workflow diagram of our CLI chatbot:


Before getting into the coding part, let's understand a thing or two about vector databases and indexes.

What are Vector Databases and Indexes?

As the name suggests, vector databases store vectors or embeddings. So, why do we need vector databases? Building any AI application requires embeddings of real-world data, as machine learning models cannot directly process raw data such as text, images, or audio. When you are dealing with a large amount of such data that will be used repeatedly, it needs to be stored somewhere. So, why can't we use a traditional database for this? Well, you can use traditional databases for your search needs, but vector databases offer a significant advantage: they can perform vector similarity search in addition to lexical search.

In our case, whenever a user sends a query, the vector DB performs a vector similarity search over all the embeddings and fetches the K nearest neighbors. The search is very fast, as it employs an algorithm called HNSW.

HNSW stands for Hierarchical Navigable Small World. It is a graph-based algorithm and indexing method for Approximate Nearest Neighbor (ANN) search. ANN is a type of search that finds the k items most similar to a given item.

HNSW works by building a graph of the data points. The nodes in the graph represent the data points, and the edges represent the similarity between them. The graph is then traversed to find the k items most similar to the given item.

The HNSW algorithm is fast, reliable, and scalable. Most vector databases use HNSW as the default search algorithm.
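For intuition, here is the exact k-nearest-neighbor search that HNSW approximates, written as a brute-force cosine-similarity scan in plain Python. This is a sketch only; a real HNSW index avoids this linear scan by navigating its layered graph.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(vectors, query, k=2):
    """Exact k-NN: score every stored vector against the query (O(n)).
    HNSW returns (approximately) the same result in roughly O(log n)."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [name for name, _ in scored[:k]]

store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(knn(store, [1.0, 0.05, 0.0], k=2))  # doc_a and doc_b are closest
```

In a real vector DB, the stored vectors would be the 384-dimensional embeddings our MiniLM model produces for each text chunk.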

Now, we are all set to delve into the code.

Build the Project Environment

As with any Python project, start by creating a virtual environment. This keeps the development environment nice and tidy. Refer to this article for choosing the right Python environment for your project.

The project file structure is simple: we will have two Python files, one for defining the CLI and the other for processing, storing, and querying data. Also, create a .env file to store your OpenAI API key.

This is the requirements.txt file; install it before getting started.

# requirements.txt
openai
chromadb
PyPDF2
python-dotenv

Now, import the necessary classes and functions.

import os
import openai
import PyPDF2
import re
from chromadb import Client, Settings
from chromadb.utils import embedding_functions
from PyPDF2 import PdfReader
from typing import List, Dict
from dotenv import load_dotenv

Load the OpenAI API key from the .env file.

load_dotenv()
key = os.environ.get('OPENAI_API_KEY')
openai.api_key = key

Utility Functions for the Chatbot CLI

To store text embeddings and their metadata, we will create a collection with ChromaDB.

ef = embedding_functions.ONNXMiniLM_L6_V2()
client = Client(settings=Settings(persist_directory="./", is_persistent=True))
collection_ = client.get_or_create_collection(name="test", embedding_function=ef)

As the embedding model, we are using MiniLM-L6-v2 with the ONNX runtime. It is small yet capable, and on top of that, open source.

Next, we will define a function to verify whether a provided file path belongs to a valid PDF file.

def verify_pdf_path(file_path):
    try:
        # Attempt to open the PDF file in binary read mode
        with open(file_path, "rb") as pdf_file:
            # Create a PDF reader object using PyPDF2
            pdf_reader = PyPDF2.PdfReader(pdf_file)

            # Check if the PDF has at least one page
            if len(pdf_reader.pages) > 0:
                # If it has pages, the PDF is not empty, so do nothing (pass)
                pass
            else:
                # If it has no pages, raise an exception indicating that the PDF is empty
                raise ValueError("PDF file is empty")
    except PyPDF2.errors.PdfReadError:
        # Handle the case where the PDF cannot be read (e.g., it is corrupted or not a valid PDF)
        raise PyPDF2.errors.PdfReadError("Invalid PDF file")
    except FileNotFoundError:
        # Handle the case where the specified file does not exist
        raise FileNotFoundError("File not found, check the file path again")
    except Exception as e:
        # Handle other unexpected exceptions and display the error message
        raise Exception(f"Error: {e}")

One of the major parts of a PDF Q&A app is getting the text chunks. So, we need to define a function that gets us the required chunks of text.

def get_text_chunks(text: str, word_limit: int) -> List[str]:
    """
    Divide a text into chunks with a specified word limit
    while ensuring each chunk contains complete sentences.

    Parameters:
        text (str): The entire text to be divided into chunks.
        word_limit (int): The desired word limit for each chunk.

    Returns:
        List[str]: A list containing the chunks of text with
        the specified word limit and complete sentences.
    """
    # Split the text into sentences, ignoring periods inside abbreviations
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    chunks = []
    current_chunk = []

    for sentence in sentences:
        words = sentence.split()
        # Add the sentence to the current chunk if it stays within the word limit
        if len(current_chunk) + len(words) <= word_limit:
            current_chunk.extend(words)
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = words

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

We have defined a basic chunking algorithm here. The idea is to let users specify how many words they want in a single text chunk, while ensuring every chunk ends with a complete sentence, even if that breaches the limit. This is a simple algorithm; you may create something of your own.
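To see the behavior concretely, here is a standalone toy version of the same greedy, sentence-preserving idea. It splits on '. ' instead of the regex above so it runs without a sample PDF, and the function name is ours, not part of the app.

```python
def simple_chunks(text: str, word_limit: int) -> list:
    """Greedy sentence-preserving chunker: keep adding whole sentences
    to the current chunk until the word limit would be exceeded."""
    # Naive sentence split; re-attach the trailing period where needed
    sentences = [s if s.endswith(".") else s + "." for s in text.split(". ")]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        if len(current) + len(words) <= word_limit:
            current.extend(words)
        else:
            chunks.append(" ".join(current))
            current = words
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five. Six seven eight nine."
print(simple_chunks(text, word_limit=5))
# → ['One two three. Four five.', 'Six seven eight nine.']
```

Notice the first chunk packs two sentences (5 words, at the limit), while the third sentence starts a new chunk because adding it would overflow the limit.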

Create a Dictionary

Now, we need a function to load text from PDFs and create a dictionary that keeps track of the text chunks belonging to each page.

def load_pdf(file: str, word: int) -> Dict[int, List[str]]:
    # Create a PdfReader object from the specified PDF file
    reader = PdfReader(file)

    # Initialize an empty dictionary to store the extracted text chunks
    documents = {}

    # Iterate through each page in the PDF
    for page_no in range(len(reader.pages)):
        # Get the current page
        page = reader.pages[page_no]

        # Extract text from the current page
        texts = page.extract_text()

        # Use the get_text_chunks function to split the extracted text into chunks of 'word' length
        text_chunks = get_text_chunks(texts, word)

        # Store the text chunks in the documents dictionary with the page number as the key
        documents[page_no] = text_chunks

    # Return the dictionary containing page numbers as keys and text chunks as values
    return documents

ChromaDB Assortment

Now, we need to store the data in a ChromaDB collection.

def add_text_to_collection(file: str, word: int = 200) -> str:
    # Load the PDF file and extract text chunks
    docs = load_pdf(file, word)

    # Initialize empty lists to store data
    docs_strings = []  # List to store text chunks
    ids = []  # List to store unique IDs
    metadatas = []  # List to store metadata for each text chunk
    id = 0  # Initialize ID

    # Iterate through each page and text chunk in the loaded PDF
    for page_no in docs.keys():
        for doc in docs[page_no]:
            # Append the text chunk to the docs_strings list
            docs_strings.append(doc)

            # Append metadata for the text chunk, including the page number
            metadatas.append({'page_no': page_no})

            # Append a unique ID for the text chunk
            ids.append(id)

            # Increment the ID
            id += 1

    # Add the collected data to the collection
    collection_.add(
        ids=[str(id) for id in ids],  # Convert IDs to strings
        documents=docs_strings,  # Text chunks
        metadatas=metadatas,  # Metadata
    )

    # Return a success message
    return "PDF embeddings successfully added to collection"

In ChromaDB, the metadata field stores additional information about the documents. In this case, the page number of a text chunk is its metadata. After extracting the metadata from each text chunk, we can store it in the collection we created earlier. This step is required only when the user provides a valid file path to a PDF file.

We will now define a function that processes user queries to fetch data from the database.

def query_collection(texts: str, n: int) -> List[str]:
    result = collection_.query(
                  query_texts=texts,
                  n_results=n,
                 )
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    resulting_strings = []
    for page_no, text_chunk in zip(metadatas, documents):
        resulting_strings.append(f"Page {page_no['page_no']}: {text_chunk}")
    return resulting_strings

The function above uses the query method to retrieve the n most relevant pieces of data from the database. We then create a formatted string that begins with the page number of the text chunk.

Now, the only major thing remaining is to feed the LLM with the information.

def get_response(queried_texts: List[str],) -> str:
    global messages
    messages = [
                {"role": "system", "content": "You are a helpful assistant. "
                 "You will always answer the question asked in 'ques:' and "
                 "will quote the page number while answering any questions. "
                 "It is always at the start of the prompt in the format 'page n'."},
                {"role": "user", "content": ''.join(queried_texts)}
          ]

    response = openai.ChatCompletion.create(
                            model="gpt-3.5-turbo",
                            messages=messages,
                            temperature=0.2,
                     )
    response_msg = response.choices[0].message.content
    messages = messages + [{"role": 'assistant', 'content': response_msg}]
    return response_msg

The global variable messages stores the context of the conversation. We have defined a system message so that the LLM quotes the page number it gets the answer from.

Finally, the last utility function combines the retrieved text chunks with the user query, feeds it into the get_response() function, and returns the resulting answer string.

def get_answer(query: str, n: int):
    queried_texts = query_collection(texts=query, n=n)
    # Join the retrieved chunks and append the user question at the end
    queried_string = " ".join(queried_texts) + f" ques: {query}"
    answer = get_response(queried_texts=queried_string)
    return answer

We are done with our utility functions. Let's move on to building the CLI.

Chatbot CLI

To use the chatbot on demand, we need an interface. This could be a web app, a mobile app, or a CLI. In this article, we will build a CLI for our chatbot. If you want to build a nice-looking demo web app instead, you can use tools like Gradio or Streamlit. Check out this article on building a chatbot for PDFs:

Construct a ChatGPT for PDFs with Langchain

To build the CLI, we will need the argparse library. Argparse is a powerful library that lets you create CLIs in Python. It has a simple and easy syntax for creating commands, sub-commands, and flags. So, before delving in, here is a small primer on argparse.

Python Argparse

The argparse module was first introduced in Python 3.2, providing a quick and convenient way to build CLI applications with Python without relying on third-party installations. It allows us to parse command-line arguments, create sub-commands, and much more, making it a reliable tool for building CLIs.

Here's a small example of argparse in action:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--filename", help="The name of the file to read.")
parser.add_argument("-n", "--number", help="The number of lines to print.", type=int)
parser.add_argument("-s", "--sort", help="Sort the lines in the file.", action="store_true")

args = parser.parse_args()

with open(args.filename) as f:
    lines = f.readlines()

if args.sort:
    lines.sort()

for line in lines:
    print(line)

The add_argument method lets us define arguments with checks and balances. We can define the type of an argument or the action it should take when a flag is provided, along with a help parameter that explains the use of a particular argument. The --help flag will display all the flags and their use cases.
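You can try this behavior without touching the shell by passing an argument list directly to parse_args. A quick sketch:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-n", "--number", help="The number of lines to print.", type=int)
parser.add_argument("-s", "--sort", help="Sort the lines.", action="store_true")

# Simulate running `prog -n 3 -s` by passing the argv list explicitly
args = parser.parse_args(["-n", "3", "-s"])
print(args.number, args.sort)  # 3 True

# With no flags, defaults apply: number is None, sort is False
defaults = parser.parse_args([])
```

Note how `type=int` converts the string "3" into an integer, and `action="store_true"` makes `--sort` a boolean flag that needs no value.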

On a similar note, we will define the arguments for the chatbot CLI.

Building the CLI

Import argparse and the necessary utility functions.

import argparse
from utils import (
    add_text_to_collection, 
    get_answer, 
    verify_pdf_path, 
    clear_coll
  )

Define the argument parser and add arguments.

def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")
    
    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    
    parser.add_argument(
        "-c", "--count",
        default=200, 
        type=int, 
        help="Optional integer value for the number of words in a single chunk"
    )
    
    parser.add_argument(
        "-q", "--question", 
        type=str,
        help="Ask a question"
    )
    
    parser.add_argument(
        "-cl", "--clear", 
        type=bool, 
        help="Clear existing collection data"
    )
    
    parser.add_argument(
        "-n", "--number", 
        type=int, 
        default=1, 
        help="Number of results to be fetched from the collection"
    )

    # Parse the command-line arguments
    args = parser.parse_args()

We have defined a few arguments, such as --file, --count, --question, etc.

  • --file: The string file path of a PDF.
  • --count: An optional parameter that defines the number of words in a text chunk.
  • --question: Takes a user query as a parameter.
  • --number: Number of similar chunks to be fetched.
  • --clear: Clears the current ChromaDB collection.

Now, we process the arguments:

 if args.file is not None:
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

 if args.question is not None:
        n = args.number if args.number else 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

 if args.clear:
        clear_coll()
        print("Current collection cleared successfully")

Putting everything together:

import argparse
from utils import (
    add_text_to_collection, 
    get_answer, 
    verify_pdf_path, 
    clear_coll
)

def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")
    
    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    
    parser.add_argument(
        "-c", "--count",
        default=200, 
        type=int, 
        help="Optional integer value for the number of words in a single chunk"
    )
    
    parser.add_argument(
        "-q", "--question", 
        type=str,
        help="Ask a question"
    )
    
    parser.add_argument(
        "-cl", "--clear", 
        type=bool, 
        help="Clear existing collection data"
    )
    
    parser.add_argument(
        "-n", "--number", 
        type=int, 
        default=1, 
        help="Number of results to be fetched from the collection"
    )

    # Parse the command-line arguments
    args = parser.parse_args()
    
    # Check if the '--file' argument is provided
    if args.file is not None:
        # Verify the PDF file path and add its text to the collection
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

    # Check if the '--question' argument is provided
    if args.question is not None:
        n = args.number if args.number else 1  # Set 'n' to the specified number or default to 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

    # Check if the '--clear' argument is provided
    if args.clear:
        clear_coll()
        print("Current collection cleared successfully")

if __name__ == "__main__":
    main()

Now open your terminal and run the script below.

python cli.py -f "path/to/file.pdf" -c 1000 -n 1 -q "query"

To clear the collection, type:

python cli.py -cl True

If the provided file path does not exist, the script will raise a FileNotFoundError.


The GitHub Repository: https://github.com/sunilkumardash9/pdf-cli-chatbot

Real-world Use Cases

A chatbot running as a CLI tool can be used in many real-world applications, such as:

Academic Research: Researchers often deal with numerous research papers and articles in PDF format. A CLI chatbot could help them extract relevant information, create bibliographies, and organize their references efficiently.

Language Translation: Language professionals can use the chatbot to extract text from PDFs, translate it, and then generate translated documents, all from the command line.

Educational Institutions: Teachers and educators can extract content from educational resources to create customized learning materials or to prepare course content. Students can extract useful information from large PDFs via the chatbot CLI.

Open Source Project Management: CLI chatbots can help open-source software projects manage documentation, extract code snippets, and generate release notes from PDF manuals.

Conclusion

So, this was all about building a PDF Q&A chatbot with a command-line interface, built without using frameworks such as Langchain and Llama Index. Here is a quick summary of the things we covered.

  • Langchain and other AI frameworks can be a great way to get started with AI development. However, it's important to remember that they are not a silver bullet. They can make your code more complex and can cause bloat, so use them only when you need them.
  • Using frameworks makes sense when the complexity of a project would require longer engineering hours if implemented from scratch.
  • A document Q&A workflow can be designed from first principles, without a framework like Langchain.

So, that's it. I hope you liked the article.

Frequently Asked Questions

Q1. What is a PDF chatbot?

A. A PDF chatbot is an interactive bot specifically designed to retrieve information from PDFs.

Q2. What is Langchain used for?

A. LangChain is an open-source framework that simplifies the creation of applications using large language models. It can be used for a variety of tasks, including chatbots, document analysis, code analysis, question answering, and generative tasks.

Q3. Is a chatbot an AI tool?

A. Yes, chatbots are AI tools. They use artificial intelligence (AI) and natural language processing (NLP) to simulate human conversation. Chatbots can be used to provide customer service, answer questions, and even generate creative content.

Q4. What are chatbots for PDFs used for?

A. Chatbots for PDFs are tools that allow you to interact with PDF files using natural language. You can ask questions about the PDF, and the chatbot will try to answer them. You can also ask a PDF chatbot to summarize the PDF or to extract specific information from it.

Q5. Can I chat with a PDF?

A. Yes, with the advent of capable large language models and vector stores, it is possible to chat with PDFs.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
