PDF Chatbot

I have been looking for a way to use ChatGPT to “chat” with a local collection of documents, ideally PDFs, text files and potentially other formats. It turns out there are a lot of YouTube videos on the topic and a bunch of approaches, but they all seem to come down to the same basic steps:

  • Read the documents
  • Split them into “chunks” of text small enough to be passed to GPT
  • Convert these chunks into “embeddings”
  • Add the “embeddings” to a Vector Data Store
  • Ask a question
    • Query the Vector Data Store for a semantic match
    • Send that match, along with the question, to the LLM
  • Present the answer to the user
  • Potentially, add the question/answer pair to a buffer/memory to use in an ongoing conversation.
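To make the steps above concrete, here is a minimal sketch in plain Python. It is not what any of the videos do: the `embed` function is a toy word-count vector standing in for a real embedding model (so it matches on shared words, not meaning), `VectorStore` is a hypothetical in-memory list standing in for a real vector database, and the call to the LLM itself is left out. All names here are my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (chunk, vector) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, question, k=2):
        """Return the k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks, history=()):
    """Assemble the text that would be sent to the LLM."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    lines += [f"- {chunk}" for chunk in context_chunks]
    if history:
        # Earlier question/answer pairs act as the conversation memory.
        lines += ["", "Previous conversation:"]
        for past_question, past_answer in history:
            lines += [f"Q: {past_question}", f"A: {past_answer}"]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    store = VectorStore()
    store.add(chunk_text("Pinecone is a hosted vector database. " * 3, chunk_size=60, overlap=10))
    question = "Which vector database is hosted?"
    print(build_prompt(question, store.search(question, k=1)))
```

In a real setup, `embed` would be replaced by a call to an embedding model and `VectorStore` by one of the vector stores listed further down; the overall flow stays the same.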

In a video by Alejandro Ao, he provided this miro.com illustration of the process:

Alejandro credits the origin of the diagram to Benny Cheung here.

Here are the three videos that form the basis of my early research:

Alejandro Ao
TechLead
Liam Ottley

And their code repositories:

The three developers each take a similar approach, but their details are different. All three leverage LangChain functions. They generally use OpenAI for the Large Language Model (LLM). They use a number of options for the Vector Data Store:

  • Pinecone
  • FAISS
  • ChromaDB
  • DuckDB (I see this mentioned, but I am not certain it is a Vector Data Store)

The LLM is needed to answer the questions, and an embedding model is needed to create the “embeddings” from the document chunks. Both incur a fee from OpenAI if you use their APIs (but the cost is honestly pretty low). You can use alternatives. Alejandro mentions using “instructor-xl” for the embeddings, as it currently ranks higher than OpenAI’s text-embedding-ada-002 on the Massive Text Embedding Benchmark leaderboard.

My changes

Some of the things I want to modify include:

  • Include a persistent vector database – Creating the embeddings is an “expensive” process. After experimentation, I would like to keep the database around so I can query later.
  • Multiple databases – I have several different topics I want to research and want to keep these separate.
  • Question both the local docs as well as the broader training data available to models like GPT-4
  • Provide a web GUI
Alejandro’s web GUI, built with Streamlit
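The first two items on that list (a persistent database, and one database per topic) can be sketched roughly as follows. This is only an illustration of the idea under my own assumptions: it saves each topic’s chunks and vectors to its own JSON file, whereas a real setup would rely on the persistence features of whichever vector store is used. The function names, the file layout, and the `embed_fn` parameter are all hypothetical.

```python
import json
from pathlib import Path


def store_path(base_dir, topic):
    """Each topic gets its own file, which keeps the databases separate."""
    return Path(base_dir) / f"{topic}.json"


def save_store(base_dir, topic, records):
    """Write a topic's records (text plus vector) to disk."""
    path = store_path(base_dir, topic)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records), encoding="utf-8")


def load_store(base_dir, topic):
    """Read a topic's records back, or return an empty list if none exist."""
    path = store_path(base_dir, topic)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def get_or_build(base_dir, topic, chunks, embed_fn):
    """Reuse saved embeddings when present; otherwise compute and save them.

    embed_fn is a placeholder for whatever turns a chunk into a list of floats.
    """
    records = load_store(base_dir, topic)
    if records:
        return records  # skip the expensive embedding step
    records = [{"text": chunk, "vector": embed_fn(chunk)} for chunk in chunks]
    save_store(base_dir, topic, records)
    return records
```

The point of `get_or_build` is that the expensive embedding step runs only the first time a topic is ingested; later sessions simply load the saved file for that topic.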

PrivateGPT

While tinkering with the various “Chat with your PDF” tutorials, I came across Matthew Berman’s video introduction to Iván Martínez’s PrivateGPT. While the idea here is to use local LLMs like GPT4All for a 100% private implementation, it has many of the features I was looking for. With the help of some ideas from community members, I was able to tweak it to use OpenAI’s API for embeddings and queries. The one piece it is currently missing is a web GUI.

Another informative video was from Venelin Valkov on using GPT4All with free, local LLMs.
