I have been trying to find a way to use ChatGPT to “chat” with a local collection of documents, ideally PDFs, TXTs and potentially other formats. It turns out there are a lot of YouTube videos on the topic and a bunch of approaches, but they all seem to come down to the same basic steps:
Read the documents
Convert them into “chunks” of text that can be ingested into GPT
Convert these chunks into “embeddings”
Add the “embeddings” to a Vector Data Store
Ask a question
Query the Vector Data Store for a semantic match
Send that match to the LLM
Present the answer to the user
Potentially, add the question/answer pair to a buffer/memory to use in an ongoing conversation.
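To make the steps above concrete, here is a minimal sketch of the whole loop in plain Python. It is not how any of the tutorials implement it: the embed function is a toy word-count stand-in for a real embedding model, VectorStore is a hypothetical in-memory list rather than a real vector database, and llm is any callable you supply (in practice, a wrapper around an API call). All names are my own, chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (the 'chunks')."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    return [c for c in chunks if c.strip()]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """Hypothetical in-memory store holding (embedding, chunk) pairs."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((embed(chunk), chunk))

    def search(self, question, k=2):
        """Return the k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks, history):
    """Combine matched chunks, earlier Q/A pairs and the new question."""
    lines = ["Answer using the context below.", "", "Context:"]
    lines += [f"- {c}" for c in context_chunks]
    if history:
        lines += ["", "Earlier conversation:"]
        for q, a in history:
            lines += [f"Q: {q}", f"A: {a}"]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def answer(question, store, llm, history):
    """Query the store, send the matches to the LLM, remember the exchange."""
    context = store.search(question, k=2)
    reply = llm(build_prompt(question, context, history))
    history.append((question, reply))
    return reply


if __name__ == "__main__":
    store = VectorStore()
    store.add(chunk_text("The warranty covers parts and labor for two years."))
    store.add(chunk_text("Shipping takes five business days within Europe."))
    fake_llm = lambda prompt: f"(stub reply to a {len(prompt)}-character prompt)"
    print(answer("How long is the warranty?", store, fake_llm, history=[]))
```

A real setup would swap embed for a call to an embedding model and VectorStore for an actual vector database, but the flow of data between the steps stays the same.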
In a video by Alejandro Ao, he provided this miro.com illustration of the process:
Alejandro credits the origin of the diagram to Benny Cheung here.
Here are the three videos that form the basis of my early research:
The three developers each take a similar approach, but their details are different. All three leverage LangChain functions. They generally use OpenAI for the Large Language Model (LLM). They use a number of options for the Vector Data Store:
DuckDB (I see this mentioned, but I am not certain it is a Vector Data Store)
A model is needed not only to answer the questions but also to create the “embeddings” from the document chunks. Both incur a fee from OpenAI if you use their APIs (but the cost is honestly pretty low). You can use alternatives for the embeddings. Alejandro mentions using “instructor-xl” (intro), as it currently ranks higher than OpenAI’s text-embedding-ada-002 on the Massive Text Embedding Benchmark leaderboard.
Some of the things I want to modify include:
Include a persistent vector database – Creating the embeddings is an “expensive” process. After experimentation, I would like to keep the database around so I can query later.
Multiple databases – I have several different topics I want to research and want to keep these separate.
Question both the local docs as well as the broader training data available to models like GPT-4
Provide a web GUI
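The first two items on that list (persistence and separate databases per topic) can be sketched with nothing but the standard library. This is only an illustration of the idea, not what any particular vector database does: the TopicStores class, its one-JSON-file-per-topic layout, and the embed_fn parameter are all my own assumptions. The point is that a chunk gets embedded once, written to disk, and never re-embedded on later runs.

```python
import json
import tempfile
from pathlib import Path


class TopicStores:
    """Hypothetical persistent store: one JSON file per topic under root.

    Topic names are assumed to be simple, filename-safe strings.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, topic):
        return self.root / f"{topic}.json"

    def load(self, topic):
        """Return the saved records for a topic, or [] if none exist yet."""
        path = self._path(topic)
        if not path.exists():
            return []
        return json.loads(path.read_text(encoding="utf-8"))

    def add(self, topic, chunks, embed_fn):
        """Embed only chunks not already stored, then save the topic file.

        embed_fn is whatever produces an embedding (a JSON-serializable
        list of floats) for a chunk; calling it is the expensive step.
        """
        records = self.load(topic)
        known = {r["chunk"] for r in records}
        for chunk in chunks:
            if chunk not in known:
                records.append({"chunk": chunk, "embedding": embed_fn(chunk)})
                known.add(chunk)
        self._path(topic).write_text(json.dumps(records), encoding="utf-8")
        return len(records)

    def topics(self):
        """List the topics that have a saved file."""
        return sorted(p.stem for p in self.root.glob("*.json"))


if __name__ == "__main__":
    # Placeholder embedding so the demo runs without any API key.
    toy_embed = lambda text: [float(len(text))]
    with tempfile.TemporaryDirectory() as folder:
        stores = TopicStores(folder)
        stores.add("cooking", ["Salt the pasta water."], toy_embed)
        stores.add("history", ["Rome was not built in a day."], toy_embed)
        print(stores.topics())
```

Running add a second time with the same chunks does not call embed_fn again, which is where the savings come from when the embedding step costs money or time.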
As I have been tinkering with the various “Chat with your PDF” tutorials, I came across Matthew Berman’s video introduction to Iván Martínez’s PrivateGPT. While the idea here is to use local LLMs like GPT4All for a 100% private implementation, it has many of the features I was looking for. With the help of some ideas from community members, I was able to tweak it to use OpenAI’s API for embeddings and queries. The one thing it is (currently) missing is a web GUI.
Another informative video, from Venelin Valkov, covers using GPT4All with free, local LLMs.