In this guide, we will build a ChatGPT for your PDF documents, i.e. an AI that answers your questions based on a particular PDF document.
You can use this to ask questions about your textbooks, ebooks, or anything else, as long as it is in PDF format.
We will be using the OpenAI GPT-3/3.5 API, LangChain, FAISS, and PyPDF2.
Let’s go.
The process to build a ChatGPT for your PDF documents
These are the main steps we will follow to build a ChatGPT for your PDF documents:
- First, we will extract the text from a PDF document, process it, and make it ready for the next step.
- Next, we will use an embedding AI model to create embeddings from this text.
- Finally, we will build the query part that takes the user's question, uses the embeddings created from the PDF document, and calls the GPT-3/3.5 API to answer that question.
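The three steps can be sketched end to end in plain Python. Everything below is a toy stand-in (the function names and the "bag of words" matching are placeholders, not the real PyPDF2/langchain/OpenAI calls shown later in this guide):

```python
def extract_text(pages):
    """Step 1 stand-in: join the text of all PDF pages."""
    return "\n".join(pages)

def embed(chunk):
    """Step 2 stand-in: a bag of words instead of a real embedding vector."""
    return set(chunk.lower().split())

def answer(question, chunks):
    """Step 3 stand-in: return the chunk sharing the most words with the
    question (a real system would pass the chunk to GPT-3/3.5 here)."""
    q = embed(question)
    return max(chunks, key=lambda c: len(q & embed(c)))

pages = ["Refunds are issued within 14 days.", "Shipping takes 3 days."]
chunks = extract_text(pages).split("\n")
best = answer("How long do refunds take?", chunks)
# best -> "Refunds are issued within 14 days."
```

The real pipeline replaces each stand-in with a proper tool, but the shape stays the same: extract, embed, retrieve, answer.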
Requirements to build a ChatGPT for your PDF documents
- An OpenAI API key. We will be using the OpenAI GPT-3/3.5 API for this; grab your key from your OpenAI account.
- Python 3.x installed on your computer.
Install Python packages
First, install the necessary Python packages. Depending on your Python installation, you can use pip install <package> or python -m pip install <package>. Run these from your command line program.
The Python packages you need to install are:
- PyPDF2
- langchain
- openai
- faiss-cpu
- python-dotenv (used below to load the API key from the .env file)
Set up your working directory/folder
Create a new directory or folder, create a .env file inside it, and write the text below into it:
OPENAI_API_KEY=your-openai-api-key
Make sure to replace your-openai-api-key with your actual OpenAI API key.
Import the required Python packages
You can do this in a Jupyter Notebook, a Google Colab notebook, or a Python .py file on your computer.
Make sure it is in the same folder as the .env file you created above.
# import the modules
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
import os

# load the .env file containing your API key
from dotenv import load_dotenv
load_dotenv()
Process the PDF
We start by reading in the PDF document.
reader = PdfReader('my_pdf_doc.pdf')

# collect the text from every page
raw_text = ''
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    if text:
        raw_text += text

# split the text into overlapping chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
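With a chunk_size of 1000 and a chunk_overlap of 200, consecutive chunks share material with their neighbours, so a sentence cut off at one chunk boundary still appears whole in the next chunk. A minimal sketch of the idea in plain Python (not the actual CharacterTextSplitter implementation, which also respects the separator):

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Slide a fixed-size window over the text, stepping by
    chunk_size - chunk_overlap so consecutive chunks share
    chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
# windows start at 0, 800, 1600, 2400 -> 4 chunks,
# and the last 200 characters of one chunk open the next
```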
Create embeddings
Now we set up the embedding model that will turn the text chunks we created above into embeddings. We also pickle it so the query code can reuse the same model later.

embeddings = OpenAIEmbeddings()

# save the embedding model for later use
import pickle
with open("foo.pkl", 'wb') as f:
    pickle.dump(embeddings, f)
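An embedding maps each chunk to a vector of numbers, so "similar meaning" becomes "small angle between vectors" — that is what FAISS exploits in the next step. A toy illustration with hand-made 2-D vectors (real OpenAI embeddings have around 1500 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings": chunks about the same topic point the same way.
query_vec = [1.0, 0.2]
chunk_vecs = {
    "chunk about pricing": [0.9, 0.1],
    "chunk about history": [0.1, 0.95],
}
best = max(chunk_vecs, key=lambda k: cosine_similarity(query_vec, chunk_vecs[k]))
# best -> "chunk about pricing"
```

Similarity search over the PDF is exactly this comparison, done efficiently over thousands of chunks by FAISS.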
Query the PDF document using the embeddings
First, we load the saved embedding model and build a FAISS index over the text chunks:

with open("foo.pkl", 'rb') as f:
    embeddings = pickle.load(f)

docsearch = FAISS.from_texts(texts, embeddings)

There are two ways to query the PDF document using the embeddings.

The first method lists the chunks most similar to the query, which are likely to contain the answer:

query = "Your query here"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
The second method uses a question-answering chain to generate an answer from the matching chunks:

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
chain.run(input_documents=docs, question=query)
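The chain_type="stuff" strategy simply "stuffs" every retrieved chunk into one prompt and asks the model to answer from that context. A rough sketch of the kind of prompt it assembles (the exact langchain template differs):

```python
def build_stuff_prompt(chunks, question):
    """Concatenate the retrieved chunks into a single context block,
    then append the user's question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(["Chunk one.", "Chunk two."], "What is covered?")
```

Because everything is stuffed into one prompt, this chain type only works while the retrieved chunks fit in the model's context window; that is why we retrieve a handful of similar chunks rather than the whole PDF.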
Conclusion
You can use this technique for all kinds of text data beyond just PDFs. You can also use the techniques explained here to turn this into a web-based knowledge retrieval system.