Extracting text from PDF using ChatGPT
Step 1: Choose File
First, choose the PDF you want to extract text from. I chose a chapter from the CBSE class 10 textbook: https://ncert.nic.in/textbook.php?jess3=1-5
Step 2: Extract Text
Use an off-the-shelf tool like pdftotext. Download the above file and run the pdftotext command.
pdftotext -layout "downloaded file path" output.txt
Sometimes the text will not be extracted properly. In that case, you can use an OCR tool such as OCRmyPDF to overlay a text layer on your PDF.
ocrmypdf "downloaded file" output.pdf --force-ocr
Remember, you might need to use --force-ocr, because the PDF can contain both text and images, and by default OCRmyPDF stops when it finds a page that already has text.
Once ocrmypdf has done its job, run the output through pdftotext.
pdftotext -layout output.pdf output.txt
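The commands above can also be chained in a small script. The following is a minimal Python sketch, not a definitive implementation: it assumes pdftotext and ocrmypdf are installed and on the PATH, and the function names and the `looks_empty` heuristic (with its 200-character threshold) are hypothetical choices made for illustration. The heuristic only catches near-empty output; garbled but non-empty text still needs a manual look.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf_path, txt_path):
    """Command line for pdftotext, keeping the page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def ocrmypdf_cmd(pdf_path, ocr_pdf_path):
    """Command line for OCRmyPDF, forcing OCR on every page."""
    return ["ocrmypdf", "--force-ocr", str(pdf_path), str(ocr_pdf_path)]


def looks_empty(text, min_chars=200):
    """Hypothetical heuristic: very few letters or digits means extraction failed."""
    return sum(ch.isalnum() for ch in text) < min_chars


def extract_text(pdf_path, workdir="."):
    """Run pdftotext, and fall back to OCR when the result looks empty."""
    workdir = Path(workdir)
    txt_path = workdir / "output.txt"
    subprocess.run(pdftotext_cmd(pdf_path, txt_path), check=True)
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    if looks_empty(text):
        # No usable text layer: add one with OCR, then extract again.
        ocr_pdf = workdir / "output.pdf"
        subprocess.run(ocrmypdf_cmd(pdf_path, ocr_pdf), check=True)
        subprocess.run(pdftotext_cmd(ocr_pdf, txt_path), check=True)
        text = txt_path.read_text(encoding="utf-8", errors="replace")
    return text
```

Calling `extract_text("chapter.pdf")`, where `chapter.pdf` is a placeholder for the downloaded file, writes output.txt in the current directory and returns its contents.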
Using the above process, we can convert the PDF to text, which you can see in the gist and the image below.
As you can see, the text is garbled, and since it went through an OCR layer, it also contains many mistakes.
Step 3: Use ChatGPT to Fix the Text
Just use the prompt “Fix the following text” and paste the OCR text below it. ChatGPT does its magic and generates the text shown below.
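A whole chapter may be too long to paste in one go. As a rough sketch, the Python helper below prepends the same prompt to pieces of the OCR text so each piece can be pasted as a separate message. The 6000-character limit and the function names are assumptions for illustration, not a documented ChatGPT limit.

```python
PROMPT = "Fix the following text"


def split_into_chunks(text, max_chars=6000):
    """Group whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def build_prompts(ocr_text, max_chars=6000):
    """Return one ready-to-paste message per chunk of OCR text."""
    return [f"{PROMPT}\n\n{chunk}" for chunk in split_into_chunks(ocr_text, max_chars)]
```

Splitting on line boundaries keeps sentences from being cut in the middle of a word, which would give ChatGPT even more damaged input to repair.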
The text from the actual textbook is below:
As you can see, ChatGPT took a mangled piece of text and was able to recreate the original text from the OCR output.
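To put a rough number on "recreate", one option is to compare both the raw OCR text and the fixed text against the textbook text. The sketch below uses Python's standard difflib module for a word-level similarity score. This is an illustrative check that was not part of the process above, and the function names are hypothetical.

```python
import difflib


def normalize(text):
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(candidate, reference):
    """Word-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    a = normalize(candidate).split()
    b = normalize(reference).split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

If `similarity(fixed_text, book_text)` is clearly higher than `similarity(ocr_text, book_text)`, the fix moved the text closer to the original. The three variable names stand for strings you load yourself.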