How To Get Text from an Image Only PDF File Using Linux
![How To Get Text from an Image Only PDF File Using Linux How To Get Text from an Image Only PDF File Using Linux](https://i0.wp.com/www.jamesjpn.net/wp-content/uploads/2009/11/original1.907498.png?resize=900%2C900&ssl=1)
This article is for technically savvy people, and especially for those who use Linux on their Desktop or Laptop PC as I do.
I wanted to post a 19th-century book, The Black Pope, but the text from the PDF file I downloaded is not extractable using copy and paste. I knew I needed to use OCR software to get the text. On your PC you need pdftoppm and tesseract installed. pdftoppm was already installed on my laptop by default. I just needed to install tesseract which is OCR software. I use Fedora Linux, but this proceedure will work in any distribution of Linux.
First I made a folder for Black_Pope.pdf and moved to file into the folder. Then I opened Terminal inside the folder.
These are the commands I used in Terminal to get the ASCII text.
pdftoppm -png Black_Pope.pdf black-pope
(This made PNG files of each page of the PDF. There were 404 in all.)
for i in black-pope-???.png; do tesseract "$i" "text-$i" -l eng; done;
(This command scans each of the PNG files and creates .txt files of each of them.
cat text-black-pope* > black-pope-complete.txt
(This combined all of the 404 text files into a single file.)
It only took a few minutes! The PC did all the work. Just think how long it would have taken me if I scanned each one of those PNG files one by one, and combined them all one by one. Probably an hour or more.
I wound up with a single file to proofread. After I proofread a section, I removed all the extra line returns with an online tool: Remove line breaks with paragraph restoration before I copied the text into the WordPress post.