A couple of years back I was handpicked (not really 🙂) for a project that required me to build an application that would extract text from PDFs uploaded into the system, using a set of baseline co-ordinates mapped against a sample PDF at the start. The project was named Invoice Analyser (IA), since all the PDFs were actually invoices of our client.
I still vividly remember the day I was made aware of the requirement on a call. Soon after the call ended, I knew it was going to be tough, but I kept faith in myself.
If I had to sum up the requirements, it would fit in 3 important points.
1. Display a PDF online (the baseline/sample PDF).
2. Map a set of co-ordinates for each marked-up area in the sample PDF (the one displayed online).
3. Extract the text at those co-ordinates from any PDF uploaded from then on. This is the important part: the co-ordinates form a baseline sample that, once created, is used against all similar-looking PDFs to pull text out of the relevant areas.
Knowing all of the above, I started with something that I believe everybody does (pray, no 🙂): googling. Luckily, I managed to knock off task 1 thanks to an amazing library called PDF.js.
For those who don’t know PDF.js, it’s an amazing open source library (maintained by Mozilla) that focuses on a general-purpose, web standards-based platform for parsing and rendering PDFs. As far as I remember, it uses the HTML5 canvas to display the PDF online (so beware, it might not work for you in a non-HTML5 browser).
Now, with task 1 down, I concentrated on tasks 2 and 3, and this was challenging. On examination, I found that the co-ordinates mapped on the PDF as displayed online by PDF.js were very different from those in the actual PDF. Here is a link in which I asked this question on Stack Overflow (seriously, after reading the first answer by @DanielLi, my confidence was deeply dented). Then, one fine day, I finally got it. “That’s it. I found it. Eureka Eureka!” (was my initial reaction).
So, to map the precise co-ordinates in PDF.js with respect to the actual PDF, the one thing that played an important role was ASPECT RATIO: even though the PDF rendered by PDF.js had different co-ordinates from the original PDF, it still maintained the same aspect ratio. So, with a little bit of jQuery and a little bit of mathematics, I finally managed to nail task 2. I have also answered my own question here, describing how I did it.
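To make the aspect-ratio idea concrete, here is a minimal sketch of the kind of calculation involved. It is not the exact code from the project: the function name `canvasRectToPdfRect` and the shape of its arguments are hypothetical. It also assumes that the rendered canvas measures from the top-left corner while the PDF measures from the bottom-left in points, which is the PDF default but may not hold for every document.

```javascript
// Hypothetical helper (not from the original project): converts a rectangle
// drawn on the rendered canvas into the co-ordinate space of the actual PDF.
//   rect       - { x, y, width, height } in canvas pixels, origin at top-left
//   canvasSize - { width, height } of the rendered canvas in pixels
//   pdfSize    - { width, height } of the real page in PDF points
function canvasRectToPdfRect(rect, canvasSize, pdfSize) {
  // Because the aspect ratio is preserved, both ratios should be (almost)
  // identical; they are computed separately to tolerate rounding.
  const scaleX = pdfSize.width / canvasSize.width;
  const scaleY = pdfSize.height / canvasSize.height;

  const width = rect.width * scaleX;
  const height = rect.height * scaleY;
  const x = rect.x * scaleX;

  // Assumption: PDF origin is bottom-left, so the Y axis is flipped and
  // y refers to the bottom edge of the rectangle.
  const y = pdfSize.height - (rect.y + rect.height) * scaleY;

  return { x, y, width, height };
}

// Example: a US Letter page (612 x 792 points) rendered at twice its size.
const mapped = canvasRectToPdfRect(
  { x: 100, y: 200, width: 300, height: 50 },
  { width: 1224, height: 1584 },
  { width: 612, height: 792 }
);
console.log(mapped);
```

The resulting rectangle is what would be stored as part of the baseline and later handed to the text extraction step.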
For task 3, I needed a PDF library that would return the text for a supplied set of co-ordinates. I went with iText (choosing iText over PDFBox was a personal choice).
Disclosure:
The project is almost 3 years old now, and I’m aware that PDF.js has matured quite significantly since. If your end goal differs from mine (I needed to create baseline co-ordinates from a sample and use them to extract text from any PDF uploaded in future), you might want to look at the following example.
Also, if you have a requirement similar to mine, you might want to look at this answer as well.
I stumbled upon this issue as the OP did and the answer helped me a lot.
So, to map the exact co-ordinates in this case, just interchange your length and breadth based on the page orientation.
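As a rough illustration of that swap, here is a small sketch. The function name `effectivePageSize` is hypothetical, and it assumes the orientation is available as a rotation in degrees that is a multiple of 90; how you obtain that value depends on the library you use.

```javascript
// Hypothetical helper: returns the page size to use for co-ordinate mapping,
// swapping width and height when the page is rotated onto its side.
//   pageSize - { width, height } of the unrotated page
//   rotation - page rotation in degrees (assumed to be a multiple of 90)
function effectivePageSize(pageSize, rotation) {
  // Normalise to 0..359 so negative values such as -90 are handled too.
  const normalised = ((rotation % 360) + 360) % 360;
  const sideways = normalised === 90 || normalised === 270;

  return sideways
    ? { width: pageSize.height, height: pageSize.width }
    : { width: pageSize.width, height: pageSize.height };
}

// A portrait US Letter page rotated by 90 degrees is treated as landscape.
console.log(effectivePageSize({ width: 612, height: 792 }, 90));
```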
All in all, the project was a serious challenge, and being able to knock it off within the specified time frame is something I’m still proud of.
I have open sourced a demo version of the OCR which can give you a head start in building something similar to what I did.