I have about 19,000 PDFs to get into Drupal. Yeah, fun. And there aren’t any easy ways to do it. There don’t appear to be any modules that support this. So, I had to get a little creative. And I have made it work.
Biggest caveat: this doesn’t bring a nice-looking PDF into Drupal with all the wonderful PDF formatting preserved. It only brings in the textual content, which suits my needs because the text is all I really need. This isn’t a great solution, and it may be the biggest one-off of my career, but if you need what the PDF says and not how it looks, this will work for you.
One of the issues here is that the PDFs have a lot of weird formatting in them. Many are actually scans of decades-old paper court documents. They wind up with all sorts of page breaks, table formatting, and other oddities.
This is a high-level overview of the process:
- MS Windows for the OS of the client doing all this
- Convert PDFs into DOCs. Great utility – boxoft.com *
- Use MS Word to remove faulty formatting
- MS Outlook to email docs in the body of the email
- Hotmail for the email transport to an IMAP mailbox on QMail **
- Mailhandler Drupal Module to receive the doc
- Automate the process with Macro Recorder from http://www.jitbit.com ***
** can be any Mailhandler enabled mailbox
*** not free, but a great product with a generous 40-day trial
I’m using MS products for good reasons. I did try to make this work with Open Office, but it doesn’t have the features that I need.
MS Word, used with Outlook and Hotmail, lets you easily send the doc in the body of the email rather than as an attachment. I looked for attachment handling but didn’t really find any. Once the document has been sent in the body, it essentially loses its MS Word attributes and becomes simply formatted text. So, it can be easily processed by Mailhandler.
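To make the "body, not attachment" distinction concrete, here is a minimal sketch in Python. It is not what I actually ran (that was Word plus Outlook); it only shows what a message with the document text in its body looks like. The function name, addresses, and sample text are all made up for illustration, and how the message becomes a node depends on your Mailhandler setup.

```python
from email.message import EmailMessage


def build_body_email(subject: str, text: str, sender: str, recipient: str) -> EmailMessage:
    """Build a message whose body is the document text, with no attachment.

    Hypothetical helper for illustration; not part of the Outlook workflow.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # set_content() with a str produces a single text/plain part
    msg.set_content(text)
    return msg


# Placeholder values; substitute your own mailbox and document text.
msg = build_body_email(
    subject="Example court document",
    text="First paragraph of the document.\n\nSecond paragraph.",
    sender="sender@example.com",
    recipient="drupal-inbox@example.com",
)

print(msg.get_content_type())
print(msg.is_multipart())
```

A message built this way is a single plain-text part, which is roughly what the Word document turns into once it has gone out in the body of the email. Actually sending it (for example with `smtplib`) is left out here.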
Hotmail is necessary; QMail is not. You just need Mailhandler to be able to receive emails and turn them into nodes. That is a project unto itself that I covered here a few months ago.
The Macro Recorder is for the automation part. What I was able to do was create a “map” or “procedure” of sorts, consisting of keystrokes only (well, one mouse click, but no mouse movement), that is consistent for each doc. This “map” opens each file into MS Word from a window that is already open. The map has tabs, backspaces, arrows, and key combinations that are the same every time. If you don’t know keyboard shortcuts, you’ll need to learn them. I suppose that you could use the mouse more, maybe even for the whole thing, but I have used macro recorders before and they are finicky, and I believe that they deal best with the keyboard.
This process will require refinement: you’ll have to play with it. And it is slow. I am currently sending about 3 docs per minute, so all 19,000 are going to take about two weeks. But it is a one-off, and it is going to be really valuable to have the data, so for me it is worth it. And once you get it moving, it doesn’t require much in the way of babysitting.
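A quick sanity check on that estimate, as a small Python sketch. The 3 docs per minute and 19,000 total are the numbers from above; the hours-per-day values are assumptions for illustration only.

```python
def estimate_days(total_docs: int, docs_per_minute: float, hours_per_day: float) -> float:
    """Return the number of days needed to send total_docs at the given rate."""
    minutes = total_docs / docs_per_minute
    hours = minutes / 60
    return hours / hours_per_day


# 24 and 8 hours per day are assumed schedules, not measured ones.
around_the_clock = estimate_days(19_000, 3, 24)
workday_only = estimate_days(19_000, 3, 8)

print(f"{around_the_clock:.1f} days running around the clock")  # 4.4 days
print(f"{workday_only:.1f} days at 8 hours per day")            # 13.2 days
```

So about two weeks lines up with letting the macro run for roughly a working day at a time, and running it around the clock would cut that to under a week.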
The upper image is the original; the bottom is the result. Not great, but it gets the job done!