Pulling data from PDFs is this developers’ challenge and nightmare. Here is an approach using Microsoft Excel and “Data Query” (or is it “Power Query”?)
This is a piece for advanced VBA coders. You can find the workbook here.
The first trick is to get a list of the “pages” or tables in the PDF. That’s important because we’ll need to code custom parsing routines. They’ll only work with specific PDF pages or tables.
To explain my techniques I’ve created a project around extracting data from the U.S. Department of Labor weekly unemployment reports.
They look like this
It has many tables of data, of which most data can be accessed through the DOL’s public database.
One part of the report that isn’t database accessible are the comments from each State.
To figure out how the PDF are structured, internally, I open “Data Query” in Excel and start the data import process for a PDF file. It returns something like this..
Fortunately, these files are fairly standardized at around 12 pages. So I can focus on a few tables at the end.
Before I go on, I’ll point out the most important thing I’ve learned working with PDFs in Excel and Data Query.
Data Query / Power Query is a separate piece of software that answers to no one, least of all Excel VBA; that is, it sometimes says it’s done with a task, but isn’t, and throws an error, like below:
Another weirdness of Dataquery is that it is repackaged for different purposes and software. You can access a version through Excel, or use a standalone “Power BI Desktop” version.
Until this PDF project, when I needed to import data I always tried to do everything in VBA.
In my opinion, Data Query is half-baked and not fully integrated into Excel. When I first tried Data Query I would see “Close and Load” on the left. First thing after opening and it expects that the first thing I’ll do is close it?
“New Query” should be on the left and “Close and Load” should be on the right. That kind of stuff annoys me to no end, but I have to deal with it here. When it comes to PDFs, “Data Query” has some neat tricks up its sleeve.
In the modern world of object oriented programming you’d expect that you can access Data Queries’ object model through VBA. I’d expect to do something like this.
Excuse this pseudo-code, but to give the idea...
None of the above is possible, far as I can tell. Instead, we must either work within Power Query and then take the M Language and kludge it to use values from Excel.
Is the above clear? Unfortunately not. You have to read forum posts where others have figured out these tricks; that is, tricks to get the “M” statements, or formulas, to talk to Excel.
Microsoft created a new language for Data Query called “M” (for Muggles I believe). Here are some M statements in my power query.
The main takeaway is I can jimmy the file name into a pre-existing query and then, in VBA, call the saved query with
ActiveWorkbook.Connections(“Query — GetPDFListOfPagesAndTables”).Refresh
Basic steps are 1) Put in range names in the M statements to communicate with data in Excel sheets. 2) In VBA, run Power Queries to evaluate the PDFs, then pull out the desired data.
If you’re confused, sorry. The main thing is that using the above we have a list of pages and tables from the PDF. We need that so we can select only those pages and tables that have the data we want and on which our parser will work. Moving on.
Below, after checking for an existing query, I create what I call a “slug” query that starts with a basic configuration. I point it to the PDF file, but I put in a placeholder, “PageXXX” for what section I’ll want to work with in the PDF.
Why not do this all in one shot? Again, the integration between Excel and Data Query is almost non-existent in VBA, so when debugging issues I need to make sure I can break between processes.
Next, I run a dummy query, which is a kludge to make sure Data Query has finished the above. The logic is it has to finish A to move onto B.
Again, Excel and Power Query do not communicate with each other as well as they should.
Next, I run the query by changing the “formula” in the object to section “xPage(p)” that I want pulled to that worksheet. Then I add the query to the sheets “List Objects”. When the “.refresh BackgroundQuery = False” is fired the query will run and place the data in the worksheet.
Yet, even with the dummy query above, and my many attempts to set a property in the query to let Excel know it’s truly done, the above occasional error popup will not go away.
Therefore, I have a variable, which a user can set based on the speed of their PC, which will do an old-fashioned sleep.
I end up with a sheet of extracted data like so
It then gets appended to a compiled sheet, with the table’s associated PDF filename, using basic VBA, like so
So there you have it. I run the worksheet in a folder with as many DOL weekly reports as I want and it will create a table, or database, of weekly comments from each State.