Archives For February 2014

By Derek Willis

As much as we’d love to be able to report that election agencies only deliver results in structured data files, that’s not the case. In many states we’re fortunate to find electronically generated PDFs, which at least hold the promise of data, if not always easy access to it.

Take New Mexico, for example. The Secretary of State’s office provides precinct-level results files for each county, including absentee ballots. But the files are PDFs, and there’s another catch: although the results are typical rows and columns as you might find in a spreadsheet, the column headers listing the candidate names are vertically aligned, like so:

[Image: Bernalillo County]

That causes problems both when copying and pasting directly from the PDF and when using a utility such as Xpdf, which converts electronic PDFs into text files while attempting to preserve the layout. The vertical headers will require a certain amount of manual intervention, but the question is: how much?

We’ve been looking at Tabula, an open-source project created by Manuel Aristarán with the support of ProPublica, La Nación DATA and Knight-Mozilla OpenNews. It helps extract structured data from PDFs while giving users some control over the process. Tabula users draw a box around the area of the PDF they want to extract, and can then copy the table as a CSV or tab-delimited text file.

When I learned about the PDF Liberation Hackathon being held in several cities in January, I gently asked on Twitter what might be possible with a New Mexico results file. Turns out, quite a lot.

Jeremy Merrill, one of Tabula’s developers, responded that Tabula could extract the data, but had the same issue with the vertical headers. Unlike other text extraction utilities, however, Tabula could work around the headers, because it allows you to select an area of the PDF that you want to extract. So I took that route, drawing a box over just the data from the presidential race.

[Image: Tabula admin]

The results are impressive, and Tabula makes it easy to grab or download the data:

[Image: Tabula results]

Some problems remain: the empty columns in the PDF seem to trip up Tabula, so extracting all of the data on each page could require drawing multiple boxes. But that’s still better than trying to clean up a text dump in which the data columns are interspersed at random. Tabula may be especially useful with results files at the local level, which often are published as PDFs.
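To give a sense of the cleanup that can follow an extraction, here is a minimal sketch in Python. It assumes the selected area has been saved from Tabula as CSV text, that the vertical headers were left outside the box, and that the candidate names are typed in by hand. The function name, the header names and the sample numbers are made up for illustration and do not come from the New Mexico files.

```python
import csv
import io


def clean_tabula_rows(csv_text, headers):
    """Drop columns that are blank in every row, then attach hand-typed headers.

    csv_text: CSV text saved from a Tabula selection (assumed to have no header row).
    headers:  column names supplied manually, one per non-empty column.
    """
    # Skip rows that are completely blank.
    rows = [
        row for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)
    ]
    if not rows:
        return []

    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]

    # Keep only the columns that hold a value in at least one row.
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    if len(keep) != len(headers):
        raise ValueError(
            "expected %d columns, found %d" % (len(headers), len(keep))
        )

    return [
        dict(zip(headers, (row[i].strip() for i in keep)))
        for row in padded
    ]


# Made-up sample standing in for an extracted selection with empty columns.
SAMPLE = "PCT 001,,120,,95\nPCT 002,,88,,101\n"
HEADERS = ["precinct", "candidate_a", "candidate_b"]

records = clean_tabula_rows(SAMPLE, HEADERS)
```

Raising an error when the column count does not match the headers is a deliberate choice here: a candidate column that is legitimately blank on one page would otherwise be dropped silently and shift every vote total one column to the left.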

As we work with the data from more states, we’re finding that many will have their own challenges. As a result, we’re not only going to extract the data that we need, we’re also adding a state-specific file to our main repository that will explain the steps we took to retrieve and process the results. That will include listing any software we used, so that people who want to replicate or check our processes can retrace our steps as closely as possible.

Some states that use PDFs will be easier to work with, while others will be much harder. It’s nice to know that there are open source tools that are getting more sophisticated and useful, and that we can make them a part of our effort.


The OpenElections team is very pleased to welcome Geoff Hing as our newest staff member. In his role as OpenElections Developer, Geoff will be working closely with Derek and Serdar on the core architecture of the project, as well as overseeing and assisting volunteer coders as they help wrangle election data from across the country.

Geoff is a technologist and cultural worker who lives in Chicago. He develops technologies that intersect with community information needs, civic data and participatory governance. His work reflects connections with engineering, journalism and do-it-yourself culture. He joins us from recent work for Floodlight Project and The Work Department. In the past he has done data-driven technology projects with Food Genius and Metro Chicago Information Center. (@geoffhing)

Welcome aboard, Geoff!

Georgia has a Commissioner of Labor, a Superintendent of Education and an Attorney General among its statewide elected officers. In Kentucky, there’s an Auditor of Public Accounts. In Virginia, Governor and Lieutenant Governor are elected separately.

One of the challenges for OpenElections is reconciling election results from all 50 states, and doing so in a way that lets our users get the data they want, even across states.

An obvious example is presidential elections, which essentially are 50 different elections for the same office. While it’s relatively easy to identify and reconcile the various ways states refer to the office of the President, doing so for the variety of statewide offices gets a little tricky:

  • Are a treasurer and a comptroller basically the same position?
  • How do we classify offices that seem to cover several jobs, like Florida’s Commissioner of Agriculture and Consumer Services?
  • While every state has an elected Governor, not every state has an Adjutant General (South Carolina) or a Mine Inspector (Arizona). How do we handle those?

And then there are the different titles that state lawmakers hold. Clearly, we need some form of standardization that would allow users to easily compare certain types of statewide elected offices across states (or, crucially, across time within a state, since a number of states have revamped their offices). To that end, we’ve put together a list of some offices and some possible standardized names for them on our wiki.
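One way to picture this kind of standardization is a simple lookup table keyed by state and office title. The Python sketch below is only an illustration: the standardized names in it are hypothetical placeholders, not the ones on our wiki, and the function name is made up. Offices with no entry return None so they can be reviewed by hand rather than guessed at.

```python
# Hypothetical mapping from (state, office title as published) to a
# standardized name. The standardized names are placeholders for illustration.
OFFICE_MAP = {
    ("GA", "commissioner of labor"): "labor_commissioner",
    ("KY", "auditor of public accounts"): "auditor",
    ("FL", "commissioner of agriculture and consumer services"): "agriculture_commissioner",
    ("SC", "adjutant general"): "adjutant_general",
    ("AZ", "mine inspector"): "mine_inspector",
}


def standardize_office(state, raw_name):
    """Return the standardized office name, or None if the office is unmapped."""
    # Normalize case and collapse runs of whitespace before looking up.
    key = (state.strip().upper(), " ".join(raw_name.lower().split()))
    return OFFICE_MAP.get(key)
```

Keying on the state as well as the title leaves room for two states to use the same title for different jobs, and for open questions (such as whether a treasurer and a comptroller belong together) to be settled one entry at a time.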

As before, we’re asking our users and partners to weigh in – does our general approach make sense, or are we overlooking some other schema that would make our job easier? Whatever the case, let us know in our Google Group.