When we embarked on this quest to bring sanity to election data in the U.S., we knew we were in for a heavy lift.
A myriad of data formats awaited us, along with variations in data quality across states and within them over time. In the past few months, the OpenElections team and volunteers have crafted a system to tame this wild landscape. This post takes a closer look at how we applied this system to Maryland, the first state that we took on to define the data workflow process end to end. Hopefully it helps shine some light on our process and generates ideas on how we can improve things.
The Data Source
Maryland offers relatively clean, precinct-level results on the web. In fact, it provides so many result CSVs (over 700!) that we abstracted the process for generating links to the files, rather than scraping them off the site .
Other states provide harder-to-manage formats such as database dumps and image PDFs that must be massaged into tabular data. We’ve devised a pre-processing workflow to handle these hard cases, and started to apply it in states such as Washington and West Virginia.
The common denominator across all states is the Datasource. It can be a significant effort to wire up code-wise, but once complete, it allows us to easily feed raw results into the data processing pipeline. Our goal in coming months is to tackle this problem for as many states as possible, freeing contributors to work on more interesting problems such as data loading and standardization.
When the datasource was in place, we were ready to load Maryland’s data as RawResult records in Mongo, our backend datastore. The goal was to minimize the friction of initial data loading. While we retained all available data points, the focus in this critical first step was populating a common core of fields that are available across all states.
In Maryland, this meant writing a series of data loaders to handle variations in data formats across time. Once these raw result loaders were written, we turned our attention to cleanups that make the data more useful to end users.
Loading raw results into a common set of fields is a big win, but we’ve set our sights much higher. Election data becomes much more useful after standardizing candidate names, parties, offices, and other common data points.
The types of data transforms we implement will vary by state, and in many cases, one set of cleanups must precede others. Normalizing data into unique contests and candidates is a transform common to all states, usually one that should be performed early in the process.
Transforms let us correct, clean or disambiguate results data in a discrete, easy-to-document, and replicable way. This helps keep the data loading code simple and clear, especially when dealing with varying data layouts or formats between elections.
- parsing candidate names
- standardizing party names
- standardizing office names
- standardizing political jurisdictions
This opened the door to generating totals at the contest-wide level for each candidate.
At this point, you might be getting nervous about all this processing. How do we ensure accuracy with all this data wrangling? Enter data validations, which provide a way to link data integrity checks with a particular transformation, or more broadly check data loading and transformation. In Maryland, for example, we implemented a validation and bound it to a transform that normalizes the format of precinct names. In this case, the validation acts like a unit test for the transform. We also cross-check the loaded and transformed result data in validations that aren’t bound to specific transforms to confirm that we’ve loaded the expected number of results for a particular election or ensure that the sum of a candidate’s sub-racewide vote totals matches up with published racewide totals.
Implementing and running validations has helped us uncover data quirks, such as precinct-level data reflecting only election day vote totals, while result data for other reporting levels includes absentee and other types of votes. Validations have also exposed discrepancies between vote counts published on the State Board of Elections website and ones provided in CSV format. We’ve circled back to Maryland officials with our findings, prompting them to fix their data at the source.
Maryland has been a guinea pig of sorts for the OpenElections project (thank you Maryland!). It’s helped us flesh out a data processing framework and conventions that we hope to apply across the country. Of course, challenges remain: standardizing party names across states; mapping precincts to counties; and sundry other issues we didn’t cover here remain a challenge.
As we tackle more states, we hope to refine our framework and conventions to address the inevitable quirks in U.S. election data . Meantime, we hope this provides a window into our process and gives you all some footing to make it easier to contribute.