Archives For Results

The Price of OpenElections

December 25, 2015

As we wrap up OpenElections' work in 2015, we'd like to give you an update on how we've spent not only our time but also the money we've received, particularly from the John S. and James L. Knight Foundation's Knight News Challenge. Most of the money we've spent since mid-2013 has gone to salaries for our project manager and a single developer. Neither of the project's co-founders has been paid for working on OpenElections, and we've tried to keep our operations pretty lean.

While our initial grant funding from Knight is nearly exhausted, we’ve made good progress and will keep going. In the past few months we’ve added a few more states (Louisiana, Missouri and Virginia) and we have volunteers working on Wisconsin, Georgia and Oregon, among others. We’ve revised our volunteer documentation to make it easier to understand what we’re doing and how you can help.

In most states, getting county-level data isn't too much of a problem; that data is usually freely available online, if not always in native electronic formats. But we've always wanted to develop a resource that can offer precinct-level results where they are available. Here's why: counties can be homogenous, while precincts are smaller, more distinct political units that lend themselves to more sophisticated analysis. Candidates and their campaigns care about precinct results. Journalists and researchers should, too.

Some states make precinct-level data available for free, which is a great service to the public. They include Louisiana, Maryland, Wyoming, West Virginia, Virginia, North Carolina and Florida, among others. Some states, like Pennsylvania, Colorado and Utah, charge a nominal fee for precinct results. But for other states, precinct results are only available county-by-county, and that takes both time and money. We’ve written about Oregon in the past, and we’d like to offer it as an example of the price of precinct results.

The bad news is that it's not uniform, even within a state. In Oregon we've spent more than $1,000 to obtain precinct-level results covering elections from 2000 to 2014, although in many cases we don't have all of those years. Some counties simply don't have results to give us from before 2010. Crook County was unable to find precinct results for the 2010 general and 2012 primary elections, while a number of counties don't have results from before 2008. In other cases, price was a factor: we'll only have precinct results for 2010-2014 from Tillamook County because the clerk there charged us $222.75 for results for those years. Lake County charges $50 an hour for pulling the results files and another $0.25 a page for copying them. We've yet to receive those results, so we don't know what the final cost for Lake will be.

The good news is that when we do request election results that aren't freely available online, we're posting them on our GitHub site in state-specific repositories. That way other organizations or individuals won't have to repeat our work or pay for results that we've already gotten. We want you to use what we've gathered, whether that's CSV files or original PDFs. That's our holiday gift to you. We'll be back at it in 2016, when there are more elections coming up, we hear.

For OpenElections volunteers coming to NICAR in Atlanta next month, we’ve got a challenge for you: help us tackle Georgia election results.

As we did last year in Baltimore, OpenElections will hold an event on Sunday, March 8, with the goal of writing scrapers and parsers to load 2000-2014 election results from the Peach State, and we’re looking for some help. It’s a great way to get familiar with the project and see what our processes are.

Georgia offers a mix of tasks, from scraping HTML results to using our Clarify library to parse XML from more recent elections. We're looking for people who have some familiarity with Python and election results, but we're happy to guide those new to the process, too. Thanks to our volunteers, we already have a good record of where the state stores election results data.

Here's how the process will work: we'll start by reviewing the state of the data – what's available for which elections – and then work on a datasource file that connects that data to our system. After that, we'll begin writing code to load results data, using other states as our models. As part of that process, we'll pre-process HTML results into CSVs that we store on GitHub.
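To give a flavor of that pre-processing step, here's a minimal sketch that turns an HTML results table into a CSV. The URL, table layout and output filename below are hypothetical placeholders, not Georgia's actual pages:

    import csv

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical results page; Georgia's real URLs and layout will differ.
    URL = "http://sos.example.gov/results/2008/general/county.html"

    html = requests.get(URL).text
    table = BeautifulSoup(html, "html.parser").find("table")

    # One CSV row per table row; the filename loosely follows our
    # date__state__race naming style and is illustrative only.
    with open("20081104__ga__general__county.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
            if cells:
                writer.writerow(cells)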

If you're interested in helping out, there are two things to do: first, let us know by emailing openelections@gmail.com or on Twitter at @openelex. Second, take the time to set up the development environment on your laptop following the instructions here. We're looking forward to seeing you in Atlanta!

When we released our initial dashboard for downloading election results in July, we wanted to make it easy for anyone to grab CSV files of raw results with just a browser. We’ve continued adding states to our results site, the latest being North Carolina, Florida and — for a few elections — Mississippi. Pennsylvania will be on the way soon.

We also wanted our results to be usable by developers, and we're taking advantage of GitHub to make that easier. Each time we publish raw results data (which hasn't been standardized beyond geography), we publish it first to a GitHub repository for that state. For example, you can find a repository for Mississippi results that can be cloned and/or accessed via API, avoiding manual downloads. The naming convention for the repositories is the same: openelections-results-{state}, and you might find partial results for states that don't yet appear on the download map (like Iowa) because they're still in progress.
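If you'd rather not clone, here's a minimal sketch that lists what's in the Mississippi repository via GitHub's contents API (the requests library is assumed, and the directory layout inside each repo can vary, so this just lists the top level):

    import requests

    # Repositories follow the openelections-results-{state} convention.
    repo = "openelections/openelections-results-ms"
    resp = requests.get(f"https://api.github.com/repos/{repo}/contents/")
    resp.raise_for_status()

    for entry in resp.json():
        # Files expose a direct download_url; directories need another call.
        print(entry["type"], entry["path"], entry.get("download_url"))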


Using GitHub has two advantages for us: it maintains a history of published changes, and GitHub Pages provides a filesystem for storing the raw CSVs that power the results site downloads. And should we need to move the CSV downloads to another location, we can do that, too. All of this underscores our commitment to using existing standards and practices rather than inventing new ones.

So if you were looking for election results CSVs as part of your holiday plans, we’ve got two ways to get them. Enjoy, and Happy New Year!

Introducing Clarify

November 26, 2014

An Open Source Elections-Data URL Locator and Parser from OpenElections

By Geoff Hing and Derek Willis, for Knight-Mozilla OpenNews Source Learning

[Screenshot: Kentucky statewide summary page]

State election results are like snowflakes: each state (often each county) produces its own special website to share the vote totals. For a project like OpenElections, that involves finding results data and figuring out how to extract it. In many cases, that means scraping.

But in our research into how election results are stored, we found that a handful of sites use a common vendor: Clarity Elections, which is owned by SOE Software. States that use Clarity generally share a common look and features, including statewide summary results, voter turnout statistics, and a page linking to county-specific results.

The good news is that Clarity sites also include a "Reports" tab that offers structured data downloads in several formats, including XML, XLS, and CSV. The results data come packaged in .ZIP files, so they aren't particularly large or unwieldy. But there's a catch: the URLs aren't easily predictable. Here's a URL for a statewide page:

http://results.enr.clarityelections.com/KY/15261/30235/en/summary.html

The first numeric segment—15261 in this case—uniquely identifies this election, the 2010 primary in Kentucky. But the second numeric segment—30235—represents a subpage, and each county in Kentucky has a different one. Switch over to the page listing the county pages, and you get all the links. Sort of.

The county-specific links, which lead to pages that have structured results files at the precinct level, actually involve redirects, but those secondary numeric segments in the URLs aren’t resolved until we visit them. That means doing a lot of clicking and copying, or scraping. We chose the latter path, although that presents some difficulties as well. Using our time at OpenNews’ New York Code Convening in mid-November, we created a Python library called Clarify that provides access to those URLs containing structured election results data and parses the XML version of it. We’re already using it in OpenElections, and now we’re releasing it for others who work in states that use Clarity software.
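Here's a minimal sketch of how that plays out, following the usage documented in Clarify's README (check the README for current method names, which may have changed since this was written):

    import clarify

    # The Kentucky 2010 primary summary page from the example above.
    j = clarify.Jurisdiction(
        url="http://results.enr.clarityelections.com/KY/15261/30235/en/summary.html",
        level="state",
    )

    # Resolve the county subpages, following the redirects described above,
    # and print each county's structured XML report URL.
    for county in j.get_subjurisdictions():
        print(county.name, county.report_url("xml"))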

See full piece on Source Learning

At ONA14 in Chicago in late September we unveiled the new OpenElections data download interface. We presented at the Knight Foundation's Knight Village during their office hours for featured News Challenge projects, as well as during a lightning talk. OpenElections' Geoff Hing and Sara Schnadt showed off their handiwork, built on in-depth discussions and feedback from many data journos. The crowd at ONA was receptive, and the people we talked to were keen to start having access to the long-awaited data from the first few states.

[Screenshot: data download map view]

As you can see from the data map view above, only three states have data available so far: Maryland, West Virginia and Wyoming, for which you can download 'raw' data. For our purposes, this means that you can get official data at the most common results reporting levels, with the most frequently used fields identified but without any further standardization. We will have 'raw' data for all the states in the next few months, and will work on fully cleaned and standardized data for all the states after this initial process is complete.

[Screenshot: detailed data view]

As things progress, you will see updates to both the map view and the detailed data view where you can see the different reporting levels that have data ready for download so far.

[Screenshot: download table with reporting levels and download icons]

A pink download icon indicates available data, and a grey icon indicates that data exists for a particular race at a particular reporting level, but that we don’t yet have it online.

[Screenshot: Florida page with race selection tool]

The race selection tool at the top of the page includes a visualization that gives an overview of all the races in our timespan, and a slider for selecting a date range to review races in the download table. For states like Maryland (shown in the full page view above), there are only two races every two years, so this slider isn't so crucial, but for states like Florida (directly above), it can be quite useful.

We encourage you to take the interface for a spin, and tell us what you think! And, if you would like to help us get more data into this interface faster, and you are fairly canny with Python, we would love to hear from you. You can learn more about what this would entail here.


As part of National Day of Civic Hacking, we are organizing an OpenElections challenge for the hacking events at locations all over the country on Saturday, May 31 and Sunday, June 1.

If you are attending one of these events and would like to join our effort to write scrapers for election results, let us know!

Write Scrapers for us…
Help us extend our core scraper architecture to create a series of custom scrapers that account for the idiosyncrasies in how each state structures data, stores it, and makes it available.

**Our docs for this process are now up on our site. Look here to see what's involved in joining in.**

Your time and expertise would be most appreciated either day. Also, feel free to join in from home.

If you would like to help out, email sschnadt.projects@gmail.com or tweet at us @OpenElex, either before the event or on the day. Our team will be online and available to get you set up.

Thank you!

The OpenElections Team

When we embarked on this quest to bring sanity to election data in the U.S., we knew we were in for a heavy lift.

A myriad of data formats awaited us, along with variations in data quality across states and within them over time. In the past few months, the OpenElections team and volunteers have crafted a system to tame this wild landscape. This post takes a closer look at how we applied that system to Maryland, the first state where we defined the data workflow end to end. Hopefully it shines some light on our process and generates ideas on how we can improve things.

The Data Source

Maryland offers relatively clean, precinct-level results on the web. In fact, it provides so many result CSVs (over 700!) that we abstracted the process for generating links to the files, rather than scraping them off the site.
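As a sketch of what "generating links" means (the URL template below is a hypothetical stand-in for the state's actual pattern):

    # Hypothetical URL template for Maryland precinct-results CSVs; the real
    # pattern differs, but the idea is the same: build links, don't scrape them.
    BASE = "http://elections.example.md.us/{year}/election_data"

    def result_urls(year, election_type, counties):
        """Yield one precinct-results CSV URL per county for a given election."""
        for county in counties:
            yield (f"{BASE.format(year=year)}/"
                   f"{county}_By_Precinct_{election_type}_{year}.csv")

    for url in result_urls(2012, "General", ["Allegany", "Anne_Arundel"]):
        print(url)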

Other states provide harder-to-manage formats such as database dumps and image PDFs that must be massaged into tabular data. We’ve devised a pre-processing workflow to handle these hard cases, and started to apply it in states such as Washington and West Virginia.

The common denominator across all states is the Datasource. Wiring one up can be a significant effort, but once complete, it lets us easily feed raw results into the data processing pipeline. Our goal in coming months is to tackle this problem for as many states as possible, freeing contributors to work on more interesting problems such as data loading and standardization.
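In code terms, a datasource boils down to a set of mapping records along these lines; the field names and values below are illustrative rather than our exact schema:

    def mappings():
        """One record per raw results file: election metadata, the source URL,
        and a standardized local filename for the downloaded copy."""
        return [
            {
                "election": "md-2012-11-06-general",
                "raw_url": "http://elections.example.md.us/2012/election_data/"
                           "Allegany_By_Precinct_General_2012.csv",  # hypothetical
                "generated_filename": "20121106__md__general__allegany__precinct.csv",
                "ocd_id": "ocd-division/country:us/state:md/county:allegany",
            },
        ]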

Raw Results

When the datasource was in place, we were ready to load Maryland’s data as RawResult records in Mongo, our backend datastore. The goal was to minimize the friction of initial data loading. While we retained all available data points, the focus in this critical first step was populating a common core of fields that are available across all states.
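A simplified sketch of that first load step, assuming pymongo and a handful of common core fields (the CSV column names here are hypothetical stand-ins for one Maryland layout):

    import csv

    from pymongo import MongoClient

    db = MongoClient()["openelections"]

    def load_raw_results(csv_path, election_id):
        """Load one results file into raw_results as-is, mapping a common
        core of fields and keeping friction to a minimum."""
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                db.raw_results.insert_one({
                    "election_id": election_id,
                    "office": row["Office Name"],        # hypothetical columns
                    "candidate": row["Candidate Name"],
                    "party": row["Party"],
                    "jurisdiction": row["Precinct"],
                    "votes": int(row["Total Votes"]),
                })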

In Maryland, this meant writing a series of data loaders to handle variations in data formats across time. Once these raw result loaders were written, we turned our attention to cleanups that make the data more useful to end users.

Transforms

Loading raw results into a common set of fields is a big win, but we’ve set our sights much higher. Election data becomes much more useful after standardizing candidate names, parties, offices, and other common data points.

The types of data transforms we implement will vary by state, and in many cases, one set of cleanups must precede others. Normalizing data into unique contests and candidates is a transform common to all states, usually one that should be performed early in the process.

Transforms let us correct, clean or disambiguate results data in a discrete, easy-to-document, and replicable way.  This helps keep the data loading code simple and clear, especially when dealing with varying data layouts or formats between elections.

In Maryland, we used the core framework to create unique Contest and Candidate records for precinct results.

This opened the door to generating totals at the contest-wide level for each candidate.
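Here's a sketch of a transform in that spirit, collapsing raw results into unique contest/candidate pairs and rolling votes up to contest-wide totals (the function and field names are illustrative, not our actual transform API):

    from collections import defaultdict

    def create_unique_contests_and_candidates(raw_results):
        """Normalize office and candidate strings, then total each
        candidate's precinct votes to get contest-wide figures."""
        totals = defaultdict(int)
        for r in raw_results:
            contest = r["office"].strip().title()
            candidate = " ".join(r["candidate"].split()).title()
            totals[(contest, candidate)] += r["votes"]
        return totals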

Validations

At this point, you might be getting nervous about all this processing. How do we ensure accuracy with all this data wrangling? Enter data validations, which provide a way to link data integrity checks to a particular transformation, or to check data loading and transformation more broadly. In Maryland, for example, we implemented a validation and bound it to a transform that normalizes the format of precinct names; in this case, the validation acts like a unit test for the transform. We also cross-check the loaded and transformed result data in validations that aren't bound to specific transforms, to confirm that we've loaded the expected number of results for a particular election or that the sum of a candidate's sub-racewide vote totals matches the published racewide totals.
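As a sketch, that last cross-check amounts to something like this (the names and data shapes are illustrative):

    def validate_racewide_totals(candidate_precinct_votes, published_racewide):
        """candidate_precinct_votes maps (contest, candidate) to a list of
        precinct-level vote counts; published_racewide maps the same key to
        the racewide total published by the state."""
        for key, expected in published_racewide.items():
            observed = sum(candidate_precinct_votes.get(key, []))
            assert observed == expected, (
                f"{key}: precinct votes sum to {observed}, "
                f"but published racewide total is {expected}"
            )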

Implementing and running validations has helped us uncover data quirks, such as precinct-level data reflecting only election day vote totals, while result data for other reporting levels includes absentee and other types of votes. Validations have also exposed discrepancies between vote counts published on the State Board of Elections website and ones provided in CSV format.  We’ve circled back to Maryland officials with our findings, prompting them to fix their data at the source.

Summary

Maryland has been a guinea pig of sorts for the OpenElections project (thank you, Maryland!). It's helped us flesh out a data processing framework and conventions that we hope to apply across the country. Of course, challenges remain: standardizing party names across states, mapping precincts to counties, and sundry other issues we didn't cover here.

As we tackle more states, we hope to refine our framework and conventions to address the inevitable quirks in U.S. election data. Meantime, we hope this provides a window into our process and gives you all some footing to make it easier to contribute.

As our volunteers have gathered details on the scope and availability of election results from across the country, one thing became clear: not all election results are created equal.

Some states provide results data in multiple formats and variations; all you have to do is choose and click. Florida has a download for every election. In Pennsylvania, we found that for $7 the state provides a CD with 12 years of consistently formatted results. Idaho has multiple files covering different reporting levels.

But that’s not every state. Some have made the transition from producing PDFs to CSVs, such as West Virginia. Others, like Mississippi, basically provide a picture of the results. For states where data isn’t the norm, OpenElections needs to fill the gap, turning results into data.

This isn't a glamorous job, but we'd like to tell you a little about how we go about it. For states that provide electronically generated PDFs, like West Virginia does for elections from 2000-2006, there are several good options for parsing PDF tables into data. The command-line utility pdftotext from the Xpdf package works well in many cases, while the excellent Tabula (a product of ProPublica, La Nación and the Knight-Mozilla OpenNews project) can do wonders with more complex files. For West Virginia, pdftotext was all we needed (along with some search and replace in a text editor) to make CSV files from the original PDFs. Here's an example command that generates a fixed-width text file while preserving the rows and column layout of the original file:

$ pdftotext -layout "2000 House of Rep Pri.pdf"

We used TextWrangler, a free text editor for the Mac, to convert the spaces between columns into tabs, and from there it was trivial to copy results into CSV files. In the process of converting these results, we found several apparent errors (typos or likely copy-and-paste mistakes) and notified the Secretary of State's office. To its credit, the office responded quickly and is in the process of fixing the original files (and we'll update our CSVs when it does).
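That search-and-replace step can also be scripted. Here's a minimal sketch that splits each line of the pdftotext output on runs of two or more spaces and writes CSV (the output filename is illustrative):

    import csv
    import re

    # pdftotext -layout writes "2000 House of Rep Pri.txt" next to the PDF.
    with open("2000 House of Rep Pri.txt") as src, \
            open("wv_2000_house_primary.csv", "w", newline="") as dest:
        writer = csv.writer(dest)
        for line in src:
            # Two or more spaces separate columns in -layout output.
            fields = re.split(r" {2,}", line.strip())
            if fields != [""]:
                writer.writerow(fields)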

In Mississippi, however, there are no programmatic options, or at least no good ones. Data entry is the best way for us to get precinct-level results that are contained in county-by-county files like this one. Here's what we're dealing with: a scanned image of a fax:

[Screenshot: scanned fax of Mississippi precinct results]

When it comes to doing data entry, we need to be very specific about what we want and how we want it stored in the CSV file. For our Mississippi files, we've developed a guide to the process that we'll adapt to other states where manual entry is required. Which is where you come in, if you're up for it. If you'd like to try your hand at a Mississippi file (or another state with PDFs), let us know in the Google Group and we can get you set up. Or you can fork the Mississippi results repository on GitHub and send us an email per the instructions in the README file.

We know that data entry is neither fun nor exciting (well, most of the time), but think of this: you’ll be part of a project that will provide a great service to journalists, researchers and citizens. And we still have some t-shirts left, too.

NICAR 14 Hackathon

By Sara Schnadt

This year's NICAR conference was an especially great experience for me. Having spent the past year working remotely with volunteers around the country to develop the groundwork for the OpenElections project, I finally met many of those volunteers in person, featured them in our project update session, and worked alongside them at our day-long hackathon on the last day of the conference. All of this made working on the project so much more meaningful.

From Sandra Fish passing me in the throngs of between-session milling to excitedly hand off a CD of Colorado results data, to Ed Borasky telling me over our computers at the hackathon that there is a large and close-knit network of journalists in his local Portland area who would be very supportive of our work, to noticing after many hours of working together that our own Derek Willis and Nolan Hicks have very similar senses of humor, NICAR was a great and constructive convergence of OpenElections supporters.

We were also very pleased to have new volunteers join us for the hackathon, including NPR's Jeremy Bowers and Chicago hacker Nick Bennett, who helped with scraper writing and data processing. Bloomberg designer Chloe Whiteaker and civic dev extraordinaire Margie Roswell also blithely drafted us a new public-facing site in a matter of hours. And then there was Bloomberg visualization dev Julian Burgess, who spent most of the day with us, at first trying his hand at learning Python just so he could pitch in, then giving an in-depth assessment of our interface and data acquisition strategies. I am new to digital journalism as of this past year, and I have to say I am very taken by the generosity, talent, and character of the people in this space.

More than anything else, meeting all these great folks in person brought home just how important it is to digital journalists to create new civic infrastructure where it doesn't already exist, and to see how invested you all are in seeing this project succeed. During our session, 'OpenElections: A Year in Review,' in addition to a detailed update on our progress gathering metadata with our small army of volunteers and defining a core results data scraper spec, there were spirited discussions about the technical nuances and interesting challenges of our system architecture. These challenges are inherent in taking a motley and wildly varied collection of individual states' election results archiving methods and creating a new, clean, systematic, national infrastructure. The interest and investment were palpable in the room.

From all of this, it was clear that we are on the right track, and we left with new motivation, support, perspective, talent and stamina to bring the project home in our second year!


By Derek Willis

As much as we’d love to be able to report that election agencies only deliver results in structured data files, that’s not the case. In many states we’re fortunate to find electronically-generated PDFs, which at least contain the promise of data if not always the ease of access.

Take New Mexico, for example. The Secretary of State’s office provides precinct-level results files for each county, including absentee ballots. But the files are PDFs, and there’s another catch: although the results are typical rows and columns as you might find in a spreadsheet, the column headers listing the candidate names are vertically aligned, like so:

[Screenshot: Bernalillo County results PDF with vertically aligned candidate names]

That causes some issues when copying and pasting directly from the PDF or when using a utility such as pdftotext, which converts electronic PDFs into text files while attempting to preserve the layout. The vertical alignment issue will require a certain amount of manual intervention, but the question is: how much?

We’ve been looking at Tabula, an open-source project created by Manuel Aristarán with the support of ProPublica, La Nación DATA and Knight-Mozilla OpenNews. It helps extract structured data from PDFs, but allows some user control over the process. Tabula users draw a box around the area of the PDF they want to extract, and then can copy the table as a CSV or tab-delimited text file.

When I learned about the PDF Liberation Hackathon being held in several cities in January, I gently asked on Twitter what might be possible with a New Mexico results file. Turns out, quite a lot.

Jeremy Merrill, one of Tabula’s developers, responded that Tabula could extract the data, but had the same issue with the vertical headers. Unlike other text extraction utilities, however, Tabula could work around the headers, because it allows you to select an area of the PDF that you want to extract. So I took that route, drawing a box over just the data from the presidential race.

[Screenshot: Tabula's selection interface]

The results are impressive, and Tabula makes it easy to grab or download the data:

[Screenshot: Tabula's extracted results]

Some issues remain: the empty columns in the PDF seem to trip Tabula up, so extracting all of the data on each page could require drawing multiple boxes. But that's still better than trying to clean up problematic text dumped out randomly with the data columns interspersed. Tabula may be especially useful with results files at the local level, which often are published as PDFs.

As we work with the data from more states, we’re finding that many will have their own challenges. As a result, we’re not only going to extract the data that we need, we’re also adding a state-specific file to our main repository that will explain the steps we took to retrieve and process the results. That will include listing any software we used, so that people who want to replicate or check our processes can retrace our steps as closely as possible.

Some states that use PDFs will be easier to work with, while others will be much harder. It’s nice to know that there are open source tools that are getting more sophisticated and useful, and that we can make them a part of our effort.