Voter Registration Capstone Part 1

Christopher Johnson
Mar 6, 2018

This is the first part of a multi-post series about my capstone project for the Data Science Immersive program I am currently in at General Assembly. Everyone in the program does a capstone to wrap up everything we have learned in the class. For my capstone, I am going to compare election results to the following year’s voter registration numbers in the state of Colorado. I have been turning this idea over in my head for about a month and am now starting to spend the majority of my time on it. During this series, I will walk step by step through my thought process as I work on my first solo data science project.

Project selection:

I knew I wanted to do something in the political realm but did not know exactly what question I would ask. I reached out to a General Assembly instructor out of D.C. who had some experience working with data for political campaigns. He pointed me toward voter registration information, and the voter registration file in particular. I was able to get the Colorado voter registration file online for free, which I will explain later in this post, and started to play around with it. The two columns that piqued my interest were “EFFECTIVE_DATE” and “REGISTRATION_DATE.” “EFFECTIVE_DATE” is defined as the “Date of last change to the voter record affecting one or more of the following: residential address, party affiliation, status.” “REGISTRATION_DATE” is defined as the “Date of registration within the county.” What I found most interesting was whether, and how often, these two dates differed. My next thought was to find out how much detail I could get about voter registration numbers. I went to the Colorado Secretary of State’s website and was able to find everything I was looking for. There is a breakdown of voter registration numbers by county or district for every month of every year going back to 2004. This led me to ask how those numbers changed over time and what factors could influence them to change.
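A minimal sketch of that date comparison might look like the following. It uses a tiny made-up sample rather than the real file, and it assumes the dates are stored as MM/DD/YYYY strings; the `VOTER_ID` column and the helper name `date_gap_summary` are hypothetical, added only for illustration.

```python
import pandas as pd

# Made-up sample rows; the real file has millions of records.
# VOTER_ID is a hypothetical identifier column, and the MM/DD/YYYY
# date format is an assumption about how the file stores dates.
sample = pd.DataFrame({
    "VOTER_ID": [1, 2, 3, 4],
    "REGISTRATION_DATE": ["01/15/2008", "10/03/2012", "06/20/2016", "09/01/2016"],
    "EFFECTIVE_DATE": ["01/15/2008", "02/11/2015", "06/20/2016", "11/09/2016"],
})


def date_gap_summary(df, date_format="%m/%d/%Y"):
    """Summarize how often EFFECTIVE_DATE differs from REGISTRATION_DATE."""
    # Unparseable dates become NaT instead of raising an error.
    reg = pd.to_datetime(df["REGISTRATION_DATE"], format=date_format, errors="coerce")
    eff = pd.to_datetime(df["EFFECTIVE_DATE"], format=date_format, errors="coerce")

    gap_days = (eff - reg).dt.days
    differs = gap_days.notna() & (gap_days != 0)

    return {
        "share_differing": differs.mean(),
        "median_gap_days": gap_days[differs].median(),
    }


summary = date_gap_summary(sample)
print(summary)
```

On the real file, the same function could be pointed at the full DataFrame once the actual date format has been confirmed.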

Now I had two questions that were connected but distinct enough to explain separately. All I still needed was a factor that could cause registration numbers to change. What is the biggest and most quantifiable aspect of politics? The election. I am going to see how election results affect the following year’s registration numbers. That was it. I had my capstone project.

Data Collection and Minor Exploration:

The first data file I collected was the Colorado voter registration file from August of 2017. I found this for free online at http://coloradovoters.info/. They have Colorado voter registration files going back to 2013 pulled a few times a year. The Colorado voter registration file is a file of every single voter in Colorado, roughly 3.7 million people. There are 50 possible data entries about each person in the file. The columns in the data set are in the picture below.

Pulling column names out of a Pandas DataFrame

Not all of these are filled. For example, most people do not have 3 mailing addresses, so “MAIL_ADDR2” and “MAIL_ADDR3” are blank. When I get into serious EDA and feature creation, I will be dropping most of these columns and focusing on only a few of them.
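One way to decide which columns to drop is to measure how empty each one is. The sketch below is only an illustration on a made-up sample: `MAIL_ADDR2` and `MAIL_ADDR3` come from the file, while `LAST_NAME`, the helper name `drop_sparse_columns` and the 50% threshold are assumptions.

```python
import pandas as pd

# Made-up sample. LAST_NAME is a hypothetical column name; blank fields
# are represented as None, which is how pandas.read_csv would load
# empty cells by default (as missing values).
sample = pd.DataFrame({
    "LAST_NAME": ["Smith", "Lopez", "Nguyen", "Brown"],
    "MAIL_ADDR2": [None, None, "Apt 4", None],
    "MAIL_ADDR3": [None, None, None, None],
})


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    # isna() gives True/False per cell; the column mean is the missing share.
    missing_share = df.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df[keep], missing_share


trimmed, missing_share = drop_sparse_columns(sample)
print(missing_share)
print(list(trimmed.columns))
```

The threshold is arbitrary here; in practice the choice of columns would also depend on which ones matter for the question being asked.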

It has been fun to search around in this file. I have been looking up my friends and family for the pure enjoyment of looking at their information and telling it to them. Some of them thought it was a little creepy. I found out that one of my friends does not use his middle name for his registration and my Grandpa’s best friend only uses his middle initial as his middle name.

The next data I came across is from the Colorado Secretary of State’s website, http://www.sos.state.co.us/. This data includes registration stats for each county, broken down by status (active, inactive and pre-reg) and by political party in Colorado. Below is what the header of the Excel file looks like.

This extends for all 64 Colorado counties.

There is an Excel file like this one for every month of the year. I will be using these to pull the registration numbers for Democrats, Republicans, and Independents for every month. I am going to lump the other political parties in with Independents to simplify my models a little. Because these Excel sheets are formatted for presentation, I am going to need to write a function to turn this information into something that Pandas can turn into a DataFrame. I will be going through that process in future blog posts.
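The lumping step could be sketched roughly as follows, assuming the spreadsheet has already been reshaped into one row per county and party. The column names, the party codes, the counts and the helper name `lump_parties` are all hypothetical, not taken from the Secretary of State’s files.

```python
import pandas as pd

# Made-up counts in a hypothetical long format: one row per county/party.
# The party codes (DEM, REP, UAF, LBR, GRN) are illustrative assumptions.
counts = pd.DataFrame({
    "county": ["Adams", "Adams", "Adams", "Adams", "Boulder", "Boulder", "Boulder"],
    "party": ["DEM", "REP", "UAF", "LBR", "DEM", "REP", "GRN"],
    "active": [100, 90, 120, 5, 200, 60, 8],
})

MAJOR_PARTIES = {"DEM": "Democrat", "REP": "Republican"}


def lump_parties(df):
    """Collapse every party other than the two major ones into 'Independent'."""
    # map() leaves unmapped codes as NaN, which fillna() relabels.
    group = df["party"].map(MAJOR_PARTIES).fillna("Independent")
    grouped = df.assign(group=group)
    # One row per county, one column per party group.
    return grouped.groupby(["county", "group"])["active"].sum().unstack(fill_value=0)


wide = lump_parties(counts)
print(wide)
```

Keeping the mapping in a single dictionary makes it easy to change later, for example if a third party turns out to be large enough to model on its own.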

The last data I know I need to collect is county-level results for the top-of-the-ticket elections in Colorado. The Daily Kos has this broken out nicely in an exportable format that I will be able to use. I am going to focus on the 2012, 2014 and 2016 elections because they have big top-of-the-ticket names on them and tend to drive a lot of conversation and momentum for politics afterward.
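Once both sources are in tabular form, pairing each election with the following year’s registration numbers comes down to a join on county and year. The sketch below shows one possible shape for that join; every column name and number in it is made up for illustration, as is the helper name `pair_with_following_year`.

```python
import pandas as pd

# Made-up election results, one row per county and election year.
elections = pd.DataFrame({
    "county": ["Adams", "Boulder"],
    "election_year": [2016, 2016],
    "dem_share": [0.50, 0.70],
})

# Made-up registration totals, one row per county and year.
registration = pd.DataFrame({
    "county": ["Adams", "Boulder", "Adams"],
    "year": [2017, 2017, 2016],
    "dem_registered": [1000, 2000, 950],
})


def pair_with_following_year(elections, registration):
    """Attach the registration numbers from the year after each election."""
    # Shift the election year forward by one so it lines up with
    # the registration year it is meant to be compared against.
    shifted = elections.assign(year=elections["election_year"] + 1)
    # validate= makes pandas raise if a county/year pair is duplicated.
    return shifted.merge(
        registration, on=["county", "year"], how="left", validate="one_to_one"
    )


paired = pair_with_following_year(elections, registration)
print(paired)
```

A left join keeps every election row even when the matching registration row is missing, which makes gaps in the data easy to spot.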

Now that I know where most of my data is going to come from, the next step is to get the data into a form that can be used in Python, load it into a DataFrame or two, and then start major EDA and feature creation.

Christopher Johnson

Data Scientist and Data Analyst trying out new techniques and always exploring new datasets. https://www.linkedin.com/in/christopher-johnson/