FEC Dataset
## 'data.frame': 167259 obs. of 18 variables:
## $ cmte_id : Factor w/ 24 levels "C00458844","C00500587",..: 15 15 15 15 15 6 12 7 6 7 ...
## $ cand_id : Factor w/ 24 levels "P00003392","P20002671",..: 23 23 23 23 23 1 15 12 1 12 ...
## $ cand_nm : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 22 22 22 22 22 4 14 19 4 19 ...
## $ contbr_nm : Factor w/ 44968 levels "'CALLAHAN, PAMELA",..: 35963 35967 35975 33864 33867 17322 8617 16139 36641 25134 ...
## $ contbr_city : Factor w/ 1360 levels " BATAVIA","45320",..: 244 708 248 231 193 274 589 274 596 231 ...
## $ contbr_st : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
## $ contbr_zip : int 45315 44060 44106 45208 44721 432141210 441071232 432022420 450365038 45249 ...
## $ contbr_employer : Factor w/ 13494 levels "","-","(SCHOOL)",..: 5791 9973 5791 1283 10550 5791 3169 8464 9973 10547 ...
## $ contbr_occupation: Factor w/ 6691 levels "","-"," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 2889 5134 2889 3503 5465 2889 3209 3884 6054 403 ...
## $ contb_receipt_amt: num 97.1 53.5 69.4 88.4 -80 ...
## $ contb_receipt_dt : Factor w/ 685 levels "01-Apr-15","01-Apr-16",..: 510 21 374 327 285 244 675 120 605 100 ...
## $ receipt_desc : Factor w/ 25 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ memo_cd : Factor w/ 2 levels "","X": 2 2 2 2 2 2 1 1 2 1 ...
## $ memo_text : Factor w/ 89 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 19 1 10 19 10 ...
## $ form_tp : Factor w/ 3 levels "SA17A","SA18",..: 2 2 2 2 2 2 1 1 2 1 ...
## $ file_num : int 1146165 1146165 1146165 1146165 1146165 1091718 1144564 1077404 1091718 1077404 ...
## $ tran_id : Factor w/ 166816 levels "A0000FD2A304E432AAD5",..: 108798 118153 112319 108928 120998 54281 166784 144656 54504 144411 ...
## $ election_tp : Factor w/ 4 levels "","G2016","O2016",..: 2 2 2 2 2 4 4 4 4 4 ...
## [1] 0
## [1] 3
The FEC dataset was obviously large (167,259 records by 18 columns), so I looked through the columns to determine what they were referencing and decide whether I needed them all.
The first three columns (cmte_id, cand_id, cand_nm) were all factor/categorical variables with the same number of levels (24). Essentially, they were three different ways of capturing the same information - which candidate is receiving the money? I really only needed one of these, so I decided to keep the one that was easiest for me to use, the candidate’s name (cand_nm).
The data also had three location columns - contbr_city, contbr_st, contbr_zip. These referred to the city of the contributor’s city, their state and their zipcode. Because the data is from Ohio, the state information was the same for each record. I was planning on using the zipcode information (contbr_zip) for matching (even though it had three missing values), so I needed to keep that. However, the names of the cities are much easier to understand when classifying votes to a location, so I decided to keep both.
I did notice that some of the zipcodes did have values longer than 5 (typical length of a US zipcode) and knew that I would need to investigate why this was the case.
The dataset also provides employment information about the contributor’s employer (contbr_employer) and their occupation(contbr_occupation). While this could be interesting information for another analysis, I was not interested in exploring the characteristics of the contributors to achieve my goal, so these were not needed.
There were also columns for the amount/value (in dollars) of the contribution (contb_receipt_amt), and the date (contb_receipt_dt). These were definitely important pieces of information that I wanted to keep. The date information was not in a date format (which I would need if I wanted to comparisons over time), so this would need to be changed.
The form type (form_tp) and file number (file_num) didn’t seem important for my purposes.
I wasn’t sure what the election_tp
column was referring to, so I explored it.
##
## G2016 O2016 P2016
## 393 59160 20 107686
I discovered that it referred to election type. It used three different codes (G2016, P2016, O2016) to refer to the type of election to which the contribution was made. Some records (393) did not have election type information.
The codes refer to one of three types of elections towards which the contributions could be made.
- G2016: General Election - an election to fill public offices (in our case the election of the presidential candidates)
- P2016: Primary Election - an election prior to the general election in which voters select the candidates who will run on each party’s ticket
- O2016: Open Primary - a subset of a Primary Election where voters to choose on Election Day the party primary for which they wish to vote
I decided that it was important to keep the information for both General and Primary elections, and did not keep any information that didn’t have this assigned. I also decided to drop records for contributions to a Open Primary because I felt that the interpretation of this information could be more difficult. (Whereas for the regular General and Primary election codes, the intended party of the support is clear)
There were a number of other columns - receipt_desc, memo_cd, memo_text and tran_id. Of these, the only one that I decided was relevant was transaction ID (tran_id) to help confirm again that there were no duplicate records.
Based on my exploration, I decided to include the following columns in the final dataset.
- cand_nm
- contbr_nm
- contbr_city
- contbr_zip
- contb_receipt_amt
- contb_receipt_dt
- tran_id
- election_tp
I then completed all of the subsetting that was needed to select the information I had chosen to include, and changed the data formats that I had identified needed modification.
## [1] 226
## 'data.frame': 166620 obs. of 8 variables:
## $ cand_nm : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ contbr_nm : Factor w/ 44968 levels "'CALLAHAN, PAMELA",..: 44229 27486 14945 30460 19462 27603 17173 17173 30646 23830 ...
## $ contbr_city : Factor w/ 1360 levels " BATAVIA","45320",..: 346 130 1356 989 274 818 833 833 1289 231 ...
## $ contbr_zip : int 430168453 44512 445121402 442661010 432051112 458698604 446631328 446631328 450697039 452192420 ...
## $ contb_receipt_amt: num 250 500 250 500 250 ...
## $ contb_receipt_dt : Date, format: "2014-07-17" "2014-09-18" ...
## $ tran_id : Factor w/ 166816 levels "A0000FD2A304E432AAD5",..: 1410 1785 4606 23 588 1028 3266 1265 5254 2796 ...
## $ election_tp : Factor w/ 4 levels "","G2016","O2016",..: 4 4 4 4 4 4 4 4 4 4 ...
With the removal of the undesired data, an additional test of duplicates revealed 226 duplicated values. These were also removed. This reduced the number of columns to 8 and the number of observations to 166,620.
To investigate zipcodes with more than five digits, I read this article. I discovered that the long digits do include relevant information but because geographical distances were not planned to be used in predictions I decided that I didn’t need that level of precision and decided to only use the first five digits.
I also noticed that there were negative values for some contributions and decided to investigate using Python (as I found it easier to manipulate the data in a Jupyter Notebook).
Based on the information in receipt_desc
these values were for refunds or reassignments to a spouse or a different election (e.g. from Primary to General).
I was originally planning to remove only the refunded transactions and their corresponding initial contributions but discovered that the refunding of transactions was more complicated than I expected. At times there was a straight refund, but for most transactions, multiple steps were involved in the refunding. These processes included reassigning the funds to a diferent election (e.g. from primary to general election, or from general election to a senate race), or reassigning to a spouse. In the majority of cases the value of the refund did not match the initial value and so correctly matching refunds to transactions would have been a challenging task.
I compared the spread of this data for the size of the contributions per candidate between the full dataset and the refunded donations dataset. I decided that there was sufficient similarity in this information to simply remove all records with names of contributors with refunds (or contributions of 0). I used Python to conduct the analysis but removed the unnecessary data with R.
This reduced the number of records by approximately 10,000 records (5.8% of the total).
I also decided to add a column for the candidates’ party alignment (because voting patterns often follow party alignment).
## Factor w/ 3 levels "democrat","Other",..: 3 3 3 3 3 3 3 3 3 3 ...
I then added the zipcode information to the main dataset.
## [1] 0.0007073984
This left some records with missing coordiante information, but the zipcodes were not clearly identifiable as coming from Ohio and were a very low proportion of the data (Less than 0.1%). I decided to remove them, this left 156,815 contribution records.
I then converted the county names to lower case and the names of all of the columns in the dataset to lower case.
I called this cleaned version of the original dataset ohio.2016.
Countyzip Dataset
I converted the county names to lower case and merged with the ohio.2016 dataset.
I then merged the election dataset with the countzip dataset on county (after changing the election county information to lower case as well).
## county candidate percent_vote count_vote ZIP.Code
## 1 adams Trump, Donald J. 76.3 8445 45105
## 2 adams Clinton, Hillary Rodham 20.7 2293 45105
## 3 adams Johnson, Gary 2.0 220 45105
## 4 adams Duncan, R 0.6 62 45105
## 5 adams Stein, Jill 0.4 43 45105
## 6 adams Trump, Donald J. 76.3 8445 45144
## 7 adams Clinton, Hillary Rodham 20.7 2293 45144
## 8 adams Johnson, Gary 2.0 220 45144
## 9 adams Duncan, R 0.6 62 45144
## 10 adams Stein, Jill 0.4 43 45144
This resulted in two final datasets, the main dataset ohio.2016 with 156,803 records and the following columns:
- contbr_zip: 5 digit contributor zipcode
- cand_nm: the name of the candidate they contributed to
- contbr_nm: the name of the contributor
- contbr_city: the city of the contributor
- contb_receipt_amt: the amount of the contribution
- contb_receipt_date: the date of the contribution
- tran_id: transaction id, primary to ensure unique transactions
- election_tp: election type - general election, or primary
- cand_party: the party of the candicate as democrat, republican or other
- lat: latitude coordinates for the zipcode
- long: longditude coordinate for the zipcode
- county: county of the contributor
and, the results dataset with 6,655 rows and the following columns:
- county: the reporting county
- candidate: the names of the five presidential candidates
- percent_vote: the percentage of the vote that they won for the county
- count_vote: the count of the votes that they won for the county
- ZIP.Code: the zipcodes for each county
The the results dataset is arranged in long-format so that there are multiple entries for each zipcode, that provide the county and candidate information. If I had merged this information with the ohio.2016 dataset it would have duplicated the individual contributions to match with the zipcodes. I didn’t want this to occur so I kept them separate for the time being.