1 Introduction

On the evening of November 8, 2016, many people in North America were surprised as it became increasingly clear that Donald Trump would convincingly win the election for President of the United States. It wasn’t just people sitting in their armchairs at home; political pundits were surprised too, many of whom had talked about Hillary Clinton leading the polls as the election date drew closer.

One question that has been asked since is why so many political analysts got their predictions wrong. Was the information necessary to make the correct predictions available but simply ignored?

To answer this question, I turned to Ohio. Ohio is considered a bellwether state for presidential elections. Of all US states, it has the longest running streak of backing the elected candidate in a presidential race, with a perfect record since 1964. It also has the highest percentage of correct backings since 1896 and, on average, the lowest deviation from the national results in vote percentage differences.

Within Ohio, there are also bellwether counties:

Ottawa County - one miss since 1948 (in 1960), perfect since 1964
Wood County - one miss since 1964 (in 1976), perfect since 1980
Lake County - two misses since 1952 (in 1992 and 2012)
Stark County - two misses since 1964 (in 1976 and 2004)
Sandusky County - three misses since 1952 (in 1960, 1976 and 1992)
Tuscarawas County - three misses since 1912 (in 1960, 1968, and 2012)

As a bellwether state, can patterns in Ohio contributions to the presidential election campaigns give us information that would improve predictions of candidate success?

That’s what I plan to find out! Sit back, strap on your seat belt, and let's take a journey to see what we can discover in this world of intrigue! :D (Also, if you're looking for a slightly more 'light-touch' version of this journey, you can probably skip down to the Multivariate Exploration using the navigation bar because that's where things start to get really interesting.)

1.1 The Datasets

For those of you that are here for the full journey, let's dig in! To start off, I tracked down a number of different datasets to help me explore the question, "Did we have available data to predict the results of the 2016 presidential election?"

The first one came straight from the Federal Election Commission’s data on 2016 Presidential Campaign Finances.

## [1] 167259     18

## 'data.frame':    167259 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 24 levels "C00458844","C00500587",..: 15 15 15 15 15 6 12 7 6 7 ...
##  $ cand_id          : Factor w/ 24 levels "P00003392","P20002671",..: 23 23 23 23 23 1 15 12 1 12 ...
##  $ cand_nm          : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 22 22 22 22 22 4 14 19 4 19 ...
##  $ contbr_nm        : Factor w/ 44968 levels "'CALLAHAN, PAMELA",..: 35963 35967 35975 33864 33867 17322 8617 16139 36641 25134 ...
##  $ contbr_city      : Factor w/ 1360 levels " BATAVIA","45320",..: 244 708 248 231 193 274 589 274 596 231 ...
##  $ contbr_st        : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  45315 44060 44106 45208 44721 432141210 441071232 432022420 450365038 45249 ...
##  $ contbr_employer  : Factor w/ 13494 levels "","-","(SCHOOL)",..: 5791 9973 5791 1283 10550 5791 3169 8464 9973 10547 ...
##  $ contbr_occupation: Factor w/ 6691 levels "","-"," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 2889 5134 2889 3503 5465 2889 3209 3884 6054 403 ...
##  $ contb_receipt_amt: num  97.1 53.5 69.4 88.4 -80 ...
##  $ contb_receipt_dt : Factor w/ 685 levels "01-Apr-15","01-Apr-16",..: 510 21 374 327 285 244 675 120 605 100 ...
##  $ receipt_desc     : Factor w/ 25 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 2 2 2 2 1 1 2 1 ...
##  $ memo_text        : Factor w/ 89 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 19 1 10 19 10 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 2 2 2 2 1 1 2 1 ...
##  $ file_num         : int  1146165 1146165 1146165 1146165 1146165 1091718 1144564 1077404 1091718 1077404 ...
##  $ tran_id          : Factor w/ 166816 levels "A0000FD2A304E432AAD5",..: 108798 118153 112319 108928 120998 54281 166784 144656 54504 144411 ...
##  $ election_tp      : Factor w/ 4 levels "","G2016","O2016",..: 2 2 2 2 2 4 4 4 4 4 ...

##     cmte_id   cand_id                 cand_nm       contbr_nm
## 1 C00580100 P80001571        Trump, Donald J.      SELL, GREG
## 2 C00580100 P80001571        Trump, Donald J.     SELLE, JOAN
## 3 C00580100 P80001571        Trump, Donald J.    SELLERS, JES
## 4 C00580100 P80001571        Trump, Donald J.  ROOTRING, BEAU
## 5 C00580100 P80001571        Trump, Donald J. ROPE, SANDRA MS
## 6 C00575795 P00003392 Clinton, Hillary Rodham     HILSON, ANN
##     contbr_city contbr_st contbr_zip                 contbr_employer
## 1       CLAYTON        OH      45315           INFORMATION REQUESTED
## 2        MENTOR        OH      44060                         RETIRED
## 3 CLEVELAND HTS        OH      44106           INFORMATION REQUESTED
## 4    CINCINNATI        OH      45208 BGR CONSUMER UNDERSTANDING, LLC
## 5        CANTON        OH      44721                   SELF-EMPLOYED
## 6      COLUMBUS        OH  432141210           INFORMATION REQUESTED
##       contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc
## 1 INFORMATION REQUESTED             97.07        23-Sep-16
## 2               RETIRED             53.54        01-Sep-16
## 3 INFORMATION REQUESTED             69.40        17-Oct-16
## 4       MARKET RESEARCH             88.37        15-Nov-16
## 5         SELF-EMPLOYED            -80.00        13-Sep-16
## 6 INFORMATION REQUESTED             40.00        12-Apr-16
##   memo_cd              memo_text form_tp file_num     tran_id election_tp
## 1       X                           SA18  1146165 SA18.102871       G2016
## 2       X                           SA18  1146165 SA18.165861       G2016
## 3       X                           SA18  1146165 SA18.143767       G2016
## 4       X                           SA18  1146165 SA18.110145       G2016
## 5       X                           SA18  1146165 SA18.203975       G2016
## 6       X * HILLARY VICTORY FUND    SA18  1091718    C4715778       P2016

## [1] 0

It had information on 167,259 contributions made to candidates within the 2016 election cycle across 18 different variables. I did an initial check to confirm that there were no duplicate records and none were found.
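A duplicate check of this kind can be sketched in a few lines of base R. The data frame below is a tiny invented stand-in for the FEC extract (the name `fec` and its rows are assumptions for illustration, not the original code).

```r
# Toy stand-in for the FEC extract; these rows are invented for illustration.
fec <- data.frame(
  cand_nm = c("Trump, Donald J.", "Trump, Donald J.", "Clinton, Hillary Rodham"),
  contbr_nm = c("SELL, GREG", "SELLE, JOAN", "HILSON, ANN"),
  contb_receipt_amt = c(97.07, 53.54, 40.00),
  stringsAsFactors = FALSE
)

# Number of rows and columns.
dim(fec)

# duplicated() flags rows that are identical to an earlier row,
# so the sum is the number of redundant records.
n_duplicates <- sum(duplicated(fec))
n_duplicates
```

A result of 0 means every row differs from all earlier rows in at least one column.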

The second dataset came from this site that connects zipcodes to latitude and longitude values. I wanted this dataset because I was hoping to do some geographic mapping of contributions. The dataset from the FEC had zipcodes, but I needed to connect these to latitude and longitude coordinates.

## 'data.frame':    42522 obs. of  12 variables:
##  $ Zipcode            : int  705 610 611 612 601 631 602 603 604 605 ...
##  $ ZipCodeType        : Factor w/ 4 levels "MILITARY","PO BOX",..: 3 3 2 3 3 2 3 3 2 2 ...
##  $ City               : Factor w/ 18758 levels "AARONSBURG","ABBEVILLE",..: 108 363 388 506 73 2581 96 97 97 97 ...
##  $ State              : Factor w/ 62 levels "AA","AE","AK",..: 48 48 48 48 48 48 48 48 48 48 ...
##  $ LocationType       : Factor w/ 1 level "PRIMARY": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lat                : num  18.1 18.3 18.3 18.4 18.2 ...
##  $ Long               : num  -66.3 -67.1 -66.8 -66.7 -66.7 ...
##  $ Location           : Factor w/ 30127 levels "","AF-DJ-CAMP LEMONIER",..: 23640 23641 23642 23643 23635 23655 23636 23637 23637 23637 ...
##  $ Decommisioned      : Factor w/ 2 levels "false","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TaxReturnsFiled    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ EstimatedPopulation: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TotalWages         : int  NA NA NA NA NA NA NA NA NA NA ...

##   Zipcode ZipCodeType     City State LocationType   Lat   Long
## 1     705    STANDARD AIBONITO    PR      PRIMARY 18.14 -66.26
## 2     610    STANDARD   ANASCO    PR      PRIMARY 18.28 -67.14
## 3     611      PO BOX  ANGELES    PR      PRIMARY 18.28 -66.79
## 4     612    STANDARD  ARECIBO    PR      PRIMARY 18.45 -66.73
## 5     601    STANDARD ADJUNTAS    PR      PRIMARY 18.16 -66.72
## 6     631      PO BOX CASTANER    PR      PRIMARY 18.19 -66.82
##            Location Decommisioned TaxReturnsFiled EstimatedPopulation
## 1 NA-US-PR-AIBONITO         false              NA                  NA
## 2   NA-US-PR-ANASCO         false              NA                  NA
## 3  NA-US-PR-ANGELES         false              NA                  NA
## 4  NA-US-PR-ARECIBO         false              NA                  NA
## 5 NA-US-PR-ADJUNTAS         false              NA                  NA
## 6 NA-US-PR-CASTANER         false              NA                  NA
##   TotalWages
## 1         NA
## 2         NA
## 3         NA
## 4         NA
## 5         NA
## 6         NA

The dataset did have a lot of interesting information, but rather than getting caught in the weeds, I decided to stick with just incorporating the coordinate data into the FEC data. I also double-checked that the data was in a usable format. The zipcodes were stored as integers, the same type as the zipcodes in the FEC dataset, so I could match the two datasets on zipcode without any conversion. The coordinates were decimal numbers, which is what I would need for plotting.

The next dataset I needed was one containing the General Election results for each of the Ohio counties. I created this myself borrowing the data from Politico.

## [1] 440   4

## 'data.frame':    440 obs. of  4 variables:
##  $ county      : Factor w/ 88 levels "Adams","Allen",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ candidate   : Factor w/ 5 levels "Clinton, Hillary Rodham",..: 5 1 3 2 4 5 1 3 4 2 ...
##  $ percent_vote: num  76.3 20.7 2 0.6 0.4 66.9 28.7 3.2 0.7 0.5 ...
##  $ count_vote  : int  8445 2293 220 62 43 29858 12815 1440 310 213 ...

##   county               candidate percent_vote count_vote
## 1  Adams        Trump, Donald J.         76.3       8445
## 2  Adams Clinton, Hillary Rodham         20.7       2293
## 3  Adams           Johnson, Gary          2.0        220
## 4  Adams               Duncan, R          0.6         62
## 5  Adams             Stein, Jill          0.4         43
## 6  Allen        Trump, Donald J.         66.9      29858

This dataset lists, for each of the 88 counties, the five candidates who ran for president in 2016, the number of votes each received in the general election in that county, and the percentage of the county’s vote this equated to (88 counties x 5 candidates = 440 rows). I kept both the count and the percentage information because I wasn’t sure which would be most useful. Because I built it, I made sure that the candidates’ names were in the same format as in the FEC dataset so that I could match on candidate names across datasets.

The final dataset allowed me to connect election results information to the FEC data. The FEC data had zipcodes but not counties, and the elections results data only had counties. I used a site that listed all of the zipcodes associated with each county so that I could use this information to link the FEC data to the election results.

## [1] 2790    3

## 'data.frame':    2790 obs. of  3 variables:
##  $ City    : Factor w/ 2682 levels "Aberdeen","Academia",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ZIP.Code: Factor w/ 1329 levels "43001","43002",..: 912 41 1252 854 938 375 369 1139 870 71 ...
##  $ County  : Factor w/ 88 levels "Adams","Allen",..: 8 42 33 70 1 60 60 27 31 71 ...

##           City ZIP.Code    County
## 1     Aberdeen    45101     Brown
## 2     Academia    43050      Knox
## 3          Ada    45810    Hardin
## 4       Adairo    44878  Richland
## 5 Adams County    45144     Adams
## 6  Adams Mills    43821 Muskingum

All of the county names from this site were capitalized, so I would need to convert the county names in each dataset to the same case before matching on them.

2 Data Cleaning

2.1 FEC Dataset

## 'data.frame':    167259 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 24 levels "C00458844","C00500587",..: 15 15 15 15 15 6 12 7 6 7 ...
##  $ cand_id          : Factor w/ 24 levels "P00003392","P20002671",..: 23 23 23 23 23 1 15 12 1 12 ...
##  $ cand_nm          : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 22 22 22 22 22 4 14 19 4 19 ...
##  $ contbr_nm        : Factor w/ 44968 levels "'CALLAHAN, PAMELA",..: 35963 35967 35975 33864 33867 17322 8617 16139 36641 25134 ...
##  $ contbr_city      : Factor w/ 1360 levels " BATAVIA","45320",..: 244 708 248 231 193 274 589 274 596 231 ...
##  $ contbr_st        : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  45315 44060 44106 45208 44721 432141210 441071232 432022420 450365038 45249 ...
##  $ contbr_employer  : Factor w/ 13494 levels "","-","(SCHOOL)",..: 5791 9973 5791 1283 10550 5791 3169 8464 9973 10547 ...
##  $ contbr_occupation: Factor w/ 6691 levels "","-"," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 2889 5134 2889 3503 5465 2889 3209 3884 6054 403 ...
##  $ contb_receipt_amt: num  97.1 53.5 69.4 88.4 -80 ...
##  $ contb_receipt_dt : Factor w/ 685 levels "01-Apr-15","01-Apr-16",..: 510 21 374 327 285 244 675 120 605 100 ...
##  $ receipt_desc     : Factor w/ 25 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 2 2 2 2 1 1 2 1 ...
##  $ memo_text        : Factor w/ 89 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 19 1 10 19 10 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 2 2 2 2 1 1 2 1 ...
##  $ file_num         : int  1146165 1146165 1146165 1146165 1146165 1091718 1144564 1077404 1091718 1077404 ...
##  $ tran_id          : Factor w/ 166816 levels "A0000FD2A304E432AAD5",..: 108798 118153 112319 108928 120998 54281 166784 144656 54504 144411 ...
##  $ election_tp      : Factor w/ 4 levels "","G2016","O2016",..: 2 2 2 2 2 4 4 4 4 4 ...

## [1] 0

## [1] 3

The FEC dataset was obviously large (167,259 records by 18 columns), so I looked through the columns to determine what they were referencing and decide whether I needed them all.

The first three columns (cmte_id, cand_id, cand_nm) were all factor/categorical variables with the same number of levels (24). Essentially, they were three different ways of capturing the same information - which candidate is receiving the money? I really only needed one of these, so I decided to keep the one that was easiest for me to use, the candidate’s name (cand_nm).

The data also had three location columns - contbr_city, contbr_st, contbr_zip. These referred to the contributor’s city, state and zipcode. Because the data is from Ohio, the state information was the same for each record. I was planning on using the zipcode information (contbr_zip) for matching (even though it had three missing values), so I needed to keep that. However, the names of the cities are much easier to understand when assigning contributions to a location, so I decided to keep both.

I did notice that some of the zipcodes did have values longer than 5 (typical length of a US zipcode) and knew that I would need to investigate why this was the case.

The dataset also provides information about the contributor’s employer (contbr_employer) and occupation (contbr_occupation). While this could be interesting information for another analysis, I was not interested in exploring the characteristics of the contributors to achieve my goal, so these were not needed.

There were also columns for the amount/value (in dollars) of the contribution (contb_receipt_amt), and the date (contb_receipt_dt). These were definitely important pieces of information that I wanted to keep. The date information was not in a date format (which I would need if I wanted to make comparisons over time), so this would need to be changed.
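The date strings in the FEC output look like "23-Sep-16" (day, abbreviated month name, two-digit year). A minimal sketch of the conversion, assuming an English locale for the month abbreviations and using a few sample values taken from the output above:

```r
# Sample date strings in the FEC format (day-month abbreviation-two digit year).
receipt_dt <- c("23-Sep-16", "01-Sep-16", "12-Apr-16")

# %d = day of month, %b = abbreviated month name (locale dependent), %y = two-digit year.
receipt_date <- as.Date(receipt_dt, format = "%d-%b-%y")

class(receipt_date)
# Date arithmetic now works, e.g. the number of days between two contributions.
days_apart <- as.numeric(receipt_date[1] - receipt_date[2])
days_apart
```

If the data started life as a factor (as it did here), wrapping the column in `as.character()` before `as.Date()` makes the intent explicit.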

The form type (form_tp) and file number (file_num) didn’t seem important for my purposes.

I wasn’t sure what the election_tp column was referring to, so I explored it.

##
##         G2016  O2016  P2016
##    393  59160     20 107686

I discovered that it referred to election type. It used three different codes (G2016, P2016, O2016) to refer to the type of election to which the contribution was made. Some records (393) did not have election type information.

The codes refer to one of three types of elections towards which the contributions could be made.

G2016: General Election - an election to fill public offices (in our case the election of the presidential candidates)
P2016: Primary Election - an election prior to the general election in which voters select the candidates who will run on each party’s ticket
O2016: Other - the FEC file description lists the O code as "Other", meaning a contribution designated for an election that is neither the primary nor the general election

I decided that it was important to keep the information for both General and Primary elections, and did not keep any records that didn’t have an election type assigned. I also decided to drop the 20 records coded O2016 because I felt that the interpretation of this information could be more difficult, whereas for the regular General and Primary election codes the intended party of the support is clear.
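Keeping only the general and primary election codes amounts to a simple row filter. The sketch below uses an invented four-row data frame (`fec` and its `tran_id` values are placeholders); `droplevels()` is optional and only tidies the factor so that unused codes no longer appear in tables.

```r
# Toy data: one record per election type code, including a blank code.
fec <- data.frame(
  tran_id = c("T1", "T2", "T3", "T4"),
  election_tp = factor(c("G2016", "P2016", "O2016", "")),
  stringsAsFactors = FALSE
)

# Keep only general (G2016) and primary (P2016) contributions.
kept <- fec[fec$election_tp %in% c("G2016", "P2016"), ]

# Optional: remove the now-empty factor levels.
kept$election_tp <- droplevels(kept$election_tp)
table(kept$election_tp)
```

Without the `droplevels()` step the filtered factor keeps all four levels, and the removed codes show up with a count of zero.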

There were a number of other columns - receipt_desc, memo_cd, memo_text and tran_id. Of these, the only one that I decided was relevant was transaction ID (tran_id) to help confirm again that there were no duplicate records.

Based on my exploration, I decided to include the following columns in the final dataset.

cand_nm
contbr_nm
contbr_city
contbr_zip
contb_receipt_amt
contb_receipt_dt
tran_id
election_tp

I then completed all of the subsetting that was needed to select the information I had chosen to include, and changed the data formats that I had identified needed modification.

## [1] 226

## 'data.frame':    166620 obs. of  8 variables:
##  $ cand_nm          : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ contbr_nm        : Factor w/ 44968 levels "'CALLAHAN, PAMELA",..: 44229 27486 14945 30460 19462 27603 17173 17173 30646 23830 ...
##  $ contbr_city      : Factor w/ 1360 levels " BATAVIA","45320",..: 346 130 1356 989 274 818 833 833 1289 231 ...
##  $ contbr_zip       : int  430168453 44512 445121402 442661010 432051112 458698604 446631328 446631328 450697039 452192420 ...
##  $ contb_receipt_amt: num  250 500 250 500 250 ...
##  $ contb_receipt_dt : Date, format: "2014-07-17" "2014-09-18" ...
##  $ tran_id          : Factor w/ 166816 levels "A0000FD2A304E432AAD5",..: 1410 1785 4606 23 588 1028 3266 1265 5254 2796 ...
##  $ election_tp      : Factor w/ 4 levels "","G2016","O2016",..: 4 4 4 4 4 4 4 4 4 4 ...

With the undesired data removed, a second duplicate check revealed 226 duplicated records, which were also removed. This reduced the dataset to 8 columns and 166,620 observations.

To investigate zipcodes with more than five digits, I read this article. I discovered that the extra digits do carry relevant information (they are the four-digit ZIP+4 extension, which identifies a more specific delivery area within the five-digit zipcode), but because geographical distances were not going to be used in predictions, I decided that I didn’t need that level of precision and used only the first five digits.
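Trimming a nine-digit value to its first five digits can be sketched as follows, using three zipcode values that appear in the FEC output above. This assumes the zipcodes are stored as integers with no leading zeros, which holds for Ohio zipcodes (they begin with 43, 44 or 45).

```r
# A mix of five-digit and nine-digit (ZIP+4 without the hyphen) values.
contbr_zip <- c(45315L, 432141210L, 441071232L)

# Convert to text, keep the first five characters, convert back to integer.
zip5 <- as.integer(substr(as.character(contbr_zip), 1, 5))
zip5
```

Working on the text form avoids having to treat five-digit and nine-digit values differently, which integer division would require.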

I also noticed that there were negative values for some contributions and decided to investigate using Python (as I found it easier to manipulate the data in a Jupyter Notebook).

Based on the information in receipt_desc these values were for refunds or reassignments to a spouse or a different election (e.g. from Primary to General).

I was originally planning to remove only the refunded transactions and their corresponding initial contributions, but discovered that the refunding of transactions was more complicated than I expected. At times there was a straight refund, but for most transactions, multiple steps were involved in the refunding. These processes included reassigning the funds to a different election (e.g. from primary to general election, or from general election to a senate race), or reassigning to a spouse. In the majority of cases the value of the refund did not match the initial value, so correctly matching refunds to transactions would have been a challenging task.

I compared the distribution of contribution sizes per candidate between the full dataset and the refunded-donations dataset. I decided that the two were sufficiently similar to simply remove all records belonging to any contributor who had a refund (or a contribution of 0). I used Python to conduct the comparison but removed the unnecessary data with R.
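Removing every record for any contributor who has a refund or zero-value entry can be sketched in base R as below. The contributor names and amounts are invented, and matching on name alone is an assumption of this sketch (two different people sharing a name would be treated as one).

```r
# Toy contributions: ANN has a refund, CAT has a zero-value record, BOB has neither.
contribs <- data.frame(
  contbr_nm = c("A, ANN", "A, ANN", "B, BOB", "C, CAT", "C, CAT"),
  contb_receipt_amt = c(100, -100, 50, 25, 0),
  stringsAsFactors = FALSE
)

# Names of contributors with at least one non-positive amount.
flagged <- unique(contribs$contbr_nm[contribs$contb_receipt_amt <= 0])

# Drop all records for those contributors, including their positive ones.
cleaned <- contribs[!(contribs$contbr_nm %in% flagged), ]
cleaned
```

This is deliberately blunt: it sacrifices some valid contributions in exchange for not having to match each refund to its original transaction.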

This reduced the dataset by approximately 10,000 records (5.8% of the total).

I also decided to add a column for the candidates’ party alignment (because voting patterns often follow party alignment).

##  Factor w/ 3 levels "democrat","Other",..: 3 3 3 3 3 3 3 3 3 3 ...
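One way to derive a party column is a lookup on the candidate name. The grouping below is an assumption of this sketch rather than the original code, although it is consistent with the party totals reported in this analysis; every candidate not listed is treated as republican.

```r
# Assumed party groupings, keyed on the FEC candidate name format.
democrats <- c("Clinton, Hillary Rodham", "Sanders, Bernard",
               "O'Malley, Martin Joseph", "Lessig, Lawrence",
               "Webb, James Henry Jr.")
others <- c("Johnson, Gary", "Stein, Jill", "McMullin, Evan")

# A few sample candidate names.
cand_nm <- c("Trump, Donald J.", "Clinton, Hillary Rodham",
             "Stein, Jill", "Kasich, John R.")

# Nested ifelse: democrat first, then Other, everything else republican.
party <- ifelse(cand_nm %in% democrats, "democrat",
                ifelse(cand_nm %in% others, "Other", "republican"))

# Fix the level order explicitly so it does not depend on the locale's sort order.
cand_party <- factor(party, levels = c("democrat", "Other", "republican"))
table(cand_party)
```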

I then added the zipcode information to the main dataset.

## [1] 0.0007073984

This left some records with missing coordinate information, but their zipcodes were not clearly identifiable as coming from Ohio and they made up a very low proportion of the data (less than 0.1%). I decided to remove them, which left 156,815 contribution records.
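A left join on zipcode, followed by a check of how many records failed to match, can be sketched as follows. Both data frames are invented stand-ins (the coordinates are rough placeholder values, and 99999 is a deliberately unmatched zipcode); the column names follow the two datasets described above.

```r
# Toy contributions; the last zipcode has no entry in the lookup table.
contribs <- data.frame(
  tran_id = c("T1", "T2", "T3", "T4"),
  contbr_zip = c(45315L, 44060L, 43214L, 99999L),
  stringsAsFactors = FALSE
)

# Toy zipcode lookup with placeholder coordinates.
zips <- data.frame(
  Zipcode = c(45315L, 44060L, 43214L),
  Lat = c(39.9, 41.7, 40.1),
  Long = c(-84.3, -81.3, -83.0)
)

# all.x = TRUE keeps every contribution, filling unmatched coordinates with NA.
merged <- merge(contribs, zips, by.x = "contbr_zip", by.y = "Zipcode", all.x = TRUE)

# Proportion of records without coordinates.
prop_missing <- mean(is.na(merged$Lat))
prop_missing

# Keep only the records that could be located.
located <- merged[!is.na(merged$Lat), ]
```

Computing the proportion before dropping rows makes it easy to judge whether the loss is small enough to ignore.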

I then converted the county names to lower case and the names of all of the columns in the dataset to lower case.

I called this cleaned version of the original dataset ohio.2016.

2.2 Countyzip Dataset

I converted the county names to lower case and merged with the ohio.2016 dataset.

I then merged the election dataset with the countyzip dataset on county (after changing the election county information to lower case as well).

##    county               candidate percent_vote count_vote ZIP.Code
## 1   adams        Trump, Donald J.         76.3       8445    45105
## 2   adams Clinton, Hillary Rodham         20.7       2293    45105
## 3   adams           Johnson, Gary          2.0        220    45105
## 4   adams               Duncan, R          0.6         62    45105
## 5   adams             Stein, Jill          0.4         43    45105
## 6   adams        Trump, Donald J.         76.3       8445    45144
## 7   adams Clinton, Hillary Rodham         20.7       2293    45144
## 8   adams           Johnson, Gary          2.0        220    45144
## 9   adams               Duncan, R          0.6         62    45144
## 10  adams             Stein, Jill          0.4         43    45144
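The lower-casing and merge can be sketched with two small illustrative data frames (a handful of rows in the shape of the election and countyzip datasets; the Allen zipcode is a placeholder). Each county's candidate rows are repeated once per zipcode in that county, which is why the merged table is much longer than either input.

```r
# Toy election results: two candidates in each of two counties.
election <- data.frame(
  county = c("Adams", "Adams", "Allen", "Allen"),
  candidate = c("Trump, Donald J.", "Clinton, Hillary Rodham",
                "Trump, Donald J.", "Clinton, Hillary Rodham"),
  percent_vote = c(76.3, 20.7, 66.9, 28.7),
  stringsAsFactors = FALSE
)

# Toy county-to-zipcode lookup: two zipcodes in Adams, one in Allen.
countyzip <- data.frame(
  ZIP.Code = c("45105", "45144", "45801"),
  County = c("Adams", "Adams", "Allen"),
  stringsAsFactors = FALSE
)

# Put both county columns in lower case under the same name.
election$county <- tolower(election$county)
countyzip$county <- tolower(countyzip$County)
countyzip$County <- NULL

# Inner join on county: rows = sum over counties of (candidates x zipcodes).
results <- merge(election, countyzip, by = "county")
nrow(results)
```

Here Adams contributes 2 candidates x 2 zipcodes = 4 rows and Allen 2 x 1 = 2 rows, for 6 rows in total.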

This resulted in two final datasets, the main dataset ohio.2016 with 156,803 records and the following columns:

contbr_zip: 5 digit contributor zipcode
cand_nm: the name of the candidate they contributed to
contbr_nm: the name of the contributor
contbr_city: the city of the contributor
contb_receipt_amt: the amount of the contribution
contb_receipt_dt: the date of the contribution
tran_id: transaction id, primarily to ensure unique transactions
election_tp: election type - general election, or primary
cand_party: the party of the candidate as democrat, republican or other
lat: latitude coordinates for the zipcode
long: longitude coordinate for the zipcode
county: county of the contributor

and, the results dataset with 6,655 rows and the following columns:

county: the reporting county
candidate: the names of the five presidential candidates
percent_vote: the percentage of the vote that they won for the county
count_vote: the count of the votes that they won for the county
ZIP.Code: the zipcodes for each county

The results dataset is arranged in long format, so there are multiple entries for each zipcode, each providing the county and candidate information. If I had merged this with the ohio.2016 dataset, each individual contribution would have been duplicated once for every results row matching its zipcode. I didn’t want this to occur, so I kept them separate for the time being.

3 Univariate Exploration

Ok, cleaning's all done - are you still with me? Now we can start getting into the fun part of really seeing what we have!

3.1 Candidates

##
##                 Bush, Jeb       Carson, Benjamin S.
##                       261                      7657
##  Christie, Christopher J.   Clinton, Hillary Rodham
##                        22                     67229
## Cruz, Rafael Edward 'Ted'            Fiorina, Carly
##                     14877                       617
##        Graham, Lindsey O.            Huckabee, Mike
##                        33                       146
##             Jindal, Bobby             Johnson, Gary
##                         9                       312
##           Kasich, John R.          Lessig, Lawrence
##                      4574                        22
##            McMullin, Evan   O'Malley, Martin Joseph
##                        15                        27
##         Pataki, George E.                Paul, Rand
##                         2                       727
##    Perry, James R. (Rick)              Rubio, Marco
##                        10                      2293
##          Sanders, Bernard      Santorum, Richard J.
##                     32208                        79
##               Stein, Jill          Trump, Donald J.
##                        94                     25482
##             Walker, Scott     Webb, James Henry Jr.
##                        99                         8

Twelve candidates received over 100 contributions from Ohio voters. These were

Jeb Bush
Ben Carson
Hillary Clinton
Ted Cruz
Carly Fiorina
Mike Huckabee
Gary Johnson
John Kasich
Rand Paul
Marco Rubio
Bernie Sanders
Donald Trump

Hillary Clinton, Bernie Sanders, and then Donald Trump received the largest number of contributions.

This result was surprising to me. As a state, Ohio supported Trump as the presidential candidate, but based on the number of contributions, there was much more support for Clinton. I wondered if looking at the value of the contributions per candidate would provide further insight. I also wondered whether contributions over time, and the differences between the primary and general elections, would change this apparent support for Clinton.

3.2 Parties

The spread of contributions across parties was as follows.

##
##   democrat      Other republican
##      99494        421      56888

There were a lot more republican candidates than democratic candidates, so I wondered if the surprising number of contributions for Clinton would be balanced out at the party level.

As you can see, this was not the case. The largest number of contributions was made to democratic candidates (99,494), compared to just over half as many (56,888) made to republican candidates. Very few contributions (421) were made to candidates outside the two major parties.

So even with all candidates grouped by party, the larger number of republican candidates didn’t translate into more contributions for republican candidates overall.

3.3 Contributions per Election Type

##
##         G2016  O2016  P2016
##      0  55625      0 101178

I figured that there could be differences in the number of contributions made towards each election type (general and primary). I hadn’t expected to find so many more contributions for the primary election (101,178) than the general election (55,625).

3.4 Contributions per Contributor

## [1] 43683

Again, contributor characteristics were not something that I was intending to include in my model for predicting election results, but I thought it might be a good idea to check the information, just in case something significant was identified.

There were 43,683 unique contributors in the ohio.2016 dataset. The contributors who made the largest number of contributions each made more than 100 contributions. Two individuals made over 150 contributions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    1.00    3.59    3.00  179.00

The contribution numbers of the contributors with the highest numbers of individual contributions were very different from how contributions were made in the general population.

I limited the information to show how the bottom 95% of contributors (in terms of number of contributions) made their contributions.

The majority of people made 3 contributions or fewer (more than half made just one), and 95% of contributors made 13 contributions or fewer.
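Restricting a view to the bottom 95% of contributors can be sketched by counting records per name and cutting at the 95th percentile. The names and counts below are invented, and `quantile()` is used with its default interpolation, so the cutoff need not be a whole number.

```r
# Toy contributor names: most appear a few times, one appears 40 times.
contbr_nm <- c("A", "B", rep("C", 2), rep("D", 3), rep("E", 40))

# Contributions per contributor as a plain numeric vector.
counts <- as.vector(table(contbr_nm))
summary(counts)

# 95th percentile of the per-contributor counts.
cutoff <- quantile(counts, 0.95)

# Contributors at or below the cutoff.
bottom95 <- counts[counts <= cutoff]
bottom95
```

With so few toy contributors the cutoff falls between the two largest counts, so only the single heavy contributor is excluded.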

While this is interesting, the number of contributions per contributor is not my main focus. At this point I haven’t seen anything that suggests it needs to be included in future explorations of the data.

3.5 Contributions per City

## [1] 1252

One of the realities in voting is that it doesn’t just matter that a vote was cast, it can matter where it was cast. I reasoned that this was also possibly the case for contributions. I wanted to have a look at where contributions were being made.

Again, I split the investigation into the highest contributing cities and then examined the contributions patterns for the majority of cities.

Contributions came from 1,252 cities across Ohio. The top six cities for contributions to presidential candidates were Columbus, Cincinnati, Cleveland, Dayton, Akron, and Toledo. Each of these cities had over 2,500 contributions from their inhabitants.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     1.00     3.00    16.00   125.24    71.25 16304.00

When looking at the cities in the lowest 75% by number of contributions, there were many cities from which only one contribution was made. 25% of the cities had up to three contributions. All of the cities in the lowest 75% had 72 contributions or fewer. However, as we saw above, the maximum number of contributions for a city ranged up to 16,304 for Columbus.

This information highlighted the importance of location. If information is just looked at by state averages, the impacts of the contributions might be outweighed by what is found in cities with more contributions.

I realized that this might also help explain some of the differences seen in the numbers of contributions. Perhaps contributions for Clinton and/or Democratic candidates were more likely to occur in cities.

To see the spread of the contributions across Ohio, I plotted the contributions by their latitude and longitude coordinates that had come from the associated zipcode.

Even though all contributors had a state name of Ohio in the original FEC data, there were a number of coordinates that were well outside the Ohio borders. Either the zipcode information was wrong, or some Ohio contributions were made out of state and this was the zipcode that was recorded.

I decided to exclude any points that didn’t fall within a bounding box around Ohio: longitude (x-axis) between -85 and -80 and latitude (y-axis) between 38 and 42. I also looked up the central coordinates for the cities that were identified as having the top 6 highest contribution numbers to add to the map.
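The bounding-box filter can be sketched as a logical mask on the coordinate columns. The points below are rough placeholder coordinates (two inside the box, two far outside), and the column names `lat` and `long` follow the final dataset description.

```r
# Toy points: two roughly in Ohio, two well outside it.
pts <- data.frame(
  lat = c(39.96, 41.50, 18.14, 34.05),
  long = c(-83.00, -81.69, -66.26, -118.24)
)

# TRUE where a point lies inside the longitude/latitude limits.
in_box <- pts$long >= -85 & pts$long <= -80 &
  pts$lat >= 38 & pts$lat <= 42

ohio_pts <- pts[in_box, ]
ohio_pts
```

A rectangle is only an approximation of the state outline, so it can still admit points just across the border in neighboring states; it is mainly useful for removing coordinates that are clearly elsewhere.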

Finally, I added the state and county boundaries using the maps library.

In this plot, the darker the dot, the more contributions were made. We can see a concentration of contributions around the cities (with some spread - possibly because residential zipcodes are often further out from the city center). This mirrored what we saw when examining the individual cities, including some areas where very few contributions were made.

3.6 Contributions per County

## # A tibble: 10 x 2
##    county       n
##    <chr>    <int>
##  1 cuyahoga 15972
##  2 franklin 15419
##  3 hamilton 13896
##  4 summit    4379
##  5 butler    4133
##  6 warren    3244
##  7 lorain    3202
##  8 delaware  3106
##  9 stark     3069
## 10 clermont  2801

I also wanted to look at the contributions per county because this is often done for vote counts.

When looking at vote counts, it is not the county that is critical, but the electoral area. While electoral areas do not always align with county borders, it is common to report results according to county. One of the reasons for this is that from decade to decade the boundaries for the electoral areas can change. Comparing by counties allows for some consistency across the years.

The 10 counties with the most contributions were Cuyahoga (Cleveland), Franklin (Columbus), Hamilton (Cincinnati), Summit (Akron), Butler (Cincinnati), Warren (Cincinnati), Lorain (Cleveland), Delaware (Columbus), Stark (Akron), and Clermont (Cincinnati). Each of these counties is associated with one of the major cities that was identified above.
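A ranking like the one above can be produced by counting rows per county and sorting. The following is a small base R sketch on toy data; the `county` column name is an assumption, and the report's own table was evidently produced with tibble-based tooling rather than this exact code.

```r
# Toy data: one row per contribution, with a hypothetical county column.
contributions <- data.frame(
  county = c("cuyahoga", "franklin", "cuyahoga", "hamilton",
             "franklin", "cuyahoga")
)

# Count contributions per county and sort from most to fewest.
county_counts <- sort(table(contributions$county), decreasing = TRUE)

# Show the top entries (10 in the report; the toy data only has 3 counties).
head(county_counts, 10)
```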

To work out how to plot the spread of contributions across counties, I used this resource.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      74     228     434    1517    1175   15972

Again, mapping the contribution by county confirmed what had been seen above - that counties with the highest numbers of contributions are associated with the cities that had the highest contribution numbers.

We can also see the great differences between county contribution numbers across the state. Counties with the lowest contribution numbers have 167 contributions or fewer, while the counties with the largest numbers of contributions have at least 2,750 contributions.

3.7 Contribution Values

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     0.08    19.00    30.00   118.78    80.00 29100.00

There was a large spread in the dollar values of the contributions made, ranging from $0.08 to $29,100.

To understand what was happening for the majority of the contributions I examined the bottom 95% of contribution values.

As we can see, 95% of the contributions were less than $400. Contributions also cluster at round dollar amounts, particularly at multiples of $25 and $50.
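One way to isolate the bottom 95% of values is to cut at the 95th percentile. The sketch below uses made-up amounts and base R's `quantile()`; it illustrates the idea and is not the report's actual code.

```r
# Toy data: hypothetical contribution amounts in dollars.
amounts <- c(5, 10, 25, 25, 25, 50, 50, 50, 100, 100,
             25, 50, 10, 25, 50, 250, 25, 50, 100, 2700)

# 95th percentile of the contribution values.
cutoff <- quantile(amounts, probs = 0.95)

# Keep only contributions at or below that cutoff.
bottom_95 <- amounts[amounts <= cutoff]

# Tabulate the remaining values to see which amounts are most common.
table(bottom_95)
```

Trimming the top 5% only changes what is plotted; the large contributions remain in the data for the later analysis of totals.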

3.8 Contribution Dates

I wanted to be able to potentially classify dates by weeks or months so I added two columns to do this. I found that plotting the number of contributions per week provided a balance between too much variability (which can make interpretation difficult) and too much summarization (which can gloss over key details) when plotting over time.
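Adding week and month columns can be done in base R with `cut()` on dates. This is a sketch under assumptions: the `date` and `amount` column names are hypothetical, and the report may have used a different approach.

```r
# Toy data: hypothetical contribution dates.
contributions <- data.frame(
  date   = as.Date(c("2016-02-01", "2016-02-03", "2016-02-10", "2016-03-15")),
  amount = c(25, 50, 100, 25)
)

# cut() on Date objects bins each date into the week or month it falls in.
# Weeks start on Monday by default; the label is the first day of the bin.
contributions$week  <- as.Date(cut(contributions$date, breaks = "week"))
contributions$month <- as.Date(cut(contributions$date, breaks = "month"))

# Number of contributions per week.
table(contributions$week)
```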

The plotting showed some substantial fluctuations in how many contributions were received over time.

While contributions started from July 2014, they started to increase from around April 2015 (approximately a year and a half before the election) and continued up until December 2017, but dropped sharply with the election on November 8, 2016. There are three spikes of contributions, occurring from around February 2016 to April 2016, then part way through June 2016 with a big drop midway through August 2016, and picking up from this point until the drop with election day.

I noticed that there were contributions after the election day. While they may not contribute substantially to understanding the data, I thought it could be interesting to know why they exist.

The number of contributions over time continued to make me interested in looking into contribution values, election types and candidates and parties to see if any patterns emerged that might explain the eventual support of Trump.

3.9 Results Dataset

I examined the overall vote spread between candidates.

## election$candidate: Clinton, Hillary Rodham
## [1] 2317001
## --------------------------------------------------------
## election$candidate: Duncan, R
## [1] 23501
## --------------------------------------------------------
## election$candidate: Johnson, Gary
## [1] 168602
## --------------------------------------------------------
## election$candidate: Stein, Jill
## [1] 44310
## --------------------------------------------------------
## election$candidate: Trump, Donald J.
## [1] 2771984

While voting for a candidate not from the major two parties continues to be a point of discussion for US politics, in the case of Ohio, if all voters who voted for non-major party candidates had voted for Clinton, it would not have been sufficient for her to surpass Trump’s vote count.
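The arithmetic behind this claim can be checked directly from the vote totals printed above. The short names in this base R sketch are just labels for those totals.

```r
# Vote totals as reported in the output above.
votes <- c(Clinton = 2317001, Duncan = 23501, Johnson = 168602,
           Stein = 44310, Trump = 2771984)

# Total for the candidates outside the two major parties.
third_party <- sum(votes[c("Duncan", "Johnson", "Stein")])

# Clinton's hypothetical total if she had received all of those votes.
clinton_plus <- unname(votes["Clinton"]) + third_party

# Remaining gap to Trump's total.
gap <- unname(votes["Trump"]) - clinton_plus
gap
```

Even in that hypothetical scenario Clinton would have trailed Trump by more than 200,000 votes.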

3.10 Summary of Findings from Univariate Analysis

3.10.1 Primary investigation focus

While there are many interesting paths that could be followed with the above data, the above information solidified my interest in the original goal - to use information about contribution values to predict the final presidential election results.

3.10.2 Other key data features

From my explorations, other areas that appeared to be of value in achieving this goal were the party of the candidate, the type of election to which a contribution was made, where the contribution was made, and when the contribution was made.

4 Bivariate Exploration

Ok, are you still with me? Next step - two variables at a time!

I could have started my bivariate analysis by focusing on the main relationship I was looking to observe, but I felt that it was important to first understand the factors that may contribute to this relationship. With so many additional variables, it seemed easy to miss a key element if I didn’t properly understand these ‘other’ factors first.

4.1 Pairwise plotting

When trying to understand the potential relationships in your data, it can be helpful to do some ‘en masse’ plotting. A number of different variables are all compared at the same time.

From the key investigative variables outlined above, I looked at candidate party, the type of election, the time of contribution and the amount of contribution. I excluded the name of the candidate and the geographic information because their inclusion would have created too many comparisons and likely made interpretation of the results incredibly difficult.

Here’s what can be seen in the plots.

4.1.1 Contribution Values

The plots in this column (far left) highlight the highly skewed data for contribution values with most of the plots showing a single bar
The comparison to date (second row) shows an interesting picture of how contribution values fluctuate over time, and also some common values for contributions, with $5,000 being the typical max, but regular contributions being up to approximately $2,500

4.1.2 Contribution Date

The slightly negative correlation between contribution date and value suggests that there may be some value in exploring contribution amounts over time, as contribution amounts decrease over time (Come to think of it, if there are fewer contributions in the general election - as we saw - this could make sense!)
There is almost no overlap in contribution dates for general election and primary election contributions
There are some differences in the times of contributions for the candidates of democratic and republican parties with contributions for democrats found primarily in 2016, while contributions for republicans are spread more throughout the period, with a few peaks (Again, we can see the higher numbers of contributions for democrats)
It seems the only contributions for candidates from non-major parties happened during the general election

4.1.3 Election Type

The inter-quartile range (IQR - the middle 50% - shown by a box) for contribution value for the two election types seems consistent but there is more variability in the outliers for the primary election contributions
The boxplots for date and election type show that there is overlap in the dates for the campaign types but that the key donation periods are very distinct
The contribution counts for candidate party compared to election type show that while there were relatively similar numbers of contributions made for republican and democratic candidates (with slightly more for democratic) in the primary election, there were substantially more contributions made for the democratic candidate than the republican candidate in the general election

4.1.4 Candidate Party

The values of the contributions for candidates from each party type (top-right) are quite similar (with what looks like a slightly higher IQR for “Other” candidates), but there is much more variability in the outlying contributions for republican candidates.
Again, we see the potential impact of contribution time, contributions to democrats were typically made later than for republicans, and latest for the candidate from a non-major party (in line with these only occurring for the general election)
And we see the confirmation of what has been previously surmised, similar contribution numbers in the primary elections for democrats and republicans (though more for the democrats), but a comparatively much larger difference in contribution numbers during the general election, weighted towards the democratic candidate

4.2 Contribution Target

From the pairwise comparisons I decided to first investigate the details of the contributions towards their target. This meant looking at both the candidate that received the contribution and their party alignment.

4.2.1 Number of Contributions Received per Candidate by Party

The first thing I did was take the black and white count plot from above and add color to it.

While number of contributions per candidate or party likely does not tell the full story in the data, I wanted to have the visual comparison to inform other investigations between variables as I moved forward.

The color-coding highlighted the partisan nature of election contributions with the vast majority of contributions made towards a candidate from one of the two major political parties.

Only one candidate that was not from a major party received notable contributions from Ohioans - Gary Johnson.

4.2.2 Value of Contributions per Candidate

The next logical step seemed to be to compare the number of contributions per candidate to the total value of contributions per candidate. For these plots, I did add in the additional element of color for party, even though it did technically add a third variable (and cause it to be not truly bivariate), but I felt that the additional color made for easier matching of the candidates through the plots and therefore easier comparisons.

With this plot you can immediately see that value of the contributions per candidate changes the picture of candidate support.

While Clinton still has the highest value of donations, Kasich and Trump now dramatically surpass Sanders in the value of contributions received.

The plot also gives us some insight into the differences in the sizes of individual contributions. While Sanders received the second highest number of contributions, the total value of those contributions was much lower, suggesting that he received much smaller contributions than other candidates.

Conversely, the average value of the contributions for Kasich and Trump appears higher than for either of Clinton and Sanders.

That being said, Trump still only receives the third highest level of support for value of contributions from Ohioans, so more work needs to be done to fully understand his final support as presidential candidate.

The use of a boxplot begins to confirm some of what I had described above in the comparison of contribution numbers to total value.

We see that Kasich has a number of very large contributions, much higher than any other candidate. His largest single contribution received was close to $30,000. I discovered that Kasich is the Governor of Ohio, so it does make sense that he would be able to attract very large donations within Ohio.

For all other candidates their contributions maxed out at a single individual contribution of under $6,000.

To gain a better understanding of what was happening for the majority of the candidates, I adjusted the scale to view only contributions under $5,500. Again, to easily find and match candidates I included color for their party alignment.

We find that only four candidates received contributions over $5,000 - Clinton, Cruz, Kasich and Trump. After these outliers, there appears to be a threshold at $2,700. Excluding the outliers, contributions for all candidates range up to this point.

We also see the depiction of the difference in the typical size of contributions for candidates. Both Clinton and Sanders have 75% of the contributions they received below $60. We can also see that while Kasich received a comparatively small number of contributions (as noted above), the 75% range of his contributions was much higher than any other candidate at $2,000.

The use of the boxplot also highlighted candidates like O’Malley and Pataki, who didn’t have a large number of contributions, but have the highest medians for value of contribution.

4.2.3 Value of Contributions by Candidate Alignment

I also wanted to collapse these contributions to make comparisons at the aggregate level for political affiliation.

This was the first time that I could clearly see evidence of the final support of Trump for president within Ohio. Trump became the republican nominee for president, and, if voters who contributed to republican candidates are generally inclined to vote republican, we find possible evidence of the reason for support within Ohio for Trump’s candidacy.

With the goal of fully understanding the patterns in the data, I also wanted to look at the spread of contributions across the political alignments. I scaled down the plots to $1,000 to get a clear picture of the spread under 75%.

We find further confirmation of what we have been discussing. The typical range of contributions for democratic candidates is lower than for republican candidates. The median contribution for democratic candidates is $25, which is lower than the $50 median contribution for republican candidates.
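A per-party comparison of typical contribution size could be sketched with `tapply()`. The amounts below are toy values chosen for illustration (they are not the report's data), and the `party` and `amount` column names are assumptions.

```r
# Toy data: hypothetical contributions labelled by party alignment.
contributions <- data.frame(
  party  = c("democratic", "democratic", "democratic",
             "republican", "republican", "republican"),
  amount = c(10, 25, 100, 25, 50, 2700)
)

# Median and mean contribution per party.
party_median <- tapply(contributions$amount, contributions$party, median)
party_mean   <- tapply(contributions$amount, contributions$party, mean)

party_median
party_mean
```

The median is the more robust of the two here: a single very large contribution moves the mean substantially but leaves the median unchanged.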

Candidates who are not from the major political parties have a higher range of contribution values than those from the major parties, but some of this may be due to the impact of sample size as there are far fewer contributions for these candidates than for the major party candidates.

4.3 Contribution Timing

After getting a strong handle on how contributions were made to candidates at an individual and political alignment level, I moved to understanding how contributions played out over the course of time.

4.3.1 Value of Contributions over Time

The first port of call was obviously getting the overall picture of how the value of contributions changed over time.

I used the same summarization technique as I had used for number of contributions and plotted the sum of all contributions per week over the course of the election cycle.

The peak patterns for the contribution values were similar to those seen when looking at contribution numbers, with both containing three similar peaks from the right side of the chart. However, adding contribution value saw an additional peak between July 2015 and September 2015, suggesting a smaller number of donations with large values during this period.

For the purposes of gaining a full understanding of the data, I decided to also look at the mean and median contribution values per week.

These plots tell us that while not many contributions were made early on in the campaign process, they were typically larger than the value of contributions made later in the campaign. Outside of noise early on in the campaign, there are three periods where it appears a smaller number of large value contributions were made, dragging up the mean while leaving the median relatively stable. Both the mean and the median also kicked up right at the end of the campaign period.
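The effect of a few large contributions on the weekly mean, but not the median, can be seen in a small sketch. The data and the `week` and `amount` column names below are hypothetical.

```r
# Toy data: two weeks of hypothetical contributions.
contributions <- data.frame(
  week   = c("2015-07-06", "2015-07-06", "2015-07-06",
             "2016-03-07", "2016-03-07", "2016-03-07"),
  amount = c(25, 50, 2700, 25, 25, 50)
)

# Sum, mean and median contribution value per week.
total <- tapply(contributions$amount, contributions$week, sum)
weekly <- data.frame(
  week   = names(total),
  total  = as.vector(total),
  mean   = as.vector(tapply(contributions$amount, contributions$week, mean)),
  median = as.vector(tapply(contributions$amount, contributions$week, median))
)

weekly
```

In the first toy week a single $2,700 contribution pulls the mean far above the median, which is the pattern described above.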

4.3.2 Contribution Dates by Election Type

The election course occurs such that first there are primary elections where voters select the presidential candidate for their party and then the nation votes on the president in the general election.

I felt that it was important to see how the type of election to which a contribution was made interacted with the time course of contributions. To keep the comparison to just two variables at a time, I went back to looking at the number of contributions and compared these to the election type of the contribution.

We can see a clear distinction between the contribution period for the primary election (from March 2015 to July 2016) and the general election (July 2016 to November 2016). While there are some small contributions made to the general election prior to July 2016, it makes sense that there are very few because at this time it is uncertain who will be running for president for each party. (This also reflects what we saw in the pairwise plotting.)

4.3.3 Contribution Dates by Candidate Alignment

I also thought there could be some significance in how political alignment impacted contributions over time. Again, I kept this comparison to the number of contributions over time.

Each of the major parties had two peak times in terms of contributions. For the republican candidates it was between January 2016 and April 2016 and then another peak during mid-June to early August 2016. For the democratic candidates the two peaks are from January 2016 to June 2016, longer than the first republican peak. The second peak for the democratic candidates was later than the second republican peak, from late August up until the election in November 2016. From when contributions for both parties started peaking, numbers of contributions per week for democratic candidates typically stayed above 1,000 but contributions for republican candidates sometimes dropped to close to 0 contributions per week.

4.4 Contribution Geography

From looking at the time course of contributions, I moved to looking at where contributions were made across the state.

4.4.1 Contribution Value across the State

I first wanted to plot the size of the contribution values across the state.

With the introduction of the size component to this map we can see that while there is a concentration of larger donations found in the areas of the major cities, there are certainly some key areas outside of the major city areas where higher contributions are found. (Reminder: darker dots mean more contributions of that value)

4.4.2 Contribution Values by County

To better compare the differences in contribution value between major centres and the rest of the state, I plotted the contribution value by county.

The concentrations of higher values of contributions to major city centres make sense given that they have more inhabitants, but as we had seen in the contribution plot, there are a number of counties outside of the major cities where contribution values are reasonably high.

As was identified when looking at the individual donations, we can see that while there are some concentrations of larger donations in city areas, there are some counties outside of the major city areas that have higher average contributions than found in the major cities - Cleveland and Akron are the only major cities that have associated counties with the highest contribution averages.

Mean contributions per county ranged from $33 per contribution to $262 per contribution.

When considering the median, the range is much smaller than for the mean, ranging from $23 per contribution to $80 per contribution.

We can see that the highest median contributions move even further from the city centres, suggesting that the higher means that were observed are pulled up by a small number of large contributions.

Median contributions for the counties directly associated with the city centres were highest for Cleveland and Cincinnati, but still fall in the middle bucket for median contributions. Toledo falls in the next bucket ranging from $27 to $28 per contribution, and the median contribution values for the immediate counties of Columbus and Akron are at the lowest end of the scale, below $27.

The progression of these plots suggests that while cities will always have a larger capture of contributions than smaller areas, there are many smaller areas in the state where the typical individual contribution is quite a bit higher than in the major city areas, sometimes by 2 to 3 times as much.

4.5 Election Results Compared to Contributions

With all of the comparisons between the ‘other’ variables completed, I moved back to the primary focus of my investigation - how do contributions impact election results?

4.5.1 Vote Counts v. Contribution Values for Presidential Candidates

I wanted to be able to compare vote counts for each candidate to the value of contributions that they received. Because vote counts were per county, I totaled the contributions received by each presidential candidate per county and then plotted each of the pairs of vote count and contribution value.
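The pairing step described above (totals per candidate per county, matched to vote counts) could look something like this in base R. All values and column names here are made up for illustration.

```r
# Toy data: hypothetical individual contributions.
contributions <- data.frame(
  county    = c("stark", "stark", "stark", "wood", "wood"),
  candidate = c("Clinton", "Clinton", "Trump", "Clinton", "Trump"),
  amount    = c(25, 50, 100, 10, 250)
)

# Toy data: hypothetical vote counts per candidate per county.
votes <- data.frame(
  county    = c("stark", "stark", "wood", "wood"),
  candidate = c("Clinton", "Trump", "Clinton", "Trump"),
  votes     = c(1000, 1200, 800, 900)
)

# Total contribution value per candidate per county.
totals <- aggregate(amount ~ county + candidate, data = contributions, FUN = sum)

# Join each total to the matching vote count.
combined <- merge(totals, votes, by = c("county", "candidate"))
combined
```

Each row of `combined` is one (county, candidate) pair, which corresponds to one point on the scatter plot.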

There was a strong relationship between the contributions received by a candidate and the corresponding votes that they received, but the distribution of values for both vote count and contribution value are quite spread out as the numbers increase.

In circumstances like this, the most appropriate thing to do before calculating a correlation is to transform the variables (using log10) so that they are more evenly distributed.

Transforming the data reduced the amount of variability in the prediction errors (shown by the shaded grey area), which is what we are looking for.

There was a strong correlation (0.895 with 95% confidence that the value falls between 0.865 and 0.918) between the value of contributions received by a presidential candidate in a certain county and the number of votes that they received.
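A correlation with a confidence interval on log10-transformed values can be obtained with base R's `cor.test()`. The numbers below are invented to show the mechanics and will not reproduce the 0.895 reported above; the column names are assumptions.

```r
# Toy data: hypothetical per-county totals for one candidate.
county_totals <- data.frame(
  contribution_total = c(150, 900, 4200, 12000, 56000, 310000),
  votes              = c(2100, 5400, 9800, 31000, 72000, 260000)
)

# Pearson correlation on the log10-transformed variables.
result <- cor.test(log10(county_totals$contribution_total),
                   log10(county_totals$votes))

result$estimate   # the correlation coefficient
result$conf.int   # its 95% confidence interval
```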

This is what I was hoping to find and was quite excited to get to this point.

4.5.2 Vote Counts v. Contribution Values by Candidate Alignment

In completing the comparisons for the presidential candidates, one of the things I realized was that I was missing all of the contribution information for the candidates who only campaigned during the primary election.

Because elections can be so influenced by candidate alignment, I thought that perhaps it would be important to capture all of the contributions made within a particular county and compare that to the results of the vote counts for the same alignment.

The pattern of results looked very similar to what was found for the candidate contributions, but the plot had the same problem with the range of data and so it needed to be transformed.

The correlation of votes to contributions received when grouped by party alignment (0.888 with 95% confidence that the value falls between 0.857 and 0.913) was incredibly similar to what was found above. In fact, each correlation fell within the confidence interval of the other.

This suggested that either the contributions towards only the presidential candidates, or all of the contributions, could be included in a prediction model, with similar results.

4.5.3 Individual Contributions

While the per county information was interesting, if I wanted to be able to include variables like election type or time course in my considerations I needed to examine the data at the individual contribution level.

Based on the two plots above I decided to use only individual contributions for the candidates within the general election.

This type of plot can be difficult to interpret at first, because it can be unclear what all of the different lines mean. They occur because there are a limited number of vote count values for each contribution amount. For example, a candidate may receive 100 votes in a county, but there are many different contributions they received that are associated with that vote count. This creates the different lines. However, the key to interpretation is that if there is a correlation similar to what we have found above, the contribution values associated with the lines closer to the top will be more to the right than the ones on the bottom.

This doesn’t seem to be the case, but there was the same issue with the large spread of the contributions and vote counts, so my next step was to transform the data as I had done before.

As a side note, we can see the impact of only including the presidential candidates in the plot. We see the same general limit of $2,700 for the contribution values, with a $5,000 outlier around the 175,000 vote count.

The slope of the line in this plot says that there is a very slight correlation between vote count and the individual contributions (0.0539 with 95% confidence that the value is between 0.0464 and 0.0613). However, even though this value is considered statistically significant, for all practical considerations, it explains almost 0% of the variation in the data.
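To put a number on "almost 0%": for a simple linear relationship, the share of variance explained is the square of the correlation coefficient. The short check below uses the two correlations reported in this section.

```r
# Correlations reported in the text (on log10-transformed values).
r_individual <- 0.0539   # individual contributions vs. vote count
r_county     <- 0.895    # per-county totals vs. vote count

# Percentage of variance explained is r squared, times 100.
pct_individual <- r_individual^2 * 100
pct_county     <- r_county^2 * 100

round(pct_individual, 2)
round(pct_county, 1)
```

So the individual-level relationship accounts for roughly 0.3% of the variation, compared with about 80% for the per-county totals.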

When I first received this result I was very surprised that there was essentially no correlation. In the light of the high correlations for the total values, it seemed that there must be some critical interactions happening that would improve the ability to explain the variations in the data. My plan was to explore the impacts of some of the other variables, such as political alignment, contribution date, geography or election type.

I did end up discovering why there was such a substantial difference between the plots for grouped contributions and individual contributions, but rather than spoil the surprise, I’m going to keep walking through the journey.

4.6 Summary of Findings from Bivariate Analysis

4.6.1 Primary investigation focus

As I’ve mentioned, the primary focus for this project was to investigate the relationship between the number of votes a candidate receives in the presidential election and the number of financial contributions they receive.

The results that I found for this were generally surprising.

Firstly, I found a very strong correlation between the total value of contributions received per county and the number of votes a candidate received. This was unexpected for me. While I expected there to be a correlation, I didn’t think that it would be as strong as it was (close to 0.9).

Secondly, I was then even more surprised to find almost no correlation between the election results and the individual contribution amounts. I was convinced that something was missing here and wanted to explore further.

4.6.2 Other key data features

There were differences in how contributions values ranged throughout the state. While the sum of contributions in larger city areas was higher than for less populated areas, the typical contributions were higher outside of the city.

The timing of contributions differed between the political alignments, and, generally, contributions to the two election types were split into two time segments.

It was also discovered that while republican candidates received fewer contributions in number than democratic candidates, they received a higher total value of contributions, and a typically higher amount for an individual contribution.

Given that there were so many interactions between the variables, this suggested that there was a relationship between the individual contribution amounts and the election results but that it was mediated by some of these other variables.

4.6.3 Strongest relationship

The strongest relationship was between the result counts and contribution values for the candidates for each county. In some regards, given the strength of the correlation, it could be possible to just leave the investigation there, but, if I did that, I would only be able to predict count results once all contributions values were calculated. For use in the ‘real world’ this doesn’t seem very helpful and so I still wanted to pursue the individual contributions.

5 Multivariate Exploration

5.1 Interactions with ‘Other’ Variables

To gain a better understanding of how some of the other variables might interact with the correlation between election results and individual contributions, I explored some of these interactions.

5.1.1 Value of Contributions over Time by Party

One of the resources that informed my direction as I moved forward with building my model was a podcast episode from Hidden Brain (one of my favorite podcasts). An historian by the name of Allan Lichtman discussed his process for predicting election outcomes (he has done so correctly for the last nine presidential elections) and said that the hoopla that is made of the time course of the election cycle is more a product of the media than something that contributes to predictive value.

As a result, I decided not to spend much time focusing on using data over time as part of my predictive model, but instead decided to investigate whether I could find a period of time during the election that could be used as an appropriate subset to predict the final outcome. Obviously this would need to be before the results were finalized!

By plotting over time, and comparing contributions by political alignment, we see that there are clear differences in contribution patterns for candidates of the two major parties.

For democrats, the value of contributions per week starts off low and generally increases through the entire course of the election period.

For republicans, the value of contributions per week starts high, spikes up and down, drops very low, and then spikes back up again.

While I planned to stick with finding a single point in time to use as a predictor, the different contribution patterns for the two political alignments give some insight into how tracking this information over time, and using it to inform predictions, could be confusing - depending on the period you picked, the data shows differing levels of support for candidates of each alignment.

5.1.2 Faceted by Election Type

I had found earlier that election type appeared to act as a helpful tag for creating time boundaries for contributions and so I wanted to see how that would interact with the data above.

Splitting the contributions into the two election types reinforced the findings from above.

For each election type we see similar patterns to what were observed overall. Contributions per week for democratic candidates typically grew over time, while for republican candidates they typically started high and then fluctuated, but trended down within the associated time period. (This pattern is supported by the pairwise plots, which found a slight negative correlation between contribution values and time.)

5.1.3 Total Contributions by Alignment per Election

I also noticed that overall, contributions for republican candidates seemed much higher than for democratic candidates in the primary election, but the values seemed closer in the general election.

The disparity of contributions seen in this plot was slightly unexpected for me. The value of contributions to republican candidates in the primary elections was almost double that of democratic candidates. In the general election, the order reversed, but not to such a degree, with the democratic candidate receiving slightly more in total contribution value than the republican candidate.

5.2 Incorporating into Primary Focus

With this additional information, I started to incorporate each of the other variables into the correlation between individual contributions and election results to discover their impact.

5.2.1 Candidate Alignment

Candidate alignment appeared to have a clear influence over how results panned out so this was the first variable I added into my previous scatter plot. I retained the use of the log transformations, as had been done previously. (A reminder that this data only includes contributions for the general election candidates)

The plot showed that there was some interaction between candidate alignment and the basic interaction. For the republican candidate, there was no correlation between vote counts and contribution values, but there was a slight positive relationship for the democratic candidate. (While I included the data for candidates classified with ‘Other’ as their political alignment for completeness, these correlations were not a focus of my investigation because they accounted for such a small amount of the data)

This was caused by the fact that while the contribution values for the highest vote counts still spanned a broad range, the contribution values for the lower vote counts for the democratic candidate were typically lower than for the higher vote counts.

That being said, while the correlation for the democratic candidate was an improvement (double the previous value), it was still only 0.111 (95% confidence interval: 0.103 to 0.121), which is not especially meaningful.
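
As an illustration of how a correlation and its 95% confidence interval can be computed on log-transformed values, here is a minimal Python sketch. It assumes the interval comes from the Fisher z-transform, which the text does not confirm, and the amounts and vote counts are made-up numbers, not the project data.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)


# Hypothetical (made-up) contribution amounts and county vote counts.
amounts = [25.0, 50.0, 100.0, 250.0, 500.0, 1000.0]
votes = [12000.0, 9000.0, 30000.0, 15000.0, 60000.0, 45000.0]

# Correlate the logs of both variables, as the scatter plots do.
r = pearson_r([math.log(a) for a in amounts], [math.log(v) for v in votes])
low, high = fisher_ci(r, len(amounts))
print(f"r = {r:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With tens of thousands of individual contributions the interval becomes very narrow, which is why even a small correlation can have a confidence interval that excludes zero.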

5.2.2 Election Type

Again, there had been interactions between the contribution values, candidate alignment and election types, so I layered on election type to see how it changed these correlations.

Adding in the election type showed another layer of interaction. The correlation size and direction remained relatively the same for the democratic candidate, but election type substantially changed the correlations for the republican candidate.

For the democratic candidate, correlations remained relatively unchanged from what was found above across each election.

Primary: 0.121 (95% confidence interval: 0.107 to 0.136)
General: 0.108 (95% confidence interval: 0.097 to 0.119)

For the republican candidate, for primary election contributions, the correlation between contribution values and the general election vote count was actually negative, but it became slightly positive for the general election.

Primary: -0.0199 (95% confidence interval: -0.0386 to -0.0013)
General: 0.0906 (95% confidence interval: 0.0709 to 0.1103)

However, even with these inclusions it still felt like far fewer of the differences in the data were being explained than I had expected.

5.3 Look at Geography

I knew that geography had shown differences in contribution values and so I turned my attention back here to see whether I could find something that could provide a better explanation of the differences in votes and contribution values.

5.3.1 Contributions across the State by Political Alignment by Election Type

I plotted all of the contribution values (not just those for the general election candidates) as I had done above, and then colored them by political alignment and split them into election types.

And with this, we can see so many of the different elements of what we have been discussing layered on top of each other. We can see a lot more contributions for republican candidates in the primary election, and a weighting towards the democratic candidate for the general election. We can also see clusters of support for republican or democratic candidates that often fall within county boundaries.

5.3.2 Relative Election Results by County

This suggested that where the contribution was made might be the missing factor in understanding the relationship to the final election results.

5.3.2.1 Relative Vote Count by Party per County

To determine whether the map of the contributions related to the election results, I decided to find the relative difference in vote counts between the candidates of the two major parties. If the democratic candidate had received more votes, the county would be blue, and if the republican candidate had received more votes, the county would be red.

The plot did start to highlight some of the concentrations but it didn’t allow for a clear understanding of the results. When the democratic candidate won a county, they typically did so in the major city centres, and so the relative vote counts were much higher than when the republican candidate won a county.

(It is a good reminder that when determining election results, boundaries for electoral areas don’t always match county areas, but it does remain a good long term comparison categorization)

5.3.2.2 Relative Vote Percent by Party per County

To reduce the impact of skew in result counts, I completed the same plots using percentage of votes (which acts similarly to finding the log10 of the differences, which is what we had done for vote counts in the scatterplot).
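
To make the difference between the two views concrete, here is a small Python sketch. It assumes the percentage is the two-party margin (difference divided by the combined major-party votes), which is one plausible reading of the text, and the county figures are hypothetical.

```python
def vote_count_difference(dem_votes, rep_votes):
    """Raw difference in votes; positive means the democratic candidate led."""
    return dem_votes - rep_votes


def two_party_margin_pct(dem_votes, rep_votes):
    """Difference as a percentage of the combined major-party vote."""
    return 100.0 * (dem_votes - rep_votes) / (dem_votes + rep_votes)


def county_color(margin):
    """Blue for a democratic lead, red for a republican lead."""
    if margin > 0:
        return "blue"
    if margin < 0:
        return "red"
    return "grey"


# Hypothetical counties: one large urban county, one small rural county.
counties = {"urban": (300000, 150000), "rural": (5000, 15000)}
for name, (dem, rep) in counties.items():
    diff = vote_count_difference(dem, rep)
    pct = two_party_margin_pct(dem, rep)
    print(name, diff, round(pct, 1), county_color(pct))
```

On raw counts the urban county dwarfs the rural one, but on the percentage scale the rural county shows the larger margin, so small counties are no longer washed out of the map.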

And finally we start to see a clear relationship in where contributions are happening and how that relates to general election results. Especially when compared to the general election result distributions of contribution, we can clearly see that areas of increased democratic support in terms of contributions are also associated with support in terms of voting results.

This also explains some of why the correlation of vote counts and total contribution values had such a different value when compared to the individual contribution values. The contribution values had been totaled according to county (and candidate alignment) to correlate them with the county vote results. This inherently factored in the interaction that is found with location.

5.4 Building the Model

If you have made it this far, well done! And you would be forgiven for thinking that by this point we were close to done in terms of finalizing everything. But, everything takes time, so keep up, because we are almost there!

As I’ve discussed before, I wanted to not just model the data but also use it for predictive purposes. This meant that I needed to not use the full dataset. Why? Because then I would be predicting the election results AFTER the election - which really doesn’t help anyone.

However, after playing with some of the data, I realized that at this point, I wanted to keep as much of it as possible. So I decided to just limit the data to a few days before the election. That would mean that, hypothetically, a few days before the election I could gather up all the data, run it through the model and provide my predictions of what I thought would happen. So that meant I included contribution data up until and including November 5, 2016. (Again, this model only used contributions made towards the general election candidates.)
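
The cutoff itself is a simple inclusive date filter. The Python sketch below shows the idea; the date field name `contb_receipt_dt` and the sample rows are assumptions for illustration, not taken from the project.

```python
from datetime import date

# Last day of contributions kept for the model (inclusive), per the text.
CUTOFF = date(2016, 11, 5)


def up_to_cutoff(rows, cutoff=CUTOFF):
    """Keep only contributions received on or before the cutoff date."""
    return [row for row in rows if row["contb_receipt_dt"] <= cutoff]


# Hypothetical rows; the date field name is assumed, not confirmed by the text.
rows = [
    {"contb_receipt_dt": date(2016, 10, 30), "contb_receipt_amt": 50.0},
    {"contb_receipt_dt": date(2016, 11, 5), "contb_receipt_amt": 100.0},
    {"contb_receipt_dt": date(2016, 11, 7), "contb_receipt_amt": 250.0},
]
kept = up_to_cutoff(rows)
print(len(kept), "of", len(rows), "rows kept")
```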

5.4.1 Selecting Model Components

Based on what had been discovered above, I wanted to capture the four-way interaction between contribution values, candidate political alignment, the election type towards which the contribution was made and the county in which the contribution was made.

This meant that I needed to add all of the lower order interactions as well as the individual variables into the model one at a time, and then add the larger interaction.

5.4.2 Initial Proposed Model

Here’s what that looked like:

Primary Relationship

m1: log(count_vote) by log(contb_receipt_amt)

Individual Variables

m2: election_type
m3: county
m4: cand_party

Two-way Interactions

m5: election_tp * county
m6: election_tp * cand_party
m7: cand_party * county

Three-way Interaction

m8: election_tp * cand_party * county
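
The stepwise entry listed above can be sketched as a sequence of cumulative model formulas. This Python snippet only builds the formula strings; it takes the terms literally from the list, writes interaction-only terms with `:` (R formula notation), and uses the `election_tp` spelling throughout. Whether each term also interacted with log(contb_receipt_amt) in the actual model is not stated, so that is left out.

```python
BASE = "log(count_vote) ~ log(contb_receipt_amt)"

# Terms in the order they were entered; names follow the list in the text.
STEPS = [
    ("m2", "election_tp"),
    ("m3", "county"),
    ("m4", "cand_party"),
    ("m5", "election_tp:county"),
    ("m6", "election_tp:cand_party"),
    ("m7", "cand_party:county"),
    ("m8", "election_tp:cand_party:county"),
]


def build_formulas(base, steps):
    """Return one formula per step, each adding a term to the previous one."""
    formulas = {"m1": base}
    current = base
    for name, term in steps:
        current = f"{current} + {term}"
        formulas[name] = current
    return formulas


formulas = build_formulas(BASE, STEPS)
for name, formula in formulas.items():
    print(name, "|", formula)
```

Entering the lower-order terms before each interaction means the change in fit at every step can be attributed to the one term that was added.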

The summary of that process looks like this.

##         Statistic         m1         m2         m3         m4         m5
## 1       R-squared      0.003      0.003      0.024      0.060      0.066
## 2  adj. R-squared      0.003      0.003      0.022      0.059      0.064
## 3           sigma      0.580      0.580      0.575      0.564      0.562
## 4               F    216.321    110.322     17.949     46.413     26.382
## 5               p      0.000      0.000      0.000      0.000      0.000
## 6  Log-likelihood -57967.386 -57965.230 -57285.035 -56025.371 -55800.927
## 7        Deviance  22313.944  22312.492  21859.179  21043.860  20901.813
## 8             AIC 115940.771 115938.460 114752.071 112236.743 111961.854
## 9             BIC 115968.076 115974.866 115580.316 113083.191 113600.141
## 10              N  66277.000  66277.000  66277.000  66277.000  66277.000
##            m6         m7         m8
## 1       0.083      0.092      0.096
## 2       0.081      0.087      0.091
## 3       0.557      0.555      0.554
## 4      33.273     21.433     17.311
## 5       0.000      0.000      0.000
## 6  -55200.964 -54885.325 -54723.698
## 7   20526.796  20332.210  20233.285
## 8  110765.927 110396.650 110261.396
## 9  112422.418 113245.450 113965.746
## 10  66277.000  66277.000  66277.000

To read this, note that each column gives the model statistics for that step of the entry. The first column (m1) gives the statistics of the base model, and the final column (m8) gives the statistics of the final model, with all of the steps in between.

We can see that the initial model explains 0.3% of the variance (R-squared value). With the addition of the interaction variables, the total variance explained did increase to 9.6%, but this is still comparatively low for the model to be used for predictive purposes.

One thing that I noticed is that adding election type contributed no change to the variance explained (at three decimal places) when added as the single variable, and only 0.6% when added with county, but 1.7% when added with candidate party. I was surprised that election type seemed to have so little impact but it seems that the last interaction is the most important.

I was expecting that more of the variation would have been explained here.
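
The AIC and BIC rows in the summary can be tied back to the log-likelihood row with the standard definitions, AIC = 2k - 2 logLik and BIC = k ln(N) - 2 logLik. The Python sketch below recovers the number of estimated parameters k from the reported values. Treating k as including the residual variance is an assumption about how the summary was produced, although it does reproduce the table.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


def n_params(aic_value, log_lik):
    """Back out k from a reported AIC and log-likelihood."""
    return (aic_value + 2 * log_lik) / 2


N = 66277
# (log-likelihood, AIC) pairs copied from the summary table above.
reported = {
    "m1": (-57967.386, 115940.771),
    "m5": (-55800.927, 111961.854),
    "m8": (-54723.698, 110261.396),
}
for name, (log_lik, aic_value) in reported.items():
    k = round(n_params(aic_value, log_lik))
    print(name, "k =", k, "BIC =", round(bic(log_lik, k, N), 3))
```

Because BIC charges ln(N) per parameter instead of 2, the models with many county terms are penalized much more heavily, which is why BIC does not fall steadily across the steps the way AIC does.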

5.4.3 Subsequent Proposed Model

Because this model, using the individual contribution values, did not explain sufficient variance in the data, I went back to the total contribution values by candidate political alignment and county.

Given that there had been limited difference between the correlations for contributions to all candidates and those to just the general election candidates, I decided to include them all to capture as much of the available data as possible.
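
Moving from individual contributions to totals is a group-and-sum step. The Python sketch below groups by county, candidate alignment and election type, which is an assumption based on the variables the model uses; the rows are hypothetical.

```python
import math
from collections import defaultdict


def total_by_group(rows):
    """Sum contribution amounts per (county, party, election type)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["county"], row["cand_party"], row["election_tp"])
        totals[key] += row["contb_receipt_amt"]
    return dict(totals)


# Hypothetical individual contributions.
rows = [
    {"county": "Greene", "cand_party": "democrat", "election_tp": "primary",
     "contb_receipt_amt": 50.0},
    {"county": "Greene", "cand_party": "democrat", "election_tp": "primary",
     "contb_receipt_amt": 25.0},
    {"county": "Greene", "cand_party": "republican", "election_tp": "general",
     "contb_receipt_amt": 100.0},
]
totals = total_by_group(rows)
for key, amount in totals.items():
    # The model works on the log of the summed amount.
    print(key, amount, round(math.log(amount), 3))
```

Each group becomes a single observation, which is why the number of observations drops so sharply once totals replace individual contributions.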

This is what the model looked like:

Primary Relationship

m1: log(count_vote) by log(contb_receipt_amt)

Individual Variables

m2: election_type
m3: county
m4: cand_party

Two-way Interactions

m5: election_tp * county
m6: election_tp * cand_party
m7: cand_party * county

I didn’t need to add the final interaction because the contribution values were now summed.

Here’s a summary of the new model:

##         Statistic       m1       m2       m3       m4       m5       m6
## 1       R-squared    0.734    0.773    0.880    0.893    0.912    0.935
## 2  adj. R-squared    0.734    0.772    0.846    0.863    0.844    0.884
## 3           sigma    0.960    0.889    0.730    0.690    0.734    0.634
## 4               F 1122.706  688.240   26.107   29.057   13.394   18.203
## 5               p    0.000    0.000    0.000    0.000    0.000    0.000
## 6  Log-likelihood -561.292 -529.572 -399.881 -375.357 -335.100 -273.495
## 7        Deviance  374.212  320.324  169.625  150.411  123.475   91.291
## 8             AIC 1128.584 1067.145  981.762  936.714 1030.200  910.989
## 9             BIC 1140.618 1083.190 1346.787 1309.761 1752.229 1641.040
## 10              N  408.000  408.000  408.000  408.000  408.000  408.000
##          m7
## 1     0.987
## 2     0.944
## 3     0.440
## 4    23.073
## 5     0.000
## 6    51.028
## 7    18.602
## 8   523.945
## 9  1779.471
## 10  408.000

Again, we read the model information across the columns. Now, the initial model explained 73.4% of the variance, and the final model explained 98.7%.

With each new element introduced to the model there is an increase of at least 1% in the variance explained. For the individual variables, adding in county (m3) made the greatest contribution, an additional 10.7%. For the interactions, election type * county and election type * candidate alignment each added less than 3%.
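
The step-by-step gains can be read straight off the R-squared row of the summary. This short Python sketch does the subtraction, using the values from the table and the term labels from the model listing above.

```python
# R-squared per step, copied from the summary table.
R_SQUARED = {
    "m1": 0.734, "m2": 0.773, "m3": 0.880, "m4": 0.893,
    "m5": 0.912, "m6": 0.935, "m7": 0.987,
}

# Term entered at each step, as listed for this model.
TERMS = {
    "m2": "election_type", "m3": "county", "m4": "cand_party",
    "m5": "election_tp * county", "m6": "election_tp * cand_party",
    "m7": "cand_party * county",
}


def r_squared_gains(r_squared):
    """Increase in R-squared at each step relative to the previous step."""
    names = list(r_squared)
    return {
        names[i]: round(r_squared[names[i]] - r_squared[names[i - 1]], 3)
        for i in range(1, len(names))
    }


gains = r_squared_gains(R_SQUARED)
for step, gain in gains.items():
    print(step, TERMS[step], f"{100 * gain:.1f}%")
```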

5.4.4 Model Explanation

To assist in explaining how the model functions, I made a number of visualizations.

5.4.4.1 Interaction with Election Type

Election type interacts with how total contributions predict vote count.

Contributions from both the primary and general elections are positively correlated with the vote count, but there is a difference in how the predictions function at high contribution ranges.

At lower contribution values, the vote counts predicted by the contributions to the primary and general elections are similar. However, as the total contributions received increases, the vote count predicted from the general election contributions is higher than for the primary election contributions.

5.4.4.2 Interaction with Candidate Alignment

Due to the three types of alignment, the interaction due to candidate alignment is a bit more complicated.

For candidates of the two major parties the interaction occurs as follows. When the value of contributions received is lower, the number of votes predicted for republican candidates will be higher than that predicted for democratic candidates.

However, this order reverses when we get to high total contributions. At high values of total contributions, the votes predicted for democratic candidates are higher than those predicted for republican candidates.

This pattern appears to pick up on what was seen in the relative vote counts. In larger cities democrats were more likely to receive contributions and much higher vote counts but this changed for smaller areas that had a smaller contribution total.

Candidates not from the major parties only have predictions for total contributions under approximately $10,000 and vote counts up to approximately 10,000. At these lower contribution levels, predicted vote counts for the ‘other’ candidates fall below those for the republican candidate and above those for the democratic candidate.

5.4.4.3 Interaction with County

It is difficult to visualize this information with the full set of data so I selected three different counties to show the impact of the interaction with county.

In short, the most important takeaway is that the way in which contributions received predict vote count is different between the counties. In this example we can see the strength of the correlation for Adams and Hamilton is similar, but the variability in errors (the grey shaded areas) associated with these correlations is quite different. Van Wert differs in its correlation between contributions received and vote count and has prediction errors somewhere between the two.

One thing I notice here is that, in contrast to the other two factors, predictions associated with county vary a lot in the range of prediction errors (the shaded grey area) at different contribution values (called heteroscedasticity). I realized that this could cause some issues for the model predictions, and was something to account for.

5.4.4.4 Interaction with Candidate Alignment by Election Type

The interaction between candidate alignment and predictions of vote counts functions similarly for the primary election as it did for the combined data, but for the general election the point at which they cross over is much higher. That is, for the general election contributions, it takes until approximately $100,000 in contributions received (compared to approximately $10,000 in the primary) for the prediction of votes for the democratic candidate to exceed that of the republican candidate.

I also noticed the spread of predictions for these plots. It appears that there is greater precision in prediction around the $10,000 mark for contributions received in either of the primary or general elections.

5.4.4.5 Interaction with Candidate Alignment by County

To show this relationship I have gone back to our three previously selected counties. What we are once again focusing on is that there are differences between the counties.

For the most part, the plot is a replication of what we saw above - points of the same color are in the same places and so are the associated lines. However, if we look at the shape of the dots, we find the differences.

What this plot shows us is how the prediction of vote counts changes across the counties for the different political alignments.

Starting with Hamilton (orange), we can see that for the democrat (dot) and republican (triangle) candidates the differences in contributions received don’t have much impact on the predicted vote count, but there is a substantial difference in the vote count predicted for the ‘other’ candidates. In Van Wert (purple), the differences in contributions received for candidates in each of the political alignments impact the prediction of votes received. Adams doesn’t have any contributions for ‘other’ candidates, but does show differences in predicted vote counts for the different contribution values.

5.4.4.6 Interaction with County by Election Type

For the final plot to explain the interactions, we have our three counties and we can see that the election type to which the contribution was made changes how contributions received predicts vote count.

For the primary election contributions, all the counties have a relatively similar relationship between contributions received and predicted vote count. But for general election contributions, the relationship is dramatically changed for Adams. In Adams, very similar totals for contributions received were associated with very different predictions for vote count.

5.4.5 Alternate Model

Because of the heteroscedasticity associated with predictions by county, I created an additional model with county removed and planned to test the predictions of the full model against this simpler model.

This is what the model looked like:

Primary Relationship

m1a: log(count_vote) by log(contb_receipt_amt)

Individual Variables

m2a: election_type
m3a: cand_party

Two-way Interactions

m4a: cand_party * county

##        Statistics      m1a      m2a      m3a      m4a
## 1       R-squared    0.734    0.773    0.807    0.952
## 2  adj. R-squared    0.734    0.772    0.805    0.895
## 3           sigma    0.960    0.889    0.821    0.604
## 4               F 1122.706  688.240  421.706   16.580
## 5               p    0.000    0.000    0.000    0.000
## 6  Log-likelihood -561.292 -529.572 -495.997 -211.689
## 7        Deviance  374.212  320.324  271.713   67.430
## 8             AIC 1128.584 1067.145 1003.993  871.379
## 9             BIC 1140.618 1083.190 1028.061 1769.903
## 10              N  408.000  408.000  408.000  408.000

This model is much simpler than the full model and still explains 95.2% of the variance in the data. If it can reduce variability in predictions, it could be very helpful to use.

The model functions in a similar fashion to the above, with all of the details related to county removed.

5.4.6 Predicting Results

Now that I had explained the model(s), I wanted to look at the capabilities of the model to predict election results. Based on what I had seen from the functionality plots of the model, I decided to select a county that had contribution values for the general and primary elections from $8,000 to $50,000 (where there was the smallest amount of prediction error observed).

Three counties were identified where contributions for candidates of the two major parties fell within this range for the general and primary elections. These were Erie, Greene, and Portage.

I decided to pick Greene as it had the greatest differences in contribution rates and vote counts. I wasn’t certain how well the model would do with prediction for smaller differences so I wanted to start with something that was most likely to find a difference.

So what does this tell us? First of all, let’s confirm what all of the lines and shading mean. The solid lines show the actual vote count that was found for the county (these will be the same for each chart pair). The dashed lines show what was predicted from the contribution totals for each election type.

The shaded areas show the range of values that fall within a certain confidence interval. For example, for the prediction of the votes for the democratic candidate from the primary election contributions, at a confidence level of 40%, the vote count values range from around 25,000 to slightly under 50,000. We would say that we are 40% confident that the actual vote count value falls in this range. If we look at the solid blue line, it does fall within this range. So that is good for the model.

Looking at the details of the two charts, here’s how it all falls out.

Primary

democrat Predicted: approximately 35,000 votes
democrat Actual: approximately 30,000 votes
republican Predicted: approximately 60,000 votes
republican Actual: approximately 50,000 votes

The predicted vote count values were reasonably in range of the actuals, and the republican candidate was predicted to win over the democratic candidate, which actually occurred. This is a good start for the model!

General

democrat Predicted: approximately 25,000 votes
democrat Actual: approximately 30,000 votes
republican Predicted: approximately 20,000 votes
republican Actual: approximately 50,000 votes

Here we start to run into some problems. The predictions for the democratic candidate are quite close to the actuals, but this is not the case for the republican candidate. In addition, the final outcome is also wrong - the democratic candidate was predicted to win instead of the republican candidate.
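
The "did it get the winner right" check used here can be sketched in a few lines of Python. The vote figures are the approximate Greene values quoted above; the function names are only illustrative.

```python
# Approximate Greene County values quoted in the text.
ACTUAL = {"democrat": 30000, "republican": 50000}
PREDICTED = {
    "primary": {"democrat": 35000, "republican": 60000},
    "general": {"democrat": 25000, "republican": 20000},
}


def winner(votes):
    """Party with the highest vote count."""
    return max(votes, key=votes.get)


def direction_correct(predicted, actual):
    """True when the predicted winner matches the actual winner."""
    return winner(predicted) == winner(actual)


def absolute_errors(predicted, actual):
    """Absolute difference between predicted and actual votes per party."""
    return {party: abs(predicted[party] - actual[party]) for party in actual}


for election_type, predicted in PREDICTED.items():
    print(election_type,
          direction_correct(predicted, ACTUAL),
          absolute_errors(predicted, ACTUAL))
```

Checking direction separately from the size of the error makes the failure clear: the general election prediction is close for the democratic candidate but still names the wrong winner.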

I wanted to compare this to the predictions for the alternate model, but first, here are some other things I observed about the charts.

Across the plots, the shaded areas are the same shape. This isn’t something that I had expected but makes sense since they are coming from the same model.
There is a broad range in the values that create the lower and upper confidence interval boundaries as we move through the confidence levels. While this is to be expected, I noticed that the increases were exponentially related, which is a result of the prediction model using the log of both contribution values and vote counts.
This also means that if you track the lower bound values of the confidence interval, there is much less range in these values. For the predictions from the primary election contributions, the lower bound values range from slightly over 25,000 votes at the 35% confidence level to just over 10,000 votes at the 95% confidence level. For the upper boundary of the confidence intervals, the values range from approximately 40,000 votes to approximately 150,000 votes.
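
The asymmetry described above follows from building the interval on the log scale and then transforming back. The Python sketch below illustrates this under stated assumptions: a normal approximation for the interval, a point prediction of about 35,000 votes, and the residual sigma of 0.440 from the m7 column used as a rough stand-in for the prediction standard error. It is not the model's actual interval calculation.

```python
import math
from statistics import NormalDist


def interval_on_vote_scale(log_pred, log_se, level):
    """Symmetric interval on the log scale, back-transformed to votes."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(log_pred - q * log_se), math.exp(log_pred + q * log_se)


log_pred = math.log(35000.0)  # assumed point prediction of ~35,000 votes
log_se = 0.440                # assumed stand-in for the prediction error

for level in (0.35, 0.65, 0.95):
    low, high = interval_on_vote_scale(log_pred, log_se, level)
    print(f"{int(level * 100)}%: {low:,.0f} to {high:,.0f} votes")
```

Equal steps up and down on the log scale become multiplication and division on the vote scale, so the upper bound moves much further from the prediction than the lower bound does.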

5.4.6.1 Comparison to Alternate Model

The first thing that is noticeable about these plots is that the predictions from both the primary and general election contributions are in the correct direction. In addition, the reason that you cannot see the red dashed line for predicting the results for the republican candidate from the primary election contributions is because it is covered by the actual line - the model predicted the actual result!

When comparing the predictions of the two models, the range of the 95% confidence interval is typically greater for the alternate model, but the differences between the predicted values and the actual values are smaller.

I completed this type of comparison for multiple counties and discovered that these patterns were quite consistent - the original model would sometimes get the order of the predictions flipped, and the differences in predicted and actual values were closer for the alternate model.

As a result, I decided to continue with the alternate model.

One of the things I found interesting about this process is the model that explained the lower amount of variance was actually better at predicting the results. I had recently read an article, aptly called “Is R-Squared Useless?”, where the author discusses how using the amount of variance explained may not be a good method of selecting models for prediction, and that instead, the predictions from the model should be compared and tested and the model selected based on these results. (Which is what I did!) I had wondered if these types of circumstances were more rare and not something that I would encounter, but I apparently discovered this to be true with the very first model I created!

5.4.6.2 Comparison Across Counties

Now that I had selected my model, I wanted to see how it ran across a number of different scenarios. I wanted to look at counties that had wins for democrats and republicans and when the final vote counts were close, and also at different ranges of contribution values.

Close Results

In this case, the results for the two candidates were very close. I was excited to see that the model did predict the correct order for results - more votes for the republican candidate than the democratic candidate. In this case, the predictions from the primary election contributions were much closer to the actual results than those from the general election contributions.

One thing that is worth pointing out here is that typically in these types of circumstances, to say that a difference exists you would want the range of the confidence intervals to actually not overlap. So far, while we’ve seen predictions in the correct direction, we’ve not found a case where there isn’t an overlap - something to keep an eye on.
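
The overlap check described here is easy to state in code. In this Python sketch the democratic interval uses the approximate 40% bounds quoted earlier for Greene (about 25,000 to just under 50,000 votes); the republican interval is a hypothetical value added only to make the example run.

```python
def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


democrat_interval = (25000, 50000)    # approximate bounds quoted in the text
republican_interval = (40000, 90000)  # hypothetical bounds for illustration

if intervals_overlap(democrat_interval, republican_interval):
    print("Intervals overlap: the predicted difference is not clear-cut.")
else:
    print("Intervals are separate: the predicted difference is clear.")
```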

republican Win - Lower Range

In this case, the predictions from the general election were more accurate - the predictions from the primary underestimate the results - but each had a reasonable prediction of the relative differences between the votes.

republican Win - Higher Range

A pattern similar to the one above is seen. Again, the predictions from the general election were more accurate, and actually incredibly close. Both got the order correct and the predictions from the primary had a similar difference between the candidates but underestimated the result counts.

democrat Win

So far, all of the counties we have viewed had wins for the republican candidate, so I wanted to see how well the model did at prediction when the democratic candidate won.

For the first county I selected it did not look good. Predictions from both the primary and general election contributions had the republican candidate winning. I looked into this further and discovered that in this county, while the democratic candidate did win, the contributions for the democratic candidate were quite a bit lower than for the republican candidate, so I decided to look at some other counties.

democrat Win - Lower Range

Now the model was able to predict the win for the democratic candidate over the republican candidate, but the actual values predicted for the democratic candidate differed far more from the actual than they did for the republican candidate.

This is the first time that we’ve seen a clear difference in the shaded areas, but the predicted difference between the vote counts was much greater than the actual difference.

democrat Win - Higher Range

The order of predictions is once again correct, with a more accurate prediction from the general election contributions. Both sets of predictions show a relatively small difference between the two candidates.

5.5 Summary of Findings from Multivariate Analysis

5.5.1 Key Findings

The multivariate analysis supported the relationships that had been observed during the bivariate analysis. For the primary relationship between contribution values and vote counts, I found that all of candidate political alignment, county, and the election type to which a contribution was made, created differences in the primary relationship.

These interactions also layered on top of each other so that values for contributions and vote counts each differed with a combination of the ‘other’ variables. Election type created a difference in the pattern of contributions to candidates of the different political alignments, contribution patterns differed across counties between the primary and general elections, and vote counts differed between the candidates across the counties.

I also observed that county acted as a mitigating factor in how candidate political alignment interacted with the prediction of vote counts from contribution values.

5.5.2 Surprising Elements

Perhaps the most surprising element is that while I was able to find interactions between the other variables and the primary relationship, I was still able to explain less variance than I would have expected using individual contributions.

5.5.3 Model Strengths and Weaknesses

Creation of the model also resulted in some surprises. I wasn’t able to find the strength of relationship for prediction of vote counts based on individual contributions that I was hoping to find and so switched to using the totals of contributions per county.

The original model I proposed explained over 98% of the variance, but I noticed that from a prediction perspective, it suffered from some heteroscedasticity that might impact the quality of its predictions. This ended up being the case and so I moved forward with the alternate model. It explained slightly less variance, at approximately 95%, but this was still a substantial amount.

In terms of predicting what I set out to predict - vote counts from contribution values - the model did well in some areas but also had weaknesses. It did quite well at predicting the order of vote counts (who would win), but the confidence intervals for the ranges in the values were very large. Even at low confidence percentages, there was often not a difference observed between the confidence intervals for the candidates of the two major parties.

In addition, the following was observed:

Results were more accurate for republican vote counts
General was typically more accurate than the Primary
While the model can correctly predict wins for republican and democratic candidates, it is susceptible to errors when wins are associated with a much lower total value of contributions.
Of the counties reviewed, the model correctly predicted the direction of vote counts in 6 out of the 7 counties (85.7%)

6 Final Plots and Summary

6.0.1 Election Type Interacts with Contribution Patterns to Candidates

When looking to predict vote counts, one of the features to understand is that contribution patterns to candidates change throughout the election cycle and especially between the two election types - primary elections where the party candidates are selected, and general elections where the president is elected.

In Ohio, contributors to republican candidates made a greater value of contributions to candidates in the primary election than the general election and in the primary election their contributions almost doubled those made to democratic candidates. However, in the general election, more contributions were made to Clinton (the democratic nominee) than to Trump (the republican nominee).

If pundits were only, or primarily, looking at values of contributions within the general election period to make their predictions of who would win the presidential election, this may be one reason that predictions were wrong regarding the eventual winner.

There are a number of elements that do help give us a better picture of how contributions impact the number of votes a presidential election candidate will receive. These include the party of the candidate, and the type of election to which the contribution was made.

6.0.2 Contributions to Candidates Align with Election Results

If we map the contributions and compare those with the differences in votes received for Clinton and Trump, we can see that areas with concentrations of contributions towards Clinton align with the counties in which she won the vote count.

Using the general election contributions map, we can again see how, if these were the only results evaluated, they might be taken as a sign of support for Clinton over Trump. It does make sense to consider that there is some correlation between contributions received and vote counts, but without the inclusion of the contributions in the primary elections, valuable information is missing.

6.0.3 Predictions Can be Made from Contribution Values

If we do incorporate contribution values from both the primary and general elections and look at how they interact with candidate political alignment, we are able to make reasonably accurate predictions about who will win in a particular county. While the exact number of votes may be more in question, the model that was developed is able to consistently predict when Clinton would receive the most votes and when Trump would receive the most votes, even when the final election results were very close.

And we finally have what we have been waiting for!! In answer to the question posed at the beginning of this investigation, yes, there was information available to suggest that Trump would have the types of victories he came away with. If I were able to obtain information about the contributions made by candidate alignment across the election types, I believe that I would have a fair chance of predicting who would win the election.

One side issue that I will note when using contribution data for predictions - I am uncertain when this information is made publicly available. If the data is regularly updated and available throughout the election cycle (I wasn’t able to work out if this was the case), then it is of benefit for predicting. But if it is only available after the fact, then while the model may have some reasonable predictive capacity, it wouldn’t be very helpful in predicting the winner BEFORE the result occurred!

And if you made it this far, I sincerely congratulate you! It was a long and winding journey, but wasn't it fun?! As long as it took, I really did have a fantastic time getting here.

7 Reflection

7.1 Unexpected Occurrences

The way the correlations ended up unfolding was unexpected for me. I didn’t expect such a high correlation between the contribution totals and the vote counts given that this is real world data. In addition, I wasn’t expecting such a low final correlation when using individual contributions compared to contribution totals.

I have a suspicion that it still might be possible to use the individual contribution information, but I’m not certain how to go about doing this.

7.2 Successes

There were two key turning points in the progression of the project that reinforced that I was moving in the correct direction to achieve the goal I had set for myself.

The first was the creation of the maps showing the individual contributions by size and candidate alignment, and then splitting into election type. I actually completed this plot well before I had done a lot of other analysis because I was wanting to test my skills to see if I could produce the map in the first place. The resulting maps really helped me clue into some of the factors that might be influencing the final relationships. They also helped me to believe that it was worth persisting when I ran into other difficulties, because they confirmed that there really was something to find.

The second breakthrough was the creation of the set of plots faceted by county that showed the different relationships between contribution values and vote counts for the candidates across the state. It gave me an understanding that there was interconnectivity between the other variables and their connection to the primary relationship that I was exploring.

The final success was building the model. While it didn’t do exactly what I had hoped it would do - confidently predict differences in vote counts between major party candidates - it did consistently predict the winner, which to some extent covers what is typically looked at in an election. This was my first foray into independently building a prediction model and I was quite encouraged by the result.

7.3 Challenges

Definitely the most challenging, or time consuming, component of the process was the data cleaning. This was a dataset about which I had no previous knowledge (apart from generally being aware of what happens in US elections) to inform my investigation of the data. I had to research most elements of the data to confirm how they functioned, to understand what I could and couldn’t drop. Understanding how refunds could work, to finally make a decision to simply drop all associated records, took the better part of a day!

Another thing that I found somewhat challenging was compartmentalizing and limiting my thought progression to move from univariate, to bivariate, to multivariate. I kept wanting to run ahead as I could see different elements playing in. However, by disciplining myself to focus on the section on which I was currently working, I was able to find insights that I may not have otherwise found and I did gain a much better understanding of the data as a whole. When I finally got around to modelling the data, the step by step process became even more understandable as this step-wise progression is also how we enter data into the model, and it makes sense for us to fully understand the relationships at each level to help us to decide what should be included in the model.

My final challenges related more to continuing to learn how to code. There were a couple of times where I had built a foundational plot or set of variables and then built other elements of the investigation on top of it. I sometimes decided that I wanted to change a name or function of the foundational components and this would set off a chain reaction of errors. It reinforced that I need to make sure that I am happy with the original structure before moving on and building on it!

7.4 Missing Elements and Potential Next Steps

The largest shortcoming of this work, in my opinion, is the wide confidence intervals that were found with the prediction model. Even at relatively low confidence levels, the range of predicted vote counts was still often quite wide.

I believe that a large source of this is that when the contribution values are totaled, it results in a relatively small number of comparisons - just over 400 observations. When this information is split out into counties, it becomes even smaller. One of the potential ways to remedy this could be to utilize bootstrapping to create more robust sampling. It would be possible to take the original set of over 150,000 contributions and randomly sample them to build up many different representations of the possible combinations of vote counts and total contributions per county. This could then be combined together in building the model to hopefully reduce the range of the confidence intervals.
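
The resampling step of that idea could look something like the sketch below. It is only an illustration under stated assumptions: the `(county, amount)` record layout, the function names and the sample amounts are all hypothetical, and it covers only the resampling and the per-county totals, not the step of feeding the resampled totals back into the model.

```python
import random
from collections import defaultdict


def bootstrap_county_totals(contributions, n_resamples=1000, seed=0):
    """Resample contributions with replacement and total them per county.

    contributions: list of (county, amount) tuples (hypothetical layout).
    Returns {county: [total for each resample]}.
    """
    rng = random.Random(seed)
    n = len(contributions)
    counties = sorted({county for county, _ in contributions})
    results = defaultdict(list)
    for _ in range(n_resamples):
        # Draw a sample the same size as the original, with replacement.
        sample = rng.choices(contributions, k=n)
        totals = dict.fromkeys(counties, 0.0)
        for county, amount in sample:
            totals[county] += amount
        for county in counties:
            results[county].append(totals[county])
    return dict(results)


def percentile_interval(values, lower=2.5, upper=97.5):
    """Nearest-rank percentile interval of a list of resampled totals."""
    ordered = sorted(values)

    def pick(percent):
        return ordered[round(percent / 100 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


# Made-up amounts for two of Ohio's counties, for illustration only.
contributions = [
    ("Lake", 25.0), ("Lake", 100.0), ("Lake", 50.0), ("Lake", 250.0),
    ("Stark", 10.0), ("Stark", 500.0), ("Stark", 75.0), ("Stark", 40.0),
]
resampled = bootstrap_county_totals(contributions, n_resamples=500, seed=1)
intervals = {c: percentile_interval(v) for c, v in resampled.items()}
for county, (low, high) in intervals.items():
    print(f"{county}: {low:.2f} to {high:.2f}")
```

Each resample gives one plausible set of county totals, so the spread across resamples gives a feel for how sensitive the totals are to which individual contributions happened to be included.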

There are also a number of additional steps that could be taken to further explore the work that I have done.

Use the model to predict the results from other states - do the prediction capabilities hold up outside of Ohio?
Use the model to predict the results for Ohio for the presidential election in other years - do the patterns of the 2016 election cycle mirror those found in previous years?
I used data from very late in the election cycle (up to a few days beforehand). It would be interesting to see what predictions are possible earlier on; even predicting from the primary contributions alone seemed useful in predicting the winner with this model.
Given that Ohio does have bellwether counties, it could be interesting to build a model that predicts using information from just those counties. Essentially, this and the point above both ask, “What is the minimum amount of data you would need to still consistently predict the outcome?”

The final additional development that I think could yield some productive results would be to rebuild the model to predict a candidate's win or loss, instead of the actual vote counts. This may prove more successful in increasing confidence that differences between the candidates are seen.
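
One common way to predict a win or loss directly is logistic regression, which is my own suggestion here and not something already tried in this project. The sketch below fits a one-variable version by plain gradient descent. Everything in it is an assumption for illustration: the single predictor (a candidate's share of a county's contribution value), the function names, and the made-up data points.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit P(win) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def predict_win_probability(share, w, b):
    """Predicted probability of a win given a contribution share in [0, 1]."""
    return sigmoid(w * (share - 0.5) + b)


# Made-up county-level data, for illustration only:
# the candidate's share of contribution value, and whether they won (1) or lost (0).
shares = [0.20, 0.35, 0.45, 0.55, 0.60, 0.75, 0.40, 0.65]
won = [0, 0, 0, 1, 1, 1, 1, 0]

# Center the share at 0.5 so that b reflects the odds at an even split.
centered = [s - 0.5 for s in shares]
w, b = fit_logistic(centered, won)

for share in (0.1, 0.5, 0.9):
    print(f"share={share:.1f} -> P(win)={predict_win_probability(share, w, b):.2f}")
```

The output of a model like this is a probability of winning rather than a vote count, so the uncertainty is expressed directly in terms of the outcome that matters.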
