Machine Learning: Charity Donor Analysis
Introduction
A charitable organization wishes to develop a machine learning model to improve the cost effectiveness of its direct marketing campaigns to previous donors. Recent mailing records reflect an overall 10% response rate with an average donation of $14.50; each mailing costs $2.00 to produce and send. The data consist of 3,984 training observations, 2,018 validation observations and 2,007 test observations. Weighted sampling that overrepresents responders has been used, so the training and validation samples contain approximately equal numbers of donors and nondonors. In addition, a second model should be built, using the records of donors only, to predict expected gift amounts.
Analysis
Exploratory Data Analysis
Based on R’s describe function, the initial dataset contains 22 variables and 8,009 observations with no missing data. The observations reflect prior mailing data, including whether the mailing was successful and a donation was received, the amount of that donation, and numerous other attributes that may or may not help determine who should receive future donation requests in order to maximize donations. Categorical and numerical variables, along with brief descriptions and statistical summaries, are shown below.
Categorical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| reg1 | Geographic region | 0.2 | 0.4 | 0 | 0 | 1 | 1.5 | 1.44e-01 | 1.38e-02 | 10.462 | <2e-16 *** |
| reg2 | Geographic region | 0.32 | 0.47 | 0 | 0 | 1 | 0.78 | 2.92e-01 | 1.23e-02 | 23.793 | <2e-16 *** |
| reg3 | Geographic region | 0.13 | 0.34 | 0 | 0 | 1 | 2.15 | 4.74e-03 | 1.59e-02 | 0.299 | 0.76505 |
| reg4 | Geographic region | 0.14 | 0.35 | 0 | 0 | 1 | 2.08 | 2.36e-02 | 1.54e-02 | 1.53 | 0.12609 |
| home | Homeowner; 1 = homeowner, 0 = not | 0.87 | 0.34 | 1 | 0 | 1 | 2.16 | 3.23e-01 | 1.37e-02 | 23.583 | <2e-16 *** |
| wrat | Wealth rating 0 to 9; 9 highest | 6.91 | 2.43 | 8 | 0 | 9 | 1.35 | 3.50e-02 | 1.87e-03 | 18.651 | <2e-16 *** |
| genf | Gender (0/1 indicator) | 0.61 | 0.49 | 1 | 0 | 1 | 0.43 | 1.54e-03 | 8.90e-03 | 0.173 | 0.86246 |
| donr | Donor; 1 = donor, 0 = nondonor | 0.5 | 0.5 | 0 | 0 | 1 | 0 | | | | |
Numerical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chld | Number of children | 1.72 | 1.4 | 2 | 0 | 5 | 0.27 | 1.65e-01 | 3.11e-03 | 52.918 | <2e-16 *** |
| hinc | Household income (7 categories) | 3.91 | 1.47 | 4 | 1 | 7 | 0.01 | 3.42e-01 | 1.24e-02 | 27.476 | <2e-16 *** |
| I(hinc^2) | | | | | | | | 4.24e-02 | 1.53e-03 | 27.643 | <2e-16 *** |
| avhv | Average home value in $000s | 5.14 | 0.37 | 5.13 | 3.87 | 6.57 | 0.14 | 1.35e-02 | 2.23e-02 | 0.604 | 0.5459 |
| incm | Median family income in $000s | 43.47 | 24.71 | 38 | 3 | 287 | 2.05 | 1.59e-03 | 3.74e-04 | 4.262 | 2.06e-05 *** |
| inca | Average family income in $000s | 56.43 | 24.82 | 51 | 12 | 305 | 1.94 | 3.27e-05 | 4.32e-04 | 0.076 | 0.93969 |
| plow | % categorized as “low income” | 14.23 | 13.41 | 10 | 0 | 87 | 1.36 | 1.40e-03 | 5.08e-04 | 2.76 | 0.00581 ** |
| npro | Lifetime # of promotions received to date | 60.03 | 30.35 | 58 | 2 | 164 | 0.31 | 1.46e-03 | 2.06e-04 | 7.103 | 1.36e-12 *** |
| tgif | $ value of lifetime gifts to date | 113.07 | 85.48 | 89 | 23 | 2057 | 6.55 | 1.31e-04 | 7.31e-05 | 1.786 | 0.07422 . |
| lgif | $ value of largest gift to date | 22.94 | 29.95 | 16 | 3 | 681 | 7.81 | 5.88e-05 | 2.22e-04 | 0.265 | 0.79094 |
| rgif | $ value of most recent gift | 15.66 | 12.43 | 12 | 1 | 173 | 2.63 | 3.46e-04 | 5.65e-04 | 0.612 | 0.54062 |
| tdon | # months since last donation | 18.86 | 5.78 | 18 | 5 | 40 | 1.1 | 5.27e-03 | 7.77e-04 | 6.78 | 1.32e-11 *** |
| tlag | # months between 1st and 2nd gift | 6.36 | 3.7 | 5 | 1 | 34 | 2.42 | 1.23e-02 | 1.20e-03 | 10.29 | <2e-16 *** |
| agif | Average $ value of gifts to date | 11.68 | 6.57 | 10.23 | 1.29 | 72.27 | 1.78 | 1.10e-03 | 9.78e-04 | 1.12 | 0.26285 |
| damt | Donation amount in $ | 7.21 | 7.36 | 0 | 0 | 27 | 0.12 | | | | |
Correlations
Reviewing the correlations between the variables and the donation amount (damt), some of the larger correlations are as follows:
| | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat |
|---|---|---|---|---|---|---|---|---|---|---|
| tdon | 0.0693 | 0.0180 | 0.0168 | 0.0157 | 0.0222 | 0.0162 | 0.0602 | 0.0147 | 0.0033 | 0.0215 |
| donr | 0.0308 | 0.0565 | 0.2471 | 0.1043 | 0.1263 | 0.2890 | 0.5308 | 0.0277 | 0.0173 | 0.2493 |
| damt | 0.0277 | 0.0469 | 0.2115 | 0.0837 | 0.0887 | 0.2877 | 0.5513 | 0.0453 | 0.0190 | 0.2429 |

| | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag |
|---|---|---|---|---|---|---|---|---|---|---|
| avhv | 1.0000 | 0.6939 | 0.8072 | 0.7118 | 0.0030 | 0.0178 | 0.0167 | 0.0052 | 0.0064 | 0.0061 |
| incm | 0.6939 | 1.0000 | 0.8729 | 0.6555 | 0.0171 | 0.0417 | 0.0067 | 0.0027 | 0.0139 | 0.0213 |
| inca | 0.8072 | 0.8729 | 1.0000 | 0.6346 | 0.0183 | 0.0354 | 0.0075 | 0.0054 | 0.0163 | 0.0169 |
| tgif | 0.0178 | 0.0417 | 0.0354 | 0.0176 | 0.7266 | 1.0000 | 0.1734 | 0.0736 | 0.0113 | 0.0117 |
| lgif | 0.0167 | 0.0067 | 0.0075 | 0.0059 | 0.0013 | 0.1734 | 1.0000 | 0.6961 | 0.0036 | 0.0168 |
| rgif | 0.0052 | 0.0027 | 0.0054 | 0.0136 | 0.0125 | 0.0736 | 0.6961 | 1.0000 | 0.0063 | 0.0122 |
| agif | 0.0000 | 0.0103 | 0.0002 | 0.0137 | 0.0022 | 0.0558 | 0.6096 | 0.7053 | 0.0080 | 0.0299 |

| | agif | donr | damt |
|---|---|---|---|
| chld | 0.0149 | 0.5308 | 0.5513 |
| rgif | 0.7053 | 0.0149 | 0.0851 |
| donr | 0.0095 | 1.0000 | 0.9826 |
Logistic Regression
Some of the significant variables are shown below; additional insight into which variables matter most for each model is provided in the discussion of that model.
| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.5184 | 0.0661 | 7.845 | 4.34e-15 *** |
| reg1 | 0.6449 | 0.0716 | 9.012 | <2e-16 *** |
| reg2 | 1.4807 | 0.0842 | 17.577 | <2e-16 *** |
| home | 1.3657 | 0.0836 | 16.342 | <2e-16 *** |
| chld | 2.3711 | 0.0854 | 27.779 | <2e-16 *** |
| I(hinc^2) | 1.0750 | 0.0530 | 20.283 | <2e-16 *** |
| wrat | 0.9726 | 0.0667 | 14.589 | <2e-16 *** |
| npro | 0.4552 | 0.0786 | 5.790 | 7.05e-09 *** |
| tdon | 0.2945 | 0.0603 | 4.883 | 1.04e-06 *** |
| tlag | 0.5543 | 0.0602 | 9.207 | <2e-16 *** |
Classification Modeling
Goal: Maximize the gross margin (expected revenue less the costs) on the marketing campaign
Utilizing the data, various machine learning modeling techniques were applied to determine which households should receive the marketing materials, i.e. those most likely to be donors. Each technique produces a confusion matrix that is used to determine the percentage of mailers that are successful versus those that do not ultimately result in a donation. For example, Quadratic Discriminant Analysis (QDA) produces the following confusion matrix:
| | Actual nondonor | Actual donor | Total |
|---|---|---|---|
| Predicted nondonor (not mailed) | 526 | 23 | 549 |
| Predicted donor (mailed) | 493 | 976 | 1,469 |
The matrix reflects which campaign targets are predicted to donate. The predictions will not always be correct, so some funds will be spent without a return: targeting the 1,469 households predicted to be donors yields only 976 donations. On the flip side, a cost saving of 549 mailouts is achieved by not sending the campaign to targets predicted not to donate. Accuracy measures how often the model is correct, counting both those correctly predicted not to donate and those correctly predicted to donate, and is calculated as (526 + 976) / (549 + 1,469). Thus, our accuracy is 74.43%.
The second part of the analysis deals with the size of the donation, but for now we use past experience, which reflects an average donation of $14.50 against a cost of $2.00 per mailing. The margin is therefore ($14.50 × 976) − ($2.00 × 1,469) = $11,214. It is easy to see why it pays to mail only to those most likely to donate.
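As a quick sketch of this gross-margin arithmetic (in Python rather than the R used for the analysis), profit and accuracy can be computed directly from the confusion matrix counts:

```python
# Expected campaign profit and accuracy from a 2x2 confusion matrix.
# Figures from the report: $14.50 average donation, $2.00 cost per mailing.
AVG_DONATION = 14.50
COST_PER_MAIL = 2.00

def campaign_metrics(tn, fn, fp, tp):
    """tn/fn: households not mailed; fp/tp: households mailed (predicted donors)."""
    mailed = fp + tp
    profit = AVG_DONATION * tp - COST_PER_MAIL * mailed
    accuracy = (tn + tp) / (tn + fn + fp + tp)
    return profit, accuracy

# QDA confusion matrix above: 526 correct non-mailings, 23 missed donors,
# 493 wasted mailings, 976 successful mailings.
profit, acc = campaign_metrics(526, 23, 493, 976)
print(round(profit, 2), round(acc * 100, 2))  # 11214.0 74.43
```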
A summary by each machine learning method is as follows:
Logistic Regression
Logistic regression models the probability of a binary response, for instance whether or not someone will donate to the charity, based on one or more predictor variables.
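As a minimal illustration of the logistic link (a Python sketch; the coefficient values here are hypothetical, not taken from the fitted model):

```python
import math

def logistic_prob(x, coefs, intercept):
    """P(response = 1) via the logistic function: 1 / (1 + e^(-eta))."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative only: two standardized predictors with made-up coefficients.
p = logistic_prob([1.0, 0.5], coefs=[0.64, 1.48], intercept=0.52)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```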
Two logistic models were created based on the earlier analysis: the first (Log1) uses all 21 variables, while the second (Log2) uses 9 variables selected for their significance. The accompanying plot (not reproduced here) showed profit for the logistic models.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Log1 | 1,291 | $11,642.50 | 83.74628% |
| Log2 | 1,389 | $11,533.50 | 79.48464% |
Variables Log2: chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif
Linear Discriminant Analysis (LDA)
LDA uses a Bayes classifier and a threshold on the posterior probability to assign observations to a class, plugging in estimates for several parameters, including a weighted average of the sample variances for each class. The first LDA model uses 21 variables, including a transformation of hinc; LDA2 uses a subset of the most significant variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| LDA1 | 1,329 | $11,624.50 | 82.25966% |
| LDA2 | 1,348 | $11,557.50 | 81.11992% |
Variables in LDA2: reg1, reg2, reg3, chld, home, I(hinc^2), wrat, npro, tdon, tlag
Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations in each class are drawn from a Gaussian distribution and forms predictions by plugging parameter estimates into Bayes’ theorem. However, QDA fits a quadratic decision boundary and estimates a separate covariance matrix for each class, making it more flexible than LDA, with lower bias at the cost of higher variance.
Each QDA model used a different set of variables: QDA1 uses nine variables, QDA2 all 21 variables, and QDA3 seven variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| QDA1 | 1,469 | $11,214.00 | 83.25074% |
| QDA2 | 1,372 | $11,219.50 | 77.94846% |
| QDA3 | 1,402 | $11,232.00 | 76.95738% |
Variables: QDA1: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag; QDA3: chld, home, wrat, hinc, reg2, tdon, incm
K-Nearest Neighbors (KNN)
KNN classifies an observation based on its k nearest neighbors; in weighted variants, nearer neighbors contribute more to the vote than observations further away. A small value of k gives the most flexible fit, with low bias but high variance because the prediction relies on very few observations; conversely, a larger k gives a smoother, less variable fit but may introduce bias by masking some of the structure.
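The voting logic can be sketched in a few lines (a toy Python version using the unweighted majority vote; the report's models were fit in R):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; unweighted vote)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled 0 (nondonor) and 1 (donor).
X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [5, 6], k=3))  # 1
```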
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| KNN k=1 | 1,037 | $11,306.00 | 88.15659% |
| KNN k=2 | 1,114 | $11,155.50 | 86.76908% |
| KNN k=3 | 1,101 | $11,573.00 | 90.0892% |
| KNN k=4 | 1,101 | $11,573.50 | 88.80079% |
| KNN k=5 | 1,130 | $11,718.00 | 90.03964% |
| KNN k=6 | 1,142 | $11,737.50 | 89.74232% |
| KNN k=7 | 1,142 | $11,813.50 | 90.58474% |
| KNN k=9 | 1,150 | $11,852.00 | 84.8696% |
| KNN k=10 | 1,157 | $11,765.50 | 89.39544% |
Generalized Additive Models (GAM)
GAM models go beyond linear models by allowing nonlinear functions of each variable while remaining additive, meaning that the effect of a change in one predictor on the response is independent of the values of the other predictors. One implicit assumption, then, is that interactions among marketing activities do not create synergies; otherwise an interaction term would be needed, for example when a mailing campaign is combined with radio, television or other marketing activities. GAM fits smoothing splines via backfitting, which accommodates multiple predictors by repeatedly updating the fit for each predictor in turn.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| GAM1 | 1,036 | $10,818.50 | 87.56194% |

Variables: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag
Decision Trees
Decision trees can be applied to both regression and classification problems and are commonly used for their ease of interpretability. Starting at the top of the tree, the data are split into branches based on the best split at that particular junction, without looking ahead to determine whether a different split would pay off later; this is known as a ‘greedy’ approach. At each split, the predictor space is divided to achieve the greatest possible reduction in the RSS. Because growing a tree fully on the training set is likely to produce an overly complex tree that overfits the data, one alternative is to grow the tree only as long as the decrease in the RSS exceeds a threshold.
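The greedy, RSS-minimizing split described above can be sketched as follows (a toy Python version for a single numeric predictor):

```python
def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Greedy single-predictor split: pick the cutpoint that most reduces RSS."""
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Toy example with a clean break between low and high responses.
cut, split_rss = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
print(cut, split_rss)  # 10 0.0
```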
Several decision trees were also modeled, with 15, 9 and 5 terminal nodes. The results are as follows:
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Decision tree – 15 | 1,168 | $11,149.00 | 84.83647% |
| Decision tree – 9 | 962 | $10,038.50 | 84.5887% |
| Decision tree – 5 | 1,078 | $10,212.50 | 81.61546% |

Bagging
Bagging focuses on reducing the variance of a statistical learning method such as decision trees. It does so by taking repeated bootstrap samples from the training set, fitting a tree to each (often hundreds or thousands of trees), and averaging all of the predictions, which lowers the high variance of individual trees while preserving their low bias.
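The two ingredients of bagging, bootstrap resampling and averaging (majority-voting) the resulting predictions, can be sketched as follows (Python, illustrative only):

```python
import random

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from an n-row training set."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Bagging for classification: average the trees by majority vote."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(len(sample), majority_vote([1, 1, 0]))  # 10 1
```

In practice each bootstrap sample would be used to grow a full tree; here only the resampling and voting mechanics are shown.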
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=20 | 1,037 | $11,063.00 | 88.8999% |
| mtry=10 | 1,050 | $11,167.50 | 89.14767% |
| mtry=6 | 1,063 | $11,301.00 | 89.5441% |
| mtry=5 | 1,055 | $11,143.00 | 88.80079% |
Random Forest
Random forests improve on bagging by tweaking how the splits in each decision tree are chosen: at each split, a fresh random sample of predictors is drawn from the full set, and only those candidates are considered. This decorrelates the trees, making the average of the resulting trees less variable and more reliable.
Additionally, the importance function in R provides a list of the most important variables. As shown below, this list informed the choice of the ‘mtry’ setting, which controls how many candidate variables are considered at each split, in each of the random forest models. The two most important variables were chld and home.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=5 | 1,059 | $11,149.50 | 87.0998% |
| mtry=3 | 1,065 | $11,152.00 | 88.60258% |
| mtry=2 | 1,101 | $11,341.00 | 88.50347% |

Neural Network
A neural network mimics the learning pattern of biological neural networks. It begins with a single perceptron that receives inputs, applies weights and passes the result through an activation function to produce an output; perceptrons are then layered to create a network. Hidden layers are the layers between the input and output layers, whose intermediate values are not directly observed (Portilla, Jose. 2016. A Beginner’s Guide to Neural Networks with R! Retrieved from kdnuggets.com).
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Neural network – 10 10 50 | 1,047 | $10,767.50 | 86.57086% |
| Neural network – 10 10 | 1,054 | $10,855.00 | 86.86819% |

Support Vector Machine
Support vector machines are designed for binary classification problems such as donr, including cases with nonlinear class boundaries, which are addressed by enlarging the feature space using kernels (for example, polynomial functions of the predictors). A kernel is a function that quantifies the similarity of two observations. A radial kernel is influenced primarily by nearby training observations; other kernel types include linear, polynomial and sigmoid.
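A radial (RBF) kernel can be written in a few lines (Python sketch; gamma is the usual tuning parameter controlling how quickly similarity decays with distance):

```python
import math

def radial_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2): near 1 for close
    observations, decaying toward 0 as they move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(radial_kernel([0, 0], [0, 0]))            # 1.0 (identical points)
print(round(radial_kernel([0, 0], [3, 4]), 6))  # 0.0 (distant points)
```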
| Model | Kernel | Number mailed | Profit | Accuracy |
|---|---|---|---|---|
| tune.out | linear | 816 | $8,750.00 | 81.0208127% |
| tune.out2 | linear | 1,060 | $10,422.50 | 83.6967294% |
| tune.out3 | linear | 1,056 | $10,401.50 | 83.6967294% |
| tune.out4 | polynomial | 1,068 | $9,957.00 | 80.227948% |

Prediction Modeling
Goal: Minimize the prediction error based on the donation amount
The aim is to quantify the extent to which the predicted response for a given observation is close to its true response. This is measured using the mean squared error (MSE), which is small when the predicted responses are close to the true responses.
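As a concrete sketch (Python, with made-up donation amounts):

```python
def mse(actual, predicted):
    """Mean squared error: average squared gap between truth and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy donation amounts ($) versus hypothetical model predictions.
print(mse([14.5, 10.0, 20.0], [14.0, 11.0, 19.0]))  # (0.25 + 1 + 1) / 3 = 0.75
```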
Least Squares Regression
Least squares regression measures closeness by choosing the coefficients that minimize the sum of squared residuals.
Three least squares models were fit: LS1 uses all of the variables; LS2 uses all except genf and wrat but adds the transformed hinc^2; LS3 uses the most significant factors as determined by the random forest model.
| Model | Mean Prediction Error | Standard Error |
|---|---|---|
| LS1 | 1.867523003 | 0.1696615221 |
| LS2 | 1.973121015 | 0.1720099842 |
| LS3 | 1.857983465 | 0.1699141406 |
Variables in LS3: chld, hinc, reg2, home, wrat, tdon, tgif, incm, npro, tlag, avhv, inca, plow, agif, lgif, rgif, reg1, reg4, reg3.
Principal Component Regression
Principal component regression reflects the trade-off in choosing the number of components that explain most of the variability in the data; the selected components are then fit using least squares.
The training variance explained by each number of components is shown below; for example, using half of the components (10 of 20) explains 77.67% of the variance in the predictors. The associated scree plot (not reproduced here) also reflects ‘elbows’ where the incremental variance explained drops off.
| # Components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 16.09 | 27.91 | 36.73 | 45.01 | 51.11 | 56.8 | 62.35 | 67.59 | 72.71 | 77.67 |
| damt | 0.03 | 28.46 | 28.54 | 36.52 | 46.58 | 47.08 | 48.92 | 49.23 | 49.36 | 49.57 |

| # Components | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 82.46 | 87.1 | 90.46 | 92.7 | 94.8 | 96.27 | 97.61 | 98.64 | 99.57 | 100 |
| damt | 49.77 | 51.17 | 51.98 | 51.98 | 56.49 | 56.5 | 56.58 | 56.58 | 57.18 | 57.22 |
| Model – # Components | Mean Prediction Error | Standard Error |
|---|---|---|
| PCR – 2 | 2.953948616 | 0.2225891948 |
| PCR – 3 | 2.946779495 | 0.2230407325 |
| PCR – 5 | 2.15543888 | 0.1864263297 |
| PCR – 10 | 2.079405939 | 0.1839018125 |
| PCR – 15 | 1.865496754 | 0.1698902112 |
| PCR – 20 | 1.867523003 | 0.1696615221 |
The table above reflects standardized predictors, with tenfold cross-validation error computed for each possible number of components. Based on the MPE, PCR with 15 components gives the lowest error.
Partial Least Squares (PLS) Regression
Partial Least Squares is a dimension reduction method that first identifies a new set of features that are linear combinations of the original features and then fits a linear model by least squares. Unlike PCR, it uses the response when constructing the new features, favoring directions most strongly related to the response variable.
| PLS – # of Components | Mean Prediction Error | Standard Error |
|---|---|---|
| 6 | 1.868145 | 0.1700303 |
| 3 | 1.815055 | 0.1701451 |
| 2 | 1.852769 | 0.1712954 |
Best Subset Selection with Kfold CrossValidation
Best subset selection fits a separate least squares regression to each possible combination of the predictors and, for each model size, keeps the combination with the smallest RSS; the final model size is then chosen by cross-validation.
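One reason best subset selection becomes infeasible for large numbers of predictors: with p predictors there are 2^p candidate models to fit. A quick Python check:

```python
from itertools import combinations

def subset_count(p):
    """Best subset selection fits one model per predictor combination: 2^p total."""
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(subset_count(10))  # 1024 candidate models for 10 predictors
```

At p = 20, as in this dataset, that is already over a million candidate models, which is why branch-and-bound or stepwise shortcuts are common.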
The procedure results in a ten-variable model consisting of: reg3, reg4, home, chld, hinc, incm, plow, npro, rgif, agif.
| Model – Best Subset w/ k-Fold Cross-Validation | Mean Prediction Error | Standard Error |
|---|---|---|
| 10 Variables | 1.812687159 | 0.1686750592 |
The best subset selection plot (not reproduced here) shows, in its top row, the variables included in the optimal model. Optimal models are identified using four different statistics: R-squared, adjusted R-squared, Mallows’ Cp and the Bayesian Information Criterion (BIC).
Ridge Regression
The grid of lambda values was expanded to cover a wide range of scenarios, and the variables are standardized by default. The best value of lambda is selected using cross-validation.
| Ridge Regression Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.1107589 | 1.873351 | 0.1711236 |
The Lasso
Whereas ridge regression always retains all of the predictors, the lasso overcomes this by performing variable selection: it seeks the coefficients with the smallest RSS subject to a constraint on their total absolute size. Thus, for every value of the tuning parameter lambda there is a corresponding set of lasso coefficient estimates, and the method is closely related to best subset selection.
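The variable-selection behavior comes from the soft-thresholding operator at the heart of the lasso's coordinate-wise updates; a Python sketch (the lambda value is illustrative):

```python
def soft_threshold(b, lam):
    """Lasso's coordinate update: shrink b toward 0, and set it exactly to 0
    when |b| <= lambda -- this is how the lasso performs variable selection."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

print(soft_threshold(0.5, 0.2), soft_threshold(0.1, 0.2))  # 0.3 0.0
```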
| Lasso Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.00877745 | 1.86133 | 0.1694185 |
Fitting a lasso model, we can see that, depending on the choice of the tuning parameter, some of the coefficients are shrunk to exactly zero.
Results
The best models for classification and prediction are addressed in two separate parts.
Classification
As reflected above, the ten classification approaches (logistic regression, GAM, LDA, decision trees, QDA, bagging, random forest, KNN, neural network and support vector machine) produced different results on the training and validation data in determining how to maximize the gross margin of the marketing campaign.
Based on accuracy in determining which recipients are most likely to donate, the top three models are:
- K-Nearest Neighbors with k=7, resulting in 90.58% accuracy on the test data
- Bagging using 6 variables at each split, resulting in 89.54% accuracy on the test data
- Random forest with 3 variables at each split, resulting in 88.60% accuracy on the test data
The best classification models based on Gross Margin are as follows:
- K-Nearest Neighbors with k=9, resulting in $11,852.00
- Logistic regression (Log1), resulting in $11,642.50
- Linear Discriminant Analysis (LDA1), resulting in $11,624.50
For future data collection, some variables have proven more important than others, which may help reduce costs and/or allow replacement with other variables more beneficial to the marketing efforts. For example, the random forest determined that the number of children a donor has and whether they are a homeowner were important in determining whether they would donate. Additionally, while most of the regions did not play a large role in determining whether a donor would donate, region 2 was more important than the other three regional variables in the dataset.
Prediction
As reflected above, the six prediction models (least squares, partial least squares, principal component regression, best subset, ridge and lasso) produced different Mean Prediction Errors (MPE) and Standard Errors (SE) in predicting the amount of the donation. The top three prediction models were:
- Best subset selection, with an MPE of 1.8126 and an SE of 0.168675
- Partial least squares with 3 components, with an MPE of 1.8150 and an SE of 0.1701451
- Lasso, with an MPE of 1.86133 and an SE of 0.1694185
To maximize the value of the list of previous donors, we need to answer two questions: a) who will donate, and b) how much. Classification models, evaluated by the gross margin they produce, address the first question; prediction models, evaluated by mean prediction error, address the second. Based on the above, the best choices are a) K-Nearest Neighbors with k=7 to determine who will donate, and b) best subset selection with ten variables for the donation amount, which also has the lowest standard error, a measure of the average spread of the predictions around the actual values.
Conclusions
Utilizing the training, validation and test datasets has provided insight into the type of data collected by Charity Inc. and how it can be used to benefit the company’s marketing campaign. The exploratory data analysis described the different types of data contained within the datasets, along with their statistical summaries, for ease of reference.
Next, a variety of machine learning and regression models were produced for both classification and prediction. Most of the models were evaluated across several iterations, reporting the number of mailouts, profit, confusion matrix and accuracy. While numerous other models could also be included, along with additional iterations and transformations of variables, this version has tried to remain easily interpretable.
Based on the analysis, K-Nearest Neighbors is the best model for determining which people on the mailing list are most likely to donate, while the best subset selection model is best for predicting the donation amount. Based on the test data, the marketing campaign should focus on 354 donors who will on average donate $14.50, for a total predicted donation amount of $5,132. Through the modeling efforts contained within, Charity Inc. can be more confident in determining who should be targeted to maximize revenues while minimizing costs.
In future iterations, segmenting donors based on their projected contributions could help customize mailouts and mailout materials (and their related costs) to further maximize the charity’s marketing efforts and resulting donations.