Machine Learning: Charity Donor Analysis


Introduction

A charitable organization wishes to develop a machine learning model to improve the cost effectiveness of its direct marketing campaigns to previous donors. Recent mailing records reflect an overall 10% response rate with an average donation of $14.50. The cost to produce and send each mailing is $2. The data consists of 3,984 training observations, 2,018 validation observations and 2,007 test observations. Weighted sampling has been used, over-representing the responders so that the training and validation samples have approximately equal numbers of donors and non-donors. Additionally, a prediction model for the expected gift amount should be built using the records of donors only.

Analysis

Exploratory Data Analysis

Based on R’s describe function, there are 22 variables and 8,009 observations in the initial dataset with no missing data. The observations reflect prior mailing data, including whether the mailing was successful and a donation was received, the amount of the donation, and numerous other attributes that may or may not be helpful in determining who should receive future donation requests in order to maximize donations. Categorical and numerical variables are shown below with a brief description and summary statistics, along with coefficient estimates (Estimate, Std. Error, t value, Pr(>|t|)) from a preliminary regression fit.

Categorical Variables:

Var Description Mean SD Median Min Max skew Estimate Std. Error t value Pr(>|t|)
reg1 Geographic region 0.2 0.4 0 0 1 1.5 1.44E-01 1.38E-02 10.462 <2.00E-16 ***
reg2 Geographic region 0.32 0.47 0 0 1 0.78 2.92E-01 1.23E-02 23.793 <2.00E-16 ***
reg3 Geographic region 0.13 0.34 0 0 1 2.15 4.74E-03 1.59E-02 0.299 0.76505
reg4 Geographic region 0.14 0.35 0 0 1 2.08 2.36E-02 1.54E-02 1.53 0.12609
home Homeowner; 1=homeowner 0= Not 0.87 0.34 1 0 1 -2.16 3.23E-01 1.37E-02 23.583 <2.00E-16 ***
wrat Wealth Rating 0-9; 9 highest 6.91 2.43 8 0 9 -1.35 3.50E-02 1.87E-03 18.651 <2.00E-16 ***
genf Gender; 1=F; 0=M 0.61 0.49 1 0 1 -0.43 -1.54E-03 8.90E-03 -0.173 0.86246
donr Donor; 1=Donor; 0=Non-donor 0.5 0.5 0 0 1 0

Numerical Variables:

Var Description Mean SD Median Min Max skew Estimate Std. Error t value Pr(>|t|)
chld Number of children 1.72 1.4 2 0 5 0.27 -1.65E-01 3.11E-03 -52.918 <2.00E-16 ***
hinc Household Income 7 categories 3.91 1.47 4 1 7 0.01 3.42E-01 1.24E-02 27.476 <2.00E-16 ***
I(hinc^2) -4.24E-02 1.53E-03 -27.643 <2.00E-16 ***
avhv Average Home Value in 000’s 5.14 0.37 5.13 3.87 6.57 0.14 1.35E-02 2.23E-02 0.604 0.5459
incm Median Family Income in 000’s 43.47 24.71 38 3 287 2.05 1.59E-03 3.74E-04 4.262 2.06E-05 ***
inca Average Family Income in 000’s 56.43 24.82 51 12 305 1.94 -3.27E-05 4.32E-04 -0.076 0.93969
plow % categorized as “low income” 14.23 13.41 10 0 87 1.36 -1.40E-03 5.08E-04 -2.76 0.00581 **
npro Lifetime # of promotions rec’d to date 60.03 30.35 58 2 164 0.31 1.46E-03 2.06E-04 7.103 1.36E-12 ***
tgif $ value of lifetime gifts to date 113.07 85.48 89 23 2057 6.55 1.31E-04 7.31E-05 1.786 0.07422 .
lgif $ value of largest gift to date 22.94 29.95 16 3 681 7.81 -5.88E-05 2.22E-04 -0.265 0.79094
rgif $ value of most recent gift 15.66 12.43 12 1 173 2.63 -3.46E-04 5.65E-04 -0.612 0.54062
tdon # months since last donation 18.86 5.78 18 5 40 1.1 -5.27E-03 7.77E-04 -6.78 1.32E-11 ***
tlag # of months between 1st and 2nd gift 6.36 3.7 5 1 34 2.42 -1.23E-02 1.20E-03 -10.29 <2.00E-16 ***
agif Average $ value of gifts to date 11.68 6.57 10.23 1.29 72.27 1.78 1.10E-03 9.78E-04 1.12 0.26285
damt Donation amount in $’s 7.21 7.36 0 0 27 0.12

 

Machine Learning: Number of Children

Machine Learning: Donation Times

Machine Learning: Wealth Rating Bar Chart

Machine Learning: Gender Bar Chart

Machine Learning: 5 Variables

Machine Learning: Kernel Density of AVHV

Correlations

Reviewing the correlations between the predictors and the response variables (donr and damt), the larger correlations are:

  • chld: -0.5308 with donr and -0.5513 with damt (the strongest single predictor)
  • home: 0.2890 with donr and 0.2877 with damt
  • wrat: 0.2493 with donr and 0.2429 with damt
  • reg2: 0.2471 with donr and 0.2115 with damt
  • donr and damt are themselves highly correlated (0.9826), as expected since non-donors give $0

Several predictors are also strongly correlated with one another:

  • incm and inca (0.8729), avhv and inca (0.8072), avhv and incm (0.6939); plow is negatively correlated with all three (-0.6346 to -0.7118)
  • tgif and npro (0.7266)
  • lgif, rgif and agif (pairwise correlations of 0.6096 to 0.7053)

Logistic Regression

Some of the significant variables from a preliminary logistic regression of donr are shown below; additional insight into which variables matter most is provided in the discussion of each model.

Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5184 0.0661 7.845 4.34E-15 ***
reg1 0.6449 0.0716 9.012 < 2.00E-16 ***
reg2 1.4807 0.0842 17.577 < 2.00E-16 ***
home 1.3657 0.0836 16.342 < 2.00E-16 ***
chld -2.3711 0.0854 -27.779 < 2.00E-16 ***
I(hinc^2) -1.0750 0.0530 -20.283 < 2.00E-16 ***
wrat 0.9726 0.0667 14.589 < 2.00E-16 ***
npro 0.4552 0.0786 5.790 7.05E-09 ***
tdon -0.2945 0.0603 -4.883 1.04E-06 ***
tlag -0.5543 0.0602 -9.207 < 2.00E-16 ***

Classification Modeling

Goal:  Maximize the gross margin (expected revenue less the costs) on the marketing campaign


Utilizing the data, various machine learning modelling techniques were applied to determine which households should receive the marketing materials because they are most likely to donate.  Each technique produces a confusion matrix that is used to determine the percentage of mailings that are successful versus those that do not ultimately result in a donation.  For example, in Quadratic Discriminant Analysis (QDA) the resulting confusion matrix is:

              Actual: non-donor   Actual: donor   Total
0 = no mail                 526              23     549
1 = mail                    493             976   1,469

The matrix reflects the campaign mailing targets that are predicted to donate.  Unfortunately, we won’t always be correct and will therefore expend funds without a return: targeting 1,469 households that are thought to be donors results in only 976 donations.  On the flip side, a cost savings of 549 mailings is achieved by not sending the campaign to targets who are not expected to donate, although 23 actual donors are missed.  Accuracy measures how often we correctly identify who is and who is not a donor, i.e. the non-donors who are not mailed plus the donors who are mailed, and is calculated as (526 + 976) / (549 + 1,469).  Thus, our accuracy is 74.43%.

The second part of the analysis deals with the size of each donation; for now, we use past experience, which reflects an average donation of $14.50 and a cost of $2.00 per mailing.  Thus, our margin is ($14.50 * 976) - ($2.00 * 1,469) = $11,214.  It is easy to see why it is important to mail to those households most likely to donate.
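As a minimal sketch in R, with the counts from the QDA matrix above hard-coded for illustration, the accuracy and gross-margin arithmetic looks like this:

```r
# Confusion-matrix counts from the QDA example above
# (rows = mailing decision, columns = actual donor status)
tn <- 526   # not mailed, non-donor
fn <- 23    # not mailed, donor (missed)
fp <- 493   # mailed, non-donor (wasted mailing)
tp <- 976   # mailed, donor

accuracy <- (tn + tp) / (tn + fn + fp + tp)   # 0.7443
profit   <- 14.50 * tp - 2.00 * (fp + tp)     # 14,152 - 2,938 = 11,214
```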

A summary by each machine learning method is as follows:

Logistic Regression

Logistic regression models the probability of a binary response based on one or more predictor variables.  For instance, whether someone will donate to the charity or will not donate.

Two logistic regression models were created: the first (Log1) uses all 21 variables and the second (Log2) uses nine variables selected based on their significance.  The plot reflects the logistic model evaluated in terms of profit.

Log1: number mailed 1,291; profit $11,642.50; accuracy 83.75%

            Pred: 0   Pred: 1
Actual: 0       709        18
Actual: 1       310       981

Log2: number mailed 1,389; profit $11,533.50; accuracy 79.48%

            Pred: 0   Pred: 1
Actual: 0       617        12
Actual: 1       402       987

 

Machine Learning: Logistic Regression

Variables Log2:  chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif
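A minimal sketch of how the Log2 fit and the profit-based mailing decision might look in R; the data frame names train and valid are assumptions, not the actual object names used:

```r
# Fit the nine-variable logistic model on the training data
log2 <- glm(donr ~ chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif,
            data = train, family = binomial)

# Posterior probability of donating for each validation household
post <- predict(log2, newdata = valid, type = "response")

# Mail in order of decreasing probability and stop where cumulative profit peaks
ord    <- order(post, decreasing = TRUE)
profit <- cumsum(14.5 * valid$donr[ord] - 2)   # $14.50 per donor, $2 per mailing
n.mail <- which.max(profit)                    # number of households to mail
c(n.mail = n.mail, max.profit = max(profit))
```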

Linear Discriminant Analysis (LDA)

LDA approximates the Bayes classifier by plugging estimates of several parameters, including the class means, the prior probabilities and a pooled (weighted average) variance across classes, into the posterior probability, and then assigns each observation to a class based on a threshold of that posterior.  The first LDA model uses 21 variables, including a transformation of hinc.  LDA2 uses a subset of the most significant variables.

LDA1: number mailed 1,329; profit $11,624.50; accuracy 82.26%

            Pred: 0   Pred: 1
Actual: 0       675        14
Actual: 1       344       985

LDA2: number mailed 1,348; profit $11,557.50; accuracy 81.12%

            Pred: 0   Pred: 1
Actual: 0       654        16
Actual: 1       365       983

Variables in LDA2:  reg1, reg2, reg3, chld, home, I(hinc^2), wrat, npro, tdon, tlag
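A hedged sketch of the LDA2 fit with MASS::lda (object names are illustrative):

```r
library(MASS)

# LDA with the subset of significant variables (LDA2)
lda2 <- lda(donr ~ reg1 + reg2 + reg3 + chld + home + I(hinc^2) +
              wrat + npro + tdon + tlag, data = train)

# Posterior probability of class 1 (donor) on the validation set
post.lda <- predict(lda2, newdata = valid)$posterior[, 2]
```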

Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations from each class are drawn from a Gaussian distribution and forms predictions by plugging parameter estimates into Bayes’ theorem.  However, QDA estimates a separate covariance matrix for each class, which produces a quadratic decision boundary and makes QDA more flexible than LDA, with lower bias at the cost of higher variance.

In each of the QDA models, a different set of variables was used: QDA1 uses nine variables, QDA2 uses all 21 variables and QDA3 uses seven variables.

QDA1: number mailed 1,469; profit $11,214.00; accuracy 83.25%

            Pred: 0   Pred: 1
Actual: 0       761        80
Actual: 1       258       919

QDA2: number mailed 1,372; profit $11,219.50; accuracy 77.95%

            Pred: 0   Pred: 1
Actual: 0       795       103
Actual: 1       224       896

QDA3: number mailed 1,402; profit $11,232.00; accuracy 76.96%

            Pred: 0   Pred: 1
Actual: 0       610        36
Actual: 1       434       968

Variables:  QDA1:  reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag;     QDA3:  chld, home, wrat, hinc, reg2, tdon, incm
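The QDA fits follow the same pattern as LDA, swapping in MASS::qda so each class gets its own covariance matrix (illustrative sketch):

```r
# QDA1 with nine variables; a separate covariance matrix is estimated per class
qda1 <- qda(donr ~ reg1 + reg2 + home + chld + I(hinc^2) +
              wrat + npro + tdon + tlag, data = train)

post.qda <- predict(qda1, newdata = valid)$posterior[, 2]
```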

K-Nearest Neighbors (KNN)

KNN predicts the class of an observation from the classes of its k nearest ‘neighbours’ in predictor space, with nearer neighbours contributing more to the prediction than observations further away.  A small value of k provides the most flexible fit, with low bias but high variance because the prediction relies on very few observations; conversely, a larger value of k provides a smoother, less variable fit but may introduce bias by masking some of the structure in the data.
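A sketch using the class package; x.train and x.valid are assumed to be standardized predictor matrices, since KNN is distance based:

```r
library(class)

# k = 7 nearest neighbours; prob = TRUE keeps the winning vote proportions
knn.pred <- knn(train = x.train, test = x.valid, cl = train$donr, k = 7, prob = TRUE)

# Confusion matrix on the validation set
table(Actual = valid$donr, Pred = knn.pred)
```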

KNN k=1: number mailed 1,037; profit $11,306.00; accuracy 88.16%

            Pred: 0   Pred: 1
Actual: 0       847        67
Actual: 1       172       932

KNN k=2: number mailed 1,114; profit $11,155.50; accuracy 86.77%

            Pred: 0   Pred: 1
Actual: 0       828        76
Actual: 1       191       923

KNN k=3: number mailed 1,101; profit $11,573.00; accuracy 90.09%

            Pred: 0   Pred: 1
Actual: 0       868        49
Actual: 1       151       950

KNN k=4: number mailed 1,101; profit $11,573.50; accuracy 88.80%

            Pred: 0   Pred: 1
Actual: 0       837        44
Actual: 1       182       955

KNN k=5: number mailed 1,130; profit $11,718.00; accuracy 90.04%

            Pred: 0   Pred: 1
Actual: 0       853        35
Actual: 1       166       964

KNN k=6: number mailed 1,142; profit $11,737.50; accuracy 89.74%

            Pred: 0   Pred: 1
Actual: 0       844        32
Actual: 1       175       967

KNN k=7: number mailed 1,142; profit $11,813.50; accuracy 90.58%

            Pred: 0   Pred: 1
Actual: 0       857        28
Actual: 1       162       971

KNN k=9: number mailed 1,150; profit $11,852.00; accuracy 84.87%

            Pred: 0   Pred: 1
Actual: 0       845        23
Actual: 1       174       976

KNN k=10: number mailed 1,157; profit $11,765.50; accuracy 89.40%

            Pred: 0   Pred: 1
Actual: 0       833        28
Actual: 1       186       971

Generalized Additive Models (GAM)

GAMs go beyond linear models by allowing a non-linear function of each variable while remaining additive, meaning that the effect of a change in one predictor on the response does not depend on the values of the other predictors.  One of the assumptions, therefore, is that interactions amongst the various marketing activities are not creating some type of synergy; otherwise we would need to include interaction terms, for example when a mailing campaign is combined with radio, television or other marketing activities.  The GAM here uses smoothing splines fit via backfitting, which accommodates multiple predictors by repeatedly updating the fit for each predictor while holding the others fixed.
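One possible specification using the gam package; the exact smooth terms used in the model below are not spelled out in the write-up, so this formula is an assumption:

```r
library(gam)

# Logistic GAM with smoothing splines (4 df) on the continuous predictors
gam1 <- gam(donr ~ reg1 + reg2 + home + chld + s(hinc, 4) + s(wrat, 4) +
              s(npro, 4) + s(tdon, 4) + s(tlag, 4),
            family = binomial, data = train)

post.gam <- predict(gam1, newdata = valid, type = "response")
```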

GAM1: number mailed 1,036; profit $10,818.50; accuracy 87.56%

            Pred: 0   Pred: 1
Actual: 0       880       112
Actual: 1       139       887

 

Machine Learning: Decision Trees Nodes

Variables: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag

Decision Trees

Decision trees can be applied to both regression and classification problems and are commonly used for their ease of interpretability.  Starting at the top of the tree, the predictor space is split into branches based on the best split at that particular junction, without looking ahead to determine whether a different split would lead to a better tree later on; this is considered a ‘greedy’ approach.  At each split, the split chosen is the one giving the greatest possible reduction in the RSS (or, for classification, in a measure of node impurity).  Because growing the tree on the training set alone is likely to create an overly complex tree that overfits the data, an alternative is to grow the tree only as long as the decrease in the error measure exceeds a threshold, or to grow a large tree and prune it back.

Various decision trees were also modeled, pruned to 15, 9 and 5 terminal nodes.  The results are as follows:
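A sketch of how such trees might be grown and pruned with the tree package; object names are illustrative, and ID and damt are excluded from the classification predictors:

```r
library(tree)

# Grow a full classification tree, then prune back to 9 terminal nodes
tree.full <- tree(factor(donr) ~ . - ID - damt, data = train)
tree.9    <- prune.misclass(tree.full, best = 9)

pred.tree <- predict(tree.9, newdata = valid, type = "class")
table(Actual = valid$donr, Pred = pred.tree)
```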

Decision tree, 15 terminal nodes: number mailed 1,168; profit $11,149.00; accuracy 84.84%

            Pred: 0   Pred: 1
Actual: 0       783        70
Actual: 1       236       929

Decision tree, 9 terminal nodes: number mailed 962; profit $10,038.50; accuracy 84.59%

            Pred: 0   Pred: 1
Actual: 0       882       174
Actual: 1       137       825

Decision tree, 5 terminal nodes: number mailed 1,078; profit $10,212.50; accuracy 81.62%

            Pred: 0   Pred: 1
Actual: 0       794       146
Actual: 1       225       853

 

Machine Learning: Decision Trees

Bagging

Bagging focuses on reducing the variance of a statistical learning method such as decision trees.  Individual trees have low bias but high variance; bagging reduces that variance by taking repeated bootstrap samples from the training set, growing a tree on each (often hundreds or thousands of trees), and averaging their predictions.
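Bagging is a random forest in which every predictor is a candidate at every split, so it can be fit with randomForest by setting mtry to the number of predictors (here 20, matching the first row below); a hedged sketch:

```r
library(randomForest)

# Bagged classification trees: mtry = number of predictors
bag.fit <- randomForest(factor(donr) ~ . - ID - damt, data = train,
                        mtry = 20, ntree = 500, importance = TRUE)

post.bag <- predict(bag.fit, newdata = valid, type = "prob")[, 2]
```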

Bagging, mtry = 20: number mailed 1,037; profit $11,063.00; accuracy 88.90%

            Pred: 0   Pred: 1
Actual: 0       888        93
Actual: 1       131       906

Bagging, mtry = 10: number mailed 1,050; profit $11,167.50; accuracy 89.15%

            Pred: 0   Pred: 1
Actual: 0       884        84
Actual: 1       135       915

Bagging, mtry = 6: number mailed 1,063; profit $11,301.00; accuracy 89.54%

            Pred: 0   Pred: 1
Actual: 0       882        74
Actual: 1       137       925

Bagging, mtry = 5: number mailed 1,055; profit $11,143.00; accuracy 88.80%

            Pred: 0   Pred: 1
Actual: 0       879        86
Actual: 1       140       913

 

Random Forest

Random forests improve on bagging by decorrelating the trees.  At each split, a random sample of predictors is chosen as split candidates from the full set, with a fresh sample taken at each split; this makes the average of the resulting trees less variable and more reliable.

Additionally, the importance function in R provides a ranking of the most important variables.  As shown below, this ranking informed the random forest models, with the ‘mtry’ setting controlling how many randomly chosen predictors are considered at each split; the two variables the ranking identifies as most important are chld and home.
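A sketch of a random forest fit and the importance listing (object names are illustrative):

```r
# Random forest with 2 candidate predictors sampled at each split
rf.fit <- randomForest(factor(donr) ~ . - ID - damt, data = train,
                       mtry = 2, ntree = 500, importance = TRUE)

importance(rf.fit)   # variable importance table (chld and home rank highest)
varImpPlot(rf.fit)   # plot of the same ranking
```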

Machine Learning: Random Forest

Random Forest, mtry = 5: number mailed 1,059; profit $11,149.50; accuracy 88.80%

            Pred: 0   Pred: 1
Actual: 0       878        85
Actual: 1       141       914

Random Forest, mtry = 3: number mailed 1,065; profit $11,152.00; accuracy 88.60%

            Pred: 0   Pred: 1
Actual: 0       873        84
Actual: 1       146       915

Random Forest, mtry = 2: number mailed 1,101; profit $11,341.00; accuracy 88.50%

            Pred: 0   Pred: 1
Actual: 0       852        65
Actual: 1       167       934

 

Neural Network

A neural network mimics the learning pattern of biological neural networks.  It begins with a single perceptron that receives inputs, applies weights and passes the result through an activation function to produce an output; perceptrons are then layered to create a network.  Hidden layers are the layers between the input and output layers, whose inputs and outputs are not directly observed (Portilla, Jose. 2016. A Beginner’s Guide to Neural Networks with R! Retrieved from kdnuggets.com).
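A sketch of a two-hidden-layer network (10 and 10 units) using the neuralnet package; the predictors are assumed to have been scaled beforehand, and the variable list and object names are illustrative:

```r
library(neuralnet)

# Two hidden layers of 10 units each; logistic output for classification
nn <- neuralnet(donr ~ reg1 + reg2 + home + chld + hinc + wrat + npro + tdon + tlag,
                data = train.scaled, hidden = c(10, 10), linear.output = FALSE)

# Predicted probabilities on the (scaled) validation predictors
vars    <- c("reg1", "reg2", "home", "chld", "hinc", "wrat", "npro", "tdon", "tlag")
post.nn <- compute(nn, valid.scaled[, vars])$net.result
```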

Neural Network (10, 10, 50): number mailed 1,047; profit $10,767.50; accuracy 86.57%

            Pred: 0   Pred: 1
Actual: 0       862       157
Actual: 1       114       885

Neural Network (10, 10): number mailed 1,054; profit $10,855.00; accuracy 86.87%

            Pred: 0   Pred: 1
Actual: 0       859       160
Actual: 1       105       894

 

 

Machine Learning: Neural Network 10 10

Machine Learning: Neural Network 10 10 50

Support Vector Machine

Support vector machines are intended for binary classification problems such as donr.  Non-linear class boundaries are addressed by enlarging the feature space using kernels, for example polynomial functions of the predictors.  A kernel is a function that quantifies the similarity of two observations.  A radial kernel is affected mainly by nearby training observations, but other types include linear, polynomial and sigmoid kernels.
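A sketch of tuning a linear-kernel SVM over the cost parameter with e1071 (a polynomial or radial kernel is swapped in the same way; object names are illustrative):

```r
library(e1071)

# Cross-validated grid search over cost for a linear-kernel SVM
tune.out <- tune(svm, factor(donr) ~ . - ID - damt, data = train,
                 kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 10)))

svm.best <- tune.out$best.model
pred.svm <- predict(svm.best, newdata = valid)
```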

Tune.out (kernel = linear): number mailed 816; profit $8,750.00; accuracy 81.02%

            Pred: 0   Pred: 1
Actual: 0       919       283
Actual: 1       100       716

Tune.out2 (kernel = linear): number mailed 1,060; profit $10,422.50; accuracy 83.70%

            Pred: 0   Pred: 1
Actual: 0       824       134
Actual: 1       195       865

Tune.out3 (kernel = linear): number mailed 1,056; profit $10,401.50; accuracy 83.70%

            Pred: 0   Pred: 1
Actual: 0       826       136
Actual: 1       193       863

Tune.out4 (kernel = polynomial): number mailed 1,068; profit $9,957.00; accuracy 80.23%

            Pred: 0   Pred: 1
Actual: 0       785       165
Actual: 1       234       834

 

 

Prediction Modeling

Goal:  Minimize the prediction error based on the donation amount

The goal is to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.  This is measured using the mean squared error (MSE), which will be small if the predicted responses are close to the true responses; in the tables below it is reported as the mean prediction error (MPE) along with its standard error.
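As a sketch, given predicted gift amounts pred.damt for the validation donors (an assumed object name), the reported quantities can be computed as:

```r
# Squared prediction errors on the validation donors
sq.err <- (valid.donors$damt - pred.damt)^2

mpe <- mean(sq.err)                       # mean (squared) prediction error
se  <- sd(sq.err) / sqrt(length(sq.err))  # standard error of that mean
```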

Least Squares Regression

Least squares regression measures closeness by choosing the coefficients that minimize the sum of squared residuals.

Three least squares models were fit: LS1 uses all of the variables, LS2 uses all except genf and wrat but adds the transformed hinc^2 term, and LS3 uses the most significant factors as determined by the Random Forest model.

Model Mean Prediction Error Standard Error
LS1 1.867523003 0.1696615221
LS2 1.973121015 0.1720099842
LS3 1.857983465 0.1699141406

Variables in LS3: chld, hinc, reg2, home, wrat, tdon, tgif, incm, npro, tlag, avhv, inca, plow, agif, lgif, rgif, reg1, reg4, reg3.

Principal Component Regression

Principal component regression reflects the trade-off in choosing the number of components that explain most of the variability in the data; the selected components are then fit using least squares.

The training variance explained is shown below as the percentage of variance explained by each number of components; for example, using half of the components explains 77.67% of the variance in the predictors.  The plot also reflects ‘elbows’ where the gain in explained variance levels off:

Components:   1      2      3      4      5      6      7      8      9      10
X:            16.09  27.91  36.73  45.01  51.11  56.80  62.35  67.59  72.71  77.67
damt:          0.03  28.46  28.54  36.52  46.58  47.08  48.92  49.23  49.36  49.57

Components:   11     12     13     14     15     16     17     18     19     20
X:            82.46  87.10  90.46  92.70  94.80  96.27  97.61  98.64  99.57  100.00
damt:         49.77  51.17  51.98  51.98  56.49  56.50  56.58  56.58  57.18  57.22

 

Model – # Components Mean Prediction Error Standard Error
PCR – 2 2.953948616 0.2225891948
PCR – 3 2.946779495 0.2230407325
PCR – 5 2.15543888 0.1864263297
PCR – 10 2.079405939 0.1839018125
PCR – 15 1.865496754 0.1698902112
PCR – 20 1.867523003 0.1696615221

 

Machine Learning: Principal Components Regression

The table above reflects standardized predictors, with the ten-fold cross-validation error computed for each possible number of components.

Based on the MPE, a PCR with 15 components results in the lowest MPE.
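A sketch of the PCR fit with the pls package, standardizing the predictors and using ten-fold cross-validation, then predicting with 15 components (data frame names are assumptions):

```r
library(pls)

# Principal component regression on the donor records only
pcr.fit <- pcr(damt ~ ., data = train.donors, scale = TRUE,
               validation = "CV", segments = 10)

validationplot(pcr.fit, val.type = "MSEP")                  # CV error vs. components
pred.pcr <- predict(pcr.fit, newdata = valid.donors, ncomp = 15)
```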

Partial Least Squares (PLS) Regression

Partial least squares is a dimension reduction method that first identifies a new set of features that are linear combinations of the original features and then fits a linear model to those features using least squares.  Unlike PCR, it constructs the new features with an eye toward the predictors that are most strongly related to the response variable.

PLS – # of Components Mean Prediction Error Standard Error
  6 1.868145 0.1700303
  3 1.815055 0.1701451
  2 1.852769 0.1712954
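The PLS fit differs from the PCR call only in the function used (again a sketch with assumed data frame names):

```r
# Partial least squares with cross-validation
pls.fit  <- plsr(damt ~ ., data = train.donors, scale = TRUE, validation = "CV")
pred.pls <- predict(pls.fit, newdata = valid.donors, ncomp = 3)
```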

Best Subset Selection with K-fold Cross-Validation

Best subset selection fits a separate least squares regression to each possible combination of the predictors, choosing the best model of each size by the smallest RSS; the final model size is then chosen here by k-fold cross-validation.

This results in a ten-variable model consisting of: reg3, reg4, home, chld, hinc, incm, plow, npro, rgif, agif.
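A sketch with leaps::regsubsets; predicting from a regsubsets object needs a small helper, which is also what the k-fold cross-validation loop would use (object names are illustrative):

```r
library(leaps)

best.fit <- regsubsets(damt ~ ., data = train.donors, nvmax = 20)

# Helper to predict from a regsubsets fit for a given model size
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  drop(mat[, names(coefi)] %*% coefi)
}

pred.bs <- predict.regsubsets(best.fit, valid.donors, id = 10)   # ten-variable model
```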

Model – Best Subset w/ k-Fold Cross Validation Mean Prediction Error Standard Error
10 Variables 1.812687159 0.1686750592

The best subset selection plot, shown below, reflects the optimal model in its top row along with the variables it includes.  The optimal models are identified using four different statistical criteria: R-squared, adjusted R-squared, Mallows’ Cp and the Bayesian Information Criterion (BIC).

Machine Learning: Best Subset Selection

 

Ridge Regression

The grid of λ values was expanded to cover a whole range of scenarios.  The variables are standardized by default, and the best value of λ is chosen using cross-validation.

Ridge Regression Model Mean Prediction Error Standard Error
Best λ = 0.1107589 1.873351 0.1711236

Machine Learning: Ridge Regression
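A sketch of the ridge fit with glmnet (alpha = 0); the object names and the exact grid are illustrative:

```r
library(glmnet)

x    <- model.matrix(damt ~ ., data = train.donors)[, -1]
y    <- train.donors$damt
grid <- 10^seq(10, -2, length = 100)   # wide grid of lambda values

cv.ridge <- cv.glmnet(x, y, alpha = 0, lambda = grid)   # ridge: alpha = 0
best.lam <- cv.ridge$lambda.min

x.valid    <- model.matrix(damt ~ ., data = valid.donors)[, -1]
pred.ridge <- predict(cv.ridge, s = best.lam, newx = x.valid)
```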

The Lasso

Ridge regression will always build a model with all of the predictors; the lasso overcomes this by performing variable selection, minimizing the RSS subject to a constraint on the coefficients.  Thus, for every value of λ there is a set of lasso coefficient estimates, and the method is closely related to best subset selection.

Lasso Model Mean Prediction Error Standard Error
Best λ = 0.00877745 1.86133 0.1694185

Fitting a lasso model, we can see that depending on the choice of the tuning parameter, some of the coefficients will be exactly zero.
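The lasso uses the same glmnet call with alpha = 1; the coefficients at the chosen λ show which predictors have been driven exactly to zero (sketch, reusing x and y from the ridge example):

```r
cv.lasso <- cv.glmnet(x, y, alpha = 1)        # lasso: alpha = 1
coef(cv.lasso, s = cv.lasso$lambda.min)       # some coefficients are exactly zero
```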

Machine Learning: Lasso

Results

The best models for classification and prediction are addressed in two separate parts.

Classification

As reflected above in the ten classification models (Logistic Regression, LDA, QDA, KNN, GAM, Decision Trees, Bagging, Random Forest, Neural Network and Support Vector Machine), the results on the training and test/validation data differ in how well they maximize the gross margin on the marketing campaign.

Based on accuracy in determining which households are most likely to donate, the top three models are:

  • K-Nearest Neighbors with k=7 resulting in 90.58% accuracy on the test data
  • Bagging using 6 variables at each split resulting in 89.54% accuracy on the test data
  • Random Forest with 5 variables at each split resulting in 88.80% accuracy on the test data

The best classification models based on Gross Margin are as follows:

  • K-Nearest Neighbors with k= 9 resulting in $11,852.00
  • Logistic regression (Log1) resulting in $11,642.50
  • Linear Discriminant Analysis resulting in $11,624.50

For future data collection, some variables have proven to be more important than others, which may help reduce costs and/or allow them to be replaced with other variables that may be more beneficial to the marketing efforts.  For example, Random Forest determined that the number of children a donor has and whether they are a homeowner were important in determining if they would donate.  Additionally, while most of the regions did not play a large role in determining whether a household would donate, region 2 was more important than the other three regional variables included in the data set.

Prediction

As reflected above in the six prediction models (Least Squares, Principal Component Regression, Partial Least Squares, Best Subset, Ridge and Lasso), the mean prediction error (MPE) and standard error (SE) on the different data sets provided different results in predicting the amount of the donation.  The top three prediction models were:

  • Best Subset with a MPE of 1.8126 and a SE of .168675
  • Partial Least Squares with 3 components, with a MPE of 1.8150 and a SE of .1701451
  • Lasso with a MPE of 1.86133 and a SE of .1694185

In order to maximize the value of the list of people who have donated to the charity, we need to determine two parts of the equation: a) who will donate and b) how much.  This is accomplished by applying various models, evaluating who will donate through the gross margin of each classification model and the amount of the donation through the mean prediction error.  Based on the above, the best models to use are a) K-Nearest Neighbors with k=7 to determine who will donate and b) Best Subset Selection with ten variables for the donation amount, which also has the lowest standard error, a measure of the average difference between the predictions and the actual values.

 

Conclusions

Utilizing the various training, validation and test datasets has provided insight into the type of data that is collected by Charity Inc. and how it can be utilized to benefit the company’s marketing campaign.  The exploratory data analysis reflected the different types of data that were contained within the datasets along with their statistical information and plots for ease of reference.

Next, a variety of machine learning and traditional regression models were produced for both classification and prediction.  Most of the models are detailed over several iterations along with the number of mailings, profit, confusion matrix and accuracy.  While numerous other models could also be included, along with additional iterations and transformations of variables, this version tries to remain easy to interpret.

Based on the analysis, the K-Nearest Neighbors model is best for determining which people on the mailing list are most likely to donate, while the Best Subset Selection model is best for predicting the donation amount.  Based on the test data, the marketing campaign should be focused on 354 donors who will on average donate $14.50, for a total predicted donation amount of $5,132.  Additionally, through the modelling efforts contained within, Charity Inc. can be more confident in determining who should be targeted to maximize revenues while minimizing costs.

In future iterations, segmenting donors based on their projected contributions could help customize the mailings and mailing materials (and related costs) to further maximize the charity’s marketing efforts and resulting donations.
