Machine Learning: Charity Donor Analysis
Introduction
A charitable organization wishes to develop a machine learning model to improve the cost effectiveness of its direct marketing campaigns to previous donors. Recent mailing records reflect an overall 10% response rate with an average donation of $14.50; each mailing costs $2.00 to produce and send. The data consist of 3,984 training observations, 2,018 validation observations and 2,007 test observations. Weighted sampling that overrepresents responders has been used, so the training and validation samples contain approximately equal numbers of donors and nondonors. In addition, a second model should be built, using the records of donors only, to predict expected gift amounts.
Analysis
Exploratory Data Analysis
Based on R’s describe function, the initial dataset contains 22 variables and 8,009 observations with no missing data. The observations reflect prior mailing data, including whether the mailing was successful and a donation was received, the amount of that donation, and numerous other attributes that may or may not help determine who should receive future donation requests in order to maximize donations. Categorical and numerical variables, along with brief descriptions and statistical summaries, are shown below.
Categorical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| reg1 | Geographic region | 0.2 | 0.4 | 0 | 0 | 1 | 1.5 | 1.44e-01 | 1.38e-02 | 10.462 | <2e-16 *** |
| reg2 | Geographic region | 0.32 | 0.47 | 0 | 0 | 1 | 0.78 | 2.92e-01 | 1.23e-02 | 23.793 | <2e-16 *** |
| reg3 | Geographic region | 0.13 | 0.34 | 0 | 0 | 1 | 2.15 | 4.74e-03 | 1.59e-02 | 0.299 | 0.76505 |
| reg4 | Geographic region | 0.14 | 0.35 | 0 | 0 | 1 | 2.08 | 2.36e-02 | 1.54e-02 | 1.53 | 0.12609 |
| home | Homeowner; 1 = homeowner, 0 = not | 0.87 | 0.34 | 1 | 0 | 1 | 2.16 | 3.23e-01 | 1.37e-02 | 23.583 | <2e-16 *** |
| wrat | Wealth rating 0 to 9; 9 highest | 6.91 | 2.43 | 8 | 0 | 9 | 1.35 | 3.50e-02 | 1.87e-03 | 18.651 | <2e-16 *** |
| genf | Gender (0/1 indicator) | 0.61 | 0.49 | 1 | 0 | 1 | 0.43 | 1.54e-03 | 8.90e-03 | 0.173 | 0.86246 |
| donr | Donor; 1 = donor, 0 = nondonor | 0.5 | 0.5 | 0 | 0 | 1 | 0 | | | | |
Numerical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chld | Number of children | 1.72 | 1.4 | 2 | 0 | 5 | 0.27 | 1.65e-01 | 3.11e-03 | 52.918 | <2e-16 *** |
| hinc | Household income (7 categories) | 3.91 | 1.47 | 4 | 1 | 7 | 0.01 | 3.42e-01 | 1.24e-02 | 27.476 | <2e-16 *** |
| I(hinc^2) | | | | | | | | 4.24e-02 | 1.53e-03 | 27.643 | <2e-16 *** |
| avhv | Average home value in $000s | 5.14 | 0.37 | 5.13 | 3.87 | 6.57 | 0.14 | 1.35e-02 | 2.23e-02 | 0.604 | 0.5459 |
| incm | Median family income in $000s | 43.47 | 24.71 | 38 | 3 | 287 | 2.05 | 1.59e-03 | 3.74e-04 | 4.262 | 2.06e-05 *** |
| inca | Average family income in $000s | 56.43 | 24.82 | 51 | 12 | 305 | 1.94 | 3.27e-05 | 4.32e-04 | 0.076 | 0.93969 |
| plow | % categorized as “low income” | 14.23 | 13.41 | 10 | 0 | 87 | 1.36 | 1.40e-03 | 5.08e-04 | 2.76 | 0.00581 ** |
| npro | Lifetime # of promotions received to date | 60.03 | 30.35 | 58 | 2 | 164 | 0.31 | 1.46e-03 | 2.06e-04 | 7.103 | 1.36e-12 *** |
| tgif | $ value of lifetime gifts to date | 113.07 | 85.48 | 89 | 23 | 2057 | 6.55 | 1.31e-04 | 7.31e-05 | 1.786 | 0.07422 . |
| lgif | $ value of largest gift to date | 22.94 | 29.95 | 16 | 3 | 681 | 7.81 | 5.88e-05 | 2.22e-04 | 0.265 | 0.79094 |
| rgif | $ value of most recent gift | 15.66 | 12.43 | 12 | 1 | 173 | 2.63 | 3.46e-04 | 5.65e-04 | 0.612 | 0.54062 |
| tdon | # months since last donation | 18.86 | 5.78 | 18 | 5 | 40 | 1.1 | 5.27e-03 | 7.77e-04 | 6.78 | 1.32e-11 *** |
| tlag | # months between 1st and 2nd gift | 6.36 | 3.7 | 5 | 1 | 34 | 2.42 | 1.23e-02 | 1.20e-03 | 10.29 | <2e-16 *** |
| agif | Average $ value of gifts to date | 11.68 | 6.57 | 10.23 | 1.29 | 72.27 | 1.78 | 1.10e-03 | 9.78e-04 | 1.12 | 0.26285 |
| damt | Donation amount in $ | 7.21 | 7.36 | 0 | 0 | 27 | 0.12 | | | | |
Correlations
Reviewing the correlations between the variables and the donation amount (damt), some of the larger correlations are as follows:
| | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat |
|---|---|---|---|---|---|---|---|---|---|---|
| tdon | 0.0693 | 0.0180 | 0.0168 | 0.0157 | 0.0222 | 0.0162 | 0.0602 | 0.0147 | 0.0033 | 0.0215 |
| donr | 0.0308 | 0.0565 | 0.2471 | 0.1043 | 0.1263 | 0.2890 | 0.5308 | 0.0277 | 0.0173 | 0.2493 |
| damt | 0.0277 | 0.0469 | 0.2115 | 0.0837 | 0.0887 | 0.2877 | 0.5513 | 0.0453 | 0.0190 | 0.2429 |

| | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag |
|---|---|---|---|---|---|---|---|---|---|---|
| avhv | 1.0000 | 0.6939 | 0.8072 | 0.7118 | 0.0030 | 0.0178 | 0.0167 | 0.0052 | 0.0064 | 0.0061 |
| incm | 0.6939 | 1.0000 | 0.8729 | 0.6555 | 0.0171 | 0.0417 | 0.0067 | 0.0027 | 0.0139 | 0.0213 |
| inca | 0.8072 | 0.8729 | 1.0000 | 0.6346 | 0.0183 | 0.0354 | 0.0075 | 0.0054 | 0.0163 | 0.0169 |
| tgif | 0.0178 | 0.0417 | 0.0354 | 0.0176 | 0.7266 | 1.0000 | 0.1734 | 0.0736 | 0.0113 | 0.0117 |
| lgif | 0.0167 | 0.0067 | 0.0075 | 0.0059 | 0.0013 | 0.1734 | 1.0000 | 0.6961 | 0.0036 | 0.0168 |
| rgif | 0.0052 | 0.0027 | 0.0054 | 0.0136 | 0.0125 | 0.0736 | 0.6961 | 1.0000 | 0.0063 | 0.0122 |
| agif | 0.0000 | 0.0103 | 0.0002 | 0.0137 | 0.0022 | 0.0558 | 0.6096 | 0.7053 | 0.0080 | 0.0299 |

| | agif | donr | damt |
|---|---|---|---|
| chld | 0.0149 | 0.5308 | 0.5513 |
| rgif | 0.7053 | 0.0149 | 0.0851 |
| donr | 0.0095 | 1.0000 | 0.9826 |
Logistic Regression
Some of the significant variables are shown below; additional insight into which variables matter most for each model is provided in the discussion of that model.
| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.5184 | 0.0661 | 7.845 | 4.34e-15 *** |
| reg1 | 0.6449 | 0.0716 | 9.012 | <2e-16 *** |
| reg2 | 1.4807 | 0.0842 | 17.577 | <2e-16 *** |
| home | 1.3657 | 0.0836 | 16.342 | <2e-16 *** |
| chld | 2.3711 | 0.0854 | 27.779 | <2e-16 *** |
| I(hinc^2) | 1.0750 | 0.0530 | 20.283 | <2e-16 *** |
| wrat | 0.9726 | 0.0667 | 14.589 | <2e-16 *** |
| npro | 0.4552 | 0.0786 | 5.790 | 7.05e-09 *** |
| tdon | 0.2945 | 0.0603 | 4.883 | 1.04e-06 *** |
| tlag | 0.5543 | 0.0602 | 9.207 | <2e-16 *** |
Classification Modeling
Goal: Maximize the gross margin (expected revenue less the costs) on the marketing campaign
Utilizing the data, various machine learning modeling techniques were applied to determine which households should receive the marketing materials, i.e. those most likely to be donors. Each technique produces a confusion matrix that is used to determine the percentage of mailers that are successful versus those that do not ultimately result in a donation. For example, Quadratic Discriminant Analysis (QDA) produces the following confusion matrix:
| | Actual nondonor | Actual donor | Total |
|---|---|---|---|
| Predicted nondonor (not mailed) | 526 | 23 | 549 |
| Predicted donor (mailed) | 493 | 976 | 1,469 |
The matrix reflects which campaign targets are predicted to donate. The predictions will not always be correct, so some funds will be spent without a return: targeting the 1,469 households predicted to be donors yields only 976 donations. On the flip side, a cost saving of 549 mailouts is achieved by not sending the campaign to targets predicted not to donate. Accuracy measures how often the model is correct, counting both those correctly predicted not to donate and those correctly predicted to donate, and is calculated as (526 + 976) / (549 + 1,469). Thus, our accuracy is 74.43%.
The second part of the analysis deals with the size of the donation, but for now we use past experience, which reflects an average donation of $14.50 against a cost of $2.00 per mailing. The margin is therefore ($14.50 × 976) − ($2.00 × 1,469) = $11,214. It is easy to see why it pays to mail only to those most likely to donate.
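As a quick sketch of this gross-margin arithmetic (in Python rather than the R used for the analysis), profit and accuracy can be computed directly from the confusion matrix counts:

```python
# Expected campaign profit and accuracy from a 2x2 confusion matrix.
# Figures from the report: $14.50 average donation, $2.00 cost per mailing.
AVG_DONATION = 14.50
COST_PER_MAIL = 2.00

def campaign_metrics(tn, fn, fp, tp):
    """tn/fn: households not mailed; fp/tp: households mailed (predicted donors)."""
    mailed = fp + tp
    profit = AVG_DONATION * tp - COST_PER_MAIL * mailed
    accuracy = (tn + tp) / (tn + fn + fp + tp)
    return profit, accuracy

# QDA confusion matrix above: 526 correct non-mailings, 23 missed donors,
# 493 wasted mailings, 976 successful mailings.
profit, acc = campaign_metrics(526, 23, 493, 976)
print(round(profit, 2), round(acc * 100, 2))  # 11214.0 74.43
```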
A summary by each machine learning method is as follows:
Logistic Regression
Logistic regression models the probability of a binary response, for instance whether or not someone will donate to the charity, based on one or more predictor variables.
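As a minimal illustration of the logistic link (a Python sketch; the coefficient values here are hypothetical, not taken from the fitted model):

```python
import math

def logistic_prob(x, coefs, intercept):
    """P(response = 1) via the logistic function: 1 / (1 + e^(-eta))."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative only: two standardized predictors with made-up coefficients.
p = logistic_prob([1.0, 0.5], coefs=[0.64, 1.48], intercept=0.52)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```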
Two logistic models were created based on the earlier analysis: the first (Log1) uses all 21 variables, while the second (Log2) uses 9 variables selected for their significance. The accompanying plot (not reproduced here) showed profit for the logistic models.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Log1 | 1,291 | $11,642.50 | 83.74628% |
| Log2 | 1,389 | $11,533.50 | 79.48464% |
Variables Log2: chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif
Linear Discriminant Analysis (LDA)
LDA uses a Bayes classifier and a threshold on the posterior probability to assign observations to a class, plugging in estimates for several parameters, including a weighted average of the sample variances for each class. The first LDA model uses 21 variables, including a transformation of hinc; LDA2 uses a subset of the most significant variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| LDA1 | 1,329 | $11,624.50 | 82.25966% |
| LDA2 | 1,348 | $11,557.50 | 81.11992% |
Variables in LDA2: reg1, reg2, reg3, chld, home, I(hinc^2), wrat, npro, tdon, tlag
Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations in each class are drawn from a Gaussian distribution and forms predictions by plugging parameter estimates into Bayes’ theorem. However, QDA fits a quadratic decision boundary and estimates a separate covariance matrix for each class, making it more flexible than LDA, with lower bias at the cost of higher variance.
Each QDA model used a different set of variables: QDA1 uses nine variables, QDA2 all 21 variables, and QDA3 seven variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| QDA1 | 1,469 | $11,214.00 | 83.25074% |
| QDA2 | 1,372 | $11,219.50 | 77.94846% |
| QDA3 | 1,402 | $11,232.00 | 76.95738% |
Variables: QDA1: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag; QDA3: chld, home, wrat, hinc, reg2, tdon, incm
K-Nearest Neighbors (KNN)
KNN classifies an observation based on its k nearest neighbors; in weighted variants, nearer neighbors contribute more to the vote than observations further away. A small value of k gives the most flexible fit, with low bias but high variance because the prediction relies on very few observations; conversely, a larger k gives a smoother, less variable fit but may introduce bias by masking some of the structure.
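The voting logic can be sketched in a few lines (a toy Python version using the unweighted majority vote; the report's models were fit in R):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; unweighted vote)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled 0 (nondonor) and 1 (donor).
X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [5, 6], k=3))  # 1
```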
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| KNN k=1 | 1,037 | $11,306.00 | 88.15659% |
| KNN k=2 | 1,114 | $11,155.50 | 86.76908% |
| KNN k=3 | 1,101 | $11,573.00 | 90.0892% |
| KNN k=4 | 1,101 | $11,573.50 | 88.80079% |
| KNN k=5 | 1,130 | $11,718.00 | 90.03964% |
| KNN k=6 | 1,142 | $11,737.50 | 89.74232% |
| KNN k=7 | 1,142 | $11,813.50 | 90.58474% |
| KNN k=9 | 1,150 | $11,852.00 | 84.8696% |
| KNN k=10 | 1,157 | $11,765.50 | 89.39544% |
Generalized Additive Models (GAM)
GAM models go beyond linear models by allowing nonlinear functions of each variable while remaining additive, meaning that the effect of a change in one predictor on the response is independent of the values of the other predictors. One implicit assumption, then, is that interactions among marketing activities do not create synergies; otherwise an interaction term would be needed, for example when a mailing campaign is combined with radio, television or other marketing activities. GAM fits smoothing splines via backfitting, which accommodates multiple predictors by repeatedly updating the fit for each predictor in turn.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| GAM1 | 1,036 | $10,818.50 | 87.56194% |

Variables: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag
Decision Trees
Decision trees can be applied to both regression and classification problems and are commonly used for their ease of interpretability. Starting at the top of the tree, the data are split into branches based on the best split at that particular junction, without looking ahead to determine whether a different split would pay off later; this is known as a ‘greedy’ approach. At each split, the predictor space is divided to achieve the greatest possible reduction in the RSS. Because growing a tree fully on the training set is likely to produce an overly complex tree that overfits the data, one alternative is to grow the tree only as long as the decrease in the RSS exceeds a threshold.
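The greedy, RSS-minimizing split described above can be sketched as follows (a toy Python version for a single numeric predictor):

```python
def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Greedy single-predictor split: pick the cutpoint that most reduces RSS."""
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Toy example with a clean break between low and high responses.
cut, split_rss = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
print(cut, split_rss)  # 10 0.0
```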
Several decision trees were also modeled, with 15, 9 and 5 terminal nodes. The results are as follows:
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Decision tree – 15 | 1,168 | $11,149.00 | 84.83647% |
| Decision tree – 9 | 962 | $10,038.50 | 84.5887% |
| Decision tree – 5 | 1,078 | $10,212.50 | 81.61546% |

Bagging
Bagging focuses on reducing the variance of a statistical learning method such as decision trees. It does so by taking repeated bootstrap samples from the training set, fitting a tree to each (often hundreds or thousands of trees), and averaging all of the predictions, which lowers the high variance of individual trees while preserving their low bias.
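The two ingredients of bagging, bootstrap resampling and averaging (majority-voting) the resulting predictions, can be sketched as follows (Python, illustrative only):

```python
import random

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from an n-row training set."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Bagging for classification: average the trees by majority vote."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(len(sample), majority_vote([1, 1, 0]))  # 10 1
```

In practice each bootstrap sample would be used to grow a full tree; here only the resampling and voting mechanics are shown.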
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=20 | 1,037 | $11,063.00 | 88.8999% |
| mtry=10 | 1,050 | $11,167.50 | 89.14767% |
| mtry=6 | 1,063 | $11,301.00 | 89.5441% |
| mtry=5 | 1,055 | $11,143.00 | 88.80079% |
Random Forest
Random forests improve on bagging by tweaking how the splits in each decision tree are chosen: at each split, a fresh random sample of predictors is drawn from the full set, and only those candidates are considered. This decorrelates the trees, making the average of the resulting trees less variable and more reliable.
Additionally, the importance function in R provides a list of the most important variables. As shown below, this list informed the choice of the ‘mtry’ setting, which controls how many candidate variables are considered at each split, in each of the random forest models. The two most important variables were chld and home.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=5 | 1,059 | $11,149.50 | 87.0998% |
| mtry=3 | 1,065 | $11,152.00 | 88.60258% |
| mtry=2 | 1,101 | $11,341.00 | 88.50347% |

Neural Network
A neural network mimics the learning pattern of biological neural networks. It begins with a single perceptron that receives inputs, applies weights and passes the result through an activation function to produce an output; perceptrons are then layered to create a network. Hidden layers are the layers between the input and output layers, whose intermediate values are not directly observed (Portilla, Jose. 2016. A Beginner’s Guide to Neural Networks with R! Retrieved from kdnuggets.com).
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Neural network – 10 10 50 | 1,047 | $10,767.50 | 86.57086% |
| Neural network – 10 10 | 1,054 | $10,855.00 | 86.86819% |

Support Vector Machine
Support vector machines are designed for binary classification problems such as donr, including cases with nonlinear class boundaries, which are addressed by enlarging the feature space using kernels (for example, polynomial functions of the predictors). A kernel is a function that quantifies the similarity of two observations. A radial kernel is influenced primarily by nearby training observations; other kernel types include linear, polynomial and sigmoid.
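A radial (RBF) kernel can be written in a few lines (Python sketch; gamma is the usual tuning parameter controlling how quickly similarity decays with distance):

```python
import math

def radial_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2): near 1 for close
    observations, decaying toward 0 as they move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(radial_kernel([0, 0], [0, 0]))            # 1.0 (identical points)
print(round(radial_kernel([0, 0], [3, 4]), 6))  # 0.0 (distant points)
```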
| Model | Kernel | Number mailed | Profit | Accuracy |
|---|---|---|---|---|
| tune.out | linear | 816 | $8,750.00 | 81.0208127% |
| tune.out2 | linear | 1,060 | $10,422.50 | 83.6967294% |
| tune.out3 | linear | 1,056 | $10,401.50 | 83.6967294% |
| tune.out4 | polynomial | 1,068 | $9,957.00 | 80.227948% |

Prediction Modeling
Goal: Minimize the prediction error based on the donation amount
The aim is to quantify the extent to which the predicted response for a given observation is close to its true response. This is measured using the mean squared error (MSE), which is small when the predicted responses are close to the true responses.
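As a concrete sketch (Python, with made-up donation amounts):

```python
def mse(actual, predicted):
    """Mean squared error: average squared gap between truth and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy donation amounts ($) versus hypothetical model predictions.
print(mse([14.5, 10.0, 20.0], [14.0, 11.0, 19.0]))  # (0.25 + 1 + 1) / 3 = 0.75
```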
Least Squares Regression
Least squares regression measures closeness by choosing the coefficients that minimize the sum of squared residuals.
Three least squares models were fit: LS1 uses all of the variables; LS2 uses all except genf and wrat but adds the transformed hinc^2; LS3 uses the most significant factors as determined by the random forest model.
| Model | Mean Prediction Error | Standard Error |
|---|---|---|
| LS1 | 1.867523003 | 0.1696615221 |
| LS2 | 1.973121015 | 0.1720099842 |
| LS3 | 1.857983465 | 0.1699141406 |
Variables in LS3: chld, hinc, reg2, home, wrat, tdon, tgif, incm, npro, tlag, avhv, inca, plow, agif, lgif, rgif, reg1, reg4, reg3.
Principal Component Regression
Principal component regression reflects the trade-off in choosing the number of components that explain most of the variability in the data; the selected components are then fit using least squares.
The training variance explained by each number of components is shown below; for example, using half of the components (10 of 20) explains 77.67% of the variance in the predictors. The associated scree plot (not reproduced here) also reflects ‘elbows’ where the incremental variance explained drops off.
| # Components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 16.09 | 27.91 | 36.73 | 45.01 | 51.11 | 56.8 | 62.35 | 67.59 | 72.71 | 77.67 |
| damt | 0.03 | 28.46 | 28.54 | 36.52 | 46.58 | 47.08 | 48.92 | 49.23 | 49.36 | 49.57 |

| # Components | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 82.46 | 87.1 | 90.46 | 92.7 | 94.8 | 96.27 | 97.61 | 98.64 | 99.57 | 100 |
| damt | 49.77 | 51.17 | 51.98 | 51.98 | 56.49 | 56.5 | 56.58 | 56.58 | 57.18 | 57.22 |
| Model – # Components | Mean Prediction Error | Standard Error |
|---|---|---|
| PCR – 2 | 2.953948616 | 0.2225891948 |
| PCR – 3 | 2.946779495 | 0.2230407325 |
| PCR – 5 | 2.15543888 | 0.1864263297 |
| PCR – 10 | 2.079405939 | 0.1839018125 |
| PCR – 15 | 1.865496754 | 0.1698902112 |
| PCR – 20 | 1.867523003 | 0.1696615221 |
The table above reflects standardized predictors, with tenfold cross-validation error computed for each possible number of components. Based on the MPE, PCR with 15 components gives the lowest error.
Partial Least Squares (PLS) Regression
Partial Least Squares is a dimension reduction method that first identifies a new set of features that are linear combinations of the original features and then fits a linear model by least squares. Unlike PCR, it uses the response when constructing the new features, favoring directions most strongly related to the response variable.
| PLS – # of Components | Mean Prediction Error | Standard Error |
|---|---|---|
| 6 | 1.868145 | 0.1700303 |
| 3 | 1.815055 | 0.1701451 |
| 2 | 1.852769 | 0.1712954 |
Best Subset Selection with Kfold CrossValidation
Best subset selection fits a separate least squares regression to each possible combination of the predictors and, for each model size, keeps the combination with the smallest RSS; the final model size is then chosen by cross-validation.
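One reason best subset selection becomes infeasible for large numbers of predictors: with p predictors there are 2^p candidate models to fit. A quick Python check:

```python
from itertools import combinations

def subset_count(p):
    """Best subset selection fits one model per predictor combination: 2^p total."""
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(subset_count(10))  # 1024 candidate models for 10 predictors
```

At p = 20, as in this dataset, that is already over a million candidate models, which is why branch-and-bound or stepwise shortcuts are common.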
The procedure results in a ten-variable model consisting of: reg3, reg4, home, chld, hinc, incm, plow, npro, rgif, agif.
| Model – Best Subset w/ k-Fold Cross-Validation | Mean Prediction Error | Standard Error |
|---|---|---|
| 10 Variables | 1.812687159 | 0.1686750592 |
The best subset selection plot (not reproduced here) shows, in its top row, the variables included in the optimal model. Optimal models are identified using four different statistics: R-squared, adjusted R-squared, Mallows’ Cp and the Bayesian Information Criterion (BIC).
Ridge Regression
The grid of lambda values was expanded to cover a wide range of scenarios, and the variables are standardized by default. The best value of lambda is selected using cross-validation.
| Ridge Regression Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.1107589 | 1.873351 | 0.1711236 |
The Lasso
Whereas ridge regression always retains all of the predictors, the lasso overcomes this by performing variable selection: it seeks the coefficients with the smallest RSS subject to a constraint on their total absolute size. Thus, for every value of the tuning parameter lambda there is a corresponding set of lasso coefficient estimates, and the method is closely related to best subset selection.
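The variable-selection behavior comes from the soft-thresholding operator at the heart of the lasso's coordinate-wise updates; a Python sketch (the lambda value is illustrative):

```python
def soft_threshold(b, lam):
    """Lasso's coordinate update: shrink b toward 0, and set it exactly to 0
    when |b| <= lambda -- this is how the lasso performs variable selection."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

print(soft_threshold(0.5, 0.2), soft_threshold(0.1, 0.2))  # 0.3 0.0
```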
| Lasso Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.00877745 | 1.86133 | 0.1694185 |
Fitting a lasso model, we can see that, depending on the choice of the tuning parameter, some of the coefficients are shrunk to exactly zero.
Results
The best models for classification and prediction are addressed in two separate parts.
Classification
As reflected above, the ten classification approaches (logistic regression, GAM, LDA, decision trees, QDA, bagging, random forest, KNN, neural network and support vector machine) produced different results on the training and validation data in determining how to maximize the gross margin of the marketing campaign.
Based on accuracy in determining which recipients are most likely to donate, the top three models are:
- K-Nearest Neighbors with k=7, resulting in 90.58% accuracy on the test data
- Bagging using 6 variables at each split, resulting in 89.54% accuracy on the test data
- Random forest with 3 variables at each split, resulting in 88.60% accuracy on the test data
The best classification models based on Gross Margin are as follows:
- K-Nearest Neighbors with k=9, resulting in $11,852.00
- Logistic regression (Log1), resulting in $11,642.50
- Linear Discriminant Analysis (LDA1), resulting in $11,624.50
For future data collection, some variables have proven more important than others, which may help reduce costs and/or allow replacement with other variables more beneficial to the marketing efforts. For example, the random forest determined that the number of children a donor has and whether they are a homeowner were important in determining whether they would donate. Additionally, while most of the regions did not play a large role in determining whether a donor would donate, region 2 was more important than the other three regional variables in the dataset.
Prediction
As reflected above, the six prediction models (least squares, partial least squares, principal component regression, best subset, ridge and lasso) produced different Mean Prediction Errors (MPE) and Standard Errors (SE) in predicting the amount of the donation. The top three prediction models were:
- Best subset selection, with an MPE of 1.8126 and an SE of 0.168675
- Partial least squares with 3 components, with an MPE of 1.8150 and an SE of 0.1701451
- Lasso, with an MPE of 1.86133 and an SE of 0.1694185
To maximize the value of the list of previous donors, we need to answer two questions: a) who will donate, and b) how much. Classification models, evaluated by the gross margin they produce, address the first question; prediction models, evaluated by mean prediction error, address the second. Based on the above, the best choices are a) K-Nearest Neighbors with k=7 to determine who will donate, and b) best subset selection with ten variables for the donation amount, which also has the lowest standard error, a measure of the average spread of the predictions around the actual values.
Conclusions
Utilizing the training, validation and test datasets has provided insight into the type of data collected by Charity Inc. and how it can be used to benefit the company’s marketing campaign. The exploratory data analysis described the different types of data contained within the datasets, along with their statistical summaries, for ease of reference.
Next, a variety of machine learning and regression models were produced for both classification and prediction. Most of the models were evaluated across several iterations, reporting the number of mailouts, profit, confusion matrix and accuracy. While numerous other models could also be included, along with additional iterations and transformations of variables, this version has tried to remain easily interpretable.
Based on the analysis, K-Nearest Neighbors is the best model for determining which people on the mailing list are most likely to donate, while the best subset selection model is best for predicting the donation amount. Based on the test data, the marketing campaign should focus on 354 donors who will on average donate $14.50, for a total predicted donation amount of $5,132. Through the modeling efforts contained within, Charity Inc. can be more confident in determining who should be targeted to maximize revenues while minimizing costs.
In future iterations, segmenting donors based on their projected contributions could help customize mailouts and mailout materials (and their related costs) to further maximize the charity’s marketing efforts and resulting donations.