# Machine Learning:  Charity Donor Analysis

## Introduction

A charitable organization wishes to develop a machine learning model to improve the cost effectiveness of its direct marketing campaigns to previous donors. Recent mailing records reflect an overall 10% response rate with an average donation of \$14.50, and the cost to produce and send each mailing is \$2. The data consists of 3,984 training observations, 2,018 validation observations and 2,007 test observations. Weighted sampling has been used to over-represent responders, so the training and validation samples contain approximately equal numbers of donors and non-donors. In addition, a second model should be built, using the records of donors only, to predict the expected gift amounts.

## Analysis

### Exploratory Data Analysis

Based on R’s `describe` function, there are 22 variables and 8,009 observations in the initial dataset, with no missing data. The observations reflect prior mailing data, including whether the mailing was successful and a donation was received, the amount of the donation, and numerous other attributes that may or may not help determine who should receive future donation requests in order to maximize donations. The categorical and numerical variables, along with brief descriptions and summary statistics, are shown below.

Categorical Variables:

| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|------|------------------------------------|------|------|--------|-----|-----|-------|-----------|------------|---------|----------------|
| reg1 | Geographic region | 0.2 | 0.4 | 0 | 0 | 1 | 1.5 | 1.44E-01 | 1.38E-02 | 10.462 | <2.00E-16 *** |
| reg2 | Geographic region | 0.32 | 0.47 | 0 | 0 | 1 | 0.78 | 2.92E-01 | 1.23E-02 | 23.793 | <2.00E-16 *** |
| reg3 | Geographic region | 0.13 | 0.34 | 0 | 0 | 1 | 2.15 | 4.74E-03 | 1.59E-02 | 0.299 | 0.76505 |
| reg4 | Geographic region | 0.14 | 0.35 | 0 | 0 | 1 | 2.08 | 2.36E-02 | 1.54E-02 | 1.53 | 0.12609 |
| home | Homeowner; 1 = homeowner, 0 = not | 0.87 | 0.34 | 1 | 0 | 1 | -2.16 | 3.23E-01 | 1.37E-02 | 23.583 | <2.00E-16 *** |
| wrat | Wealth rating 0–9; 9 highest | 6.91 | 2.43 | 8 | 0 | 9 | -1.35 | 3.50E-02 | 1.87E-03 | 18.651 | <2.00E-16 *** |
| genf | Gender; 1 = F, 0 = M | 0.61 | 0.49 | 1 | 0 | 1 | -0.43 | -1.54E-03 | 8.90E-03 | -0.173 | 0.86246 |
| donr | Donor; 1 = donor, 0 = non-donor | 0.5 | 0.5 | 0 | 0 | 1 | 0 | — | — | — | — |

(donr is the classification response, so no regression estimate is reported for it.)

Numerical Variables:

| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-----------|----------------------------------------|--------|-------|--------|------|-------|------|-----------|------------|---------|----------------|
| chld | Number of children | 1.72 | 1.4 | 2 | 0 | 5 | 0.27 | -1.65E-01 | 3.11E-03 | -52.918 | <2.00E-16 *** |
| hinc | Household income, 7 categories | 3.91 | 1.47 | 4 | 1 | 7 | 0.01 | 3.42E-01 | 1.24E-02 | 27.476 | <2.00E-16 *** |
| I(hinc^2) | Squared household income term | — | — | — | — | — | — | -4.24E-02 | 1.53E-03 | -27.643 | <2.00E-16 *** |
| avhv | Average home value in \$000s | 5.14 | 0.37 | 5.13 | 3.87 | 6.57 | 0.14 | 1.35E-02 | 2.23E-02 | 0.604 | 0.5459 |
| incm | Median family income in \$000s | 43.47 | 24.71 | 38 | 3 | 287 | 2.05 | 1.59E-03 | 3.74E-04 | 4.262 | 2.06E-05 *** |
| inca | Average family income in \$000s | 56.43 | 24.82 | 51 | 12 | 305 | 1.94 | -3.27E-05 | 4.32E-04 | -0.076 | 0.93969 |
| plow | % categorized as “low income” | 14.23 | 13.41 | 10 | 0 | 87 | 1.36 | -1.40E-03 | 5.08E-04 | -2.76 | 0.00581 ** |
| npro | Lifetime # of promotions rec’d to date | 60.03 | 30.35 | 58 | 2 | 164 | 0.31 | 1.46E-03 | 2.06E-04 | 7.103 | 1.36E-12 *** |
| tgif | \$ value of lifetime gifts to date | 113.07 | 85.48 | 89 | 23 | 2057 | 6.55 | 1.31E-04 | 7.31E-05 | 1.786 | 0.07422 . |
| lgif | \$ value of largest gift to date | 22.94 | 29.95 | 16 | 3 | 681 | 7.81 | -5.88E-05 | 2.22E-04 | -0.265 | 0.79094 |
| rgif | \$ value of most recent gift | 15.66 | 12.43 | 12 | 1 | 173 | 2.63 | -3.46E-04 | 5.65E-04 | -0.612 | 0.54062 |
| tdon | # months since last donation | 18.86 | 5.78 | 18 | 5 | 40 | 1.1 | -5.27E-03 | 7.77E-04 | -6.78 | 1.32E-11 *** |
| tlag | # months between 1st and 2nd gift | 6.36 | 3.7 | 5 | 1 | 34 | 2.42 | -1.23E-02 | 1.20E-03 | -10.29 | <2.00E-16 *** |
| agif | Average \$ value of gifts to date | 11.68 | 6.57 | 10.23 | 1.29 | 72.27 | 1.78 | 1.10E-03 | 9.78E-04 | 1.12 | 0.26285 |
| damt | Donation amount in \$ | 7.21 | 7.36 | 0 | 0 | 27 | 0.12 | — | — | — | — |

(damt is the prediction response, so no regression estimate is reported for it.)

*Figure: Number of Children*

*Figure: Donation Times*

*Figure: Wealth Rating bar chart*

*Figure: Gender bar chart*

*Figure: 5 variables*

*Figure: Kernel density of avhv*

## Correlations

Reviewing the correlations between the variables and the responses donr and damt, some of the larger correlations are as follows:

| | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat |
|------|--------|---------|---------|---------|---------|--------|---------|--------|---------|---------|
| tdon | 0.0693 | -0.0180 | -0.0168 | -0.0157 | 0.0222 | 0.0162 | 0.0602 | 0.0147 | -0.0033 | -0.0215 |
| donr | 0.0308 | 0.0565 | 0.2471 | -0.1043 | -0.1263 | 0.2890 | -0.5308 | 0.0277 | -0.0173 | 0.2493 |
| damt | 0.0277 | 0.0469 | 0.2115 | -0.0837 | -0.0887 | 0.2877 | -0.5513 | 0.0453 | -0.0190 | 0.2429 |

| | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag |
|------|---------|--------|---------|---------|---------|--------|---------|---------|---------|---------|
| avhv | 1.0000 | 0.6939 | 0.8072 | -0.7118 | 0.0030 | 0.0178 | -0.0167 | -0.0052 | -0.0064 | -0.0061 |
| incm | 0.6939 | 1.0000 | 0.8729 | -0.6555 | 0.0171 | 0.0417 | 0.0067 | 0.0027 | -0.0139 | -0.0213 |
| inca | 0.8072 | 0.8729 | 1.0000 | -0.6346 | 0.0183 | 0.0354 | -0.0075 | -0.0054 | -0.0163 | -0.0169 |
| tgif | 0.0178 | 0.0417 | 0.0354 | -0.0176 | 0.7266 | 1.0000 | 0.1734 | 0.0736 | -0.0113 | -0.0117 |
| lgif | -0.0167 | 0.0067 | -0.0075 | 0.0059 | -0.0013 | 0.1734 | 1.0000 | 0.6961 | 0.0036 | 0.0168 |
| rgif | -0.0052 | 0.0027 | -0.0054 | -0.0136 | -0.0125 | 0.0736 | 0.6961 | 1.0000 | 0.0063 | 0.0122 |
| agif | 0.0000 | 0.0103 | -0.0002 | -0.0137 | 0.0022 | 0.0558 | 0.6096 | 0.7053 | 0.0080 | 0.0299 |

| | agif | donr | damt |
|------|--------|---------|---------|
| chld | 0.0149 | -0.5308 | -0.5513 |
| rgif | 0.7053 | 0.0149 | 0.0851 |
| donr | 0.0095 | 1.0000 | 0.9826 |

## Logistic Regression

Some of the significant variables are shown below; additional insight into which variables matter most for each model is provided in the discussion of that model.

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|-------------|---------|------------|---------|----------------|
| (Intercept) | 0.5184 | 0.0661 | 7.845 | 4.34E-15 *** |
| reg1 | 0.6449 | 0.0716 | 9.012 | <2.00E-16 *** |
| reg2 | 1.4807 | 0.0842 | 17.577 | <2.00E-16 *** |
| home | 1.3657 | 0.0836 | 16.342 | <2.00E-16 *** |
| chld | -2.3711 | 0.0854 | -27.779 | <2.00E-16 *** |
| I(hinc^2) | -1.0750 | 0.0530 | -20.283 | <2.00E-16 *** |
| wrat | 0.9726 | 0.0667 | 14.589 | <2.00E-16 *** |
| npro | 0.4552 | 0.0786 | 5.790 | 7.05E-09 *** |
| tdon | -0.2945 | 0.0603 | -4.883 | 1.04E-06 *** |
| tlag | -0.5543 | 0.0602 | -9.207 | <2.00E-16 *** |

# Classification Modeling

Goal:  Maximize the gross margin (expected revenue less the costs) on the marketing campaign


Utilizing the data, various machine learning techniques were applied to determine which households, being the most likely donors, should receive the marketing materials. Each technique produces a confusion matrix that shows the proportion of mailings that succeed versus those that do not ultimately result in a donation. For example, Quadratic Discriminant Analysis (QDA) produces the following confusion matrix:

| | Actual: non-donor | Actual: donor | Total |
|--------------------|-------------------|---------------|-------|
| Pred: 0 (no mail) | 526 | 23 | 549 |
| Pred: 1 (mail) | 493 | 976 | 1,469 |

The matrix identifies which campaign targets are predicted to donate. Predictions will not always be correct, so some funds will be spent without a return: mailing the 1,469 households predicted to be donors yields only 976 donations. On the flip side, withholding the mailing from 549 households saves costs, although 23 of those households would in fact have donated. Accuracy measures how often the model is right in both directions, counting both the correctly skipped non-donors and the correctly mailed donors: (526 + 976) / (549 + 1,469) = 74.43%.

The second part of the project deals with the size of each donation; for now, we use past experience, which reflects an average donation of \$14.50 and a cost of \$2.00 per mailing. The margin is therefore (\$14.50 × 976) – (\$2.00 × 1,469) = \$11,214. It is easy to see why it pays to mail only to those donors who are most likely to give.
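The accuracy and margin arithmetic above can be captured in a small helper. This is an illustrative Python sketch (the report's analysis itself was done in R); the function name and signature are ours, not from the report:

```python
# Illustrative sketch: campaign accuracy and gross margin from a
# mail/no-mail confusion matrix, using the QDA counts quoted above.

def campaign_metrics(tn, fn, fp, tp, avg_gift=14.50, cost=2.00):
    """Return (accuracy, gross_margin).

    tn: predicted non-donor, actual non-donor (correctly skipped)
    fn: predicted non-donor, actual donor     (missed donation)
    fp: predicted donor, actual non-donor     (wasted mailing)
    tp: predicted donor, actual donor         (successful mailing)
    """
    total = tn + fn + fp + tp
    accuracy = (tn + tp) / total          # correct in both directions
    mailed = fp + tp                      # every predicted donor gets mail
    margin = avg_gift * tp - cost * mailed
    return accuracy, margin

acc, margin = campaign_metrics(tn=526, fn=23, fp=493, tp=976)
print(round(acc * 100, 2), margin)  # 74.43 11214.0
```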

A summary by each machine learning method is as follows:

## Logistic Regression

Logistic regression models the probability of a binary response based on one or more predictor variables.  For instance, whether someone will donate to the charity or will not donate.
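As a minimal sketch of how the fitted model turns a linear predictor into a donation probability (Python here for illustration, not the report's R code), using the intercept, reg2 and chld estimates from the coefficient table above:

```python
import math

# Sketch of the logistic link: a linear predictor eta = b0 + b1*x1 + ...
# is mapped into a probability in (0, 1) via the sigmoid function.

def predict_proba(coefs, x, intercept=0.0):
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Fitted estimates from the table above: intercept 0.5184,
# reg2 = 1.4807, chld = -2.3711. A region-2 household with no children:
p = predict_proba([1.4807, -2.3711], [1, 0], intercept=0.5184)
print(round(p, 3))  # high donation probability (~0.88)
```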

Two logistic models were created based on the earlier analysis: the first uses 21 variables, and the second uses 9 variables selected based on their significance. The plot reflects profit as a function of the number mailed.

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|-------|---------------|-------------|----------------------------|----------------------------|----------|
| Log1 | 1,291 | \$11,642.50 | 709 / 18 | 310 / 981 | 83.75% |
| Log2 | 1,389 | \$11,533.50 | 617 / 12 | 402 / 987 | 79.48% |

*Figure: Logistic Regression*

Variables Log2:  chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif

## Linear Discriminant Analysis (LDA)

LDA uses a Bayes classifier with a threshold on the posterior probability to assign observations to a class, plugging in estimates for several parameters, including the class means, the weighted average of the sample variances for each class, and the class proportions. The first LDA model uses 21 variables, including a transformation of hinc; LDA2 uses a subset of the most significant variables.
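As a toy sketch of the mechanics (Python rather than the report's R, and reduced to the one-dimensional, two-class case with a pooled variance; all numbers below are made up for illustration):

```python
import math

# Two-class, 1-D LDA: assign x to the class with the larger linear
# discriminant score delta_k(x) = x*mu_k/s2 - mu_k^2/(2*s2) + log(pi_k),
# where s2 is the pooled (shared) variance and pi_k the class prior.

def lda_scores(x, mus, priors, pooled_var):
    return [x * m / pooled_var - m * m / (2 * pooled_var) + math.log(p)
            for m, p in zip(mus, priors)]

# toy numbers: non-donors centred at 0, donors at 2, equal priors
scores = lda_scores(x=1.5, mus=[0.0, 2.0], priors=[0.5, 0.5], pooled_var=1.0)
pred = scores.index(max(scores))
print(pred)  # 1 -> classified as donor
```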

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|-------|---------------|-------------|----------------------------|----------------------------|----------|
| LDA1 | 1,329 | \$11,624.50 | 675 / 14 | 344 / 985 | 82.26% |
| LDA2 | 1,348 | \$11,557.50 | 654 / 16 | 365 / 983 | 81.12% |

Variables in LDA2:  reg1, reg2, reg3, chld, home, I(hinc^2), wrat, npro, tdon, tlag

## Quadratic Discriminant Analysis (QDA)

QDA is similar to LDA in that it assumes the observations in each class are drawn from a Gaussian distribution and forms predictions by plugging parameter estimates into Bayes’ theorem. However, QDA fits a quadratic decision boundary and estimates a separate covariance matrix for each class, making it more flexible than LDA, with lower bias at the cost of higher variance.

Each QDA model uses a different set of variables: QDA1 uses nine variables, QDA2 uses all 21 variables and QDA3 uses seven variables.

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|-------|---------------|-------------|----------------------------|----------------------------|----------|
| QDA1 | 1,469 | \$11,214.00 | 761 / 80 | 258 / 919 | 83.25% |
| QDA2 | 1,372 | \$11,219.50 | 795 / 103 | 224 / 896 | 77.95% |
| QDA3 | 1,402 | \$11,232.00 | 610 / 36 | 434 / 968 | 76.96% |

Variables in QDA1: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag

Variables in QDA3: chld, home, wrat, hinc, reg2, tdon, incm

## K-Nearest Neighbors (KNN)

KNN classifies an observation by the majority class among its k nearest training observations (in a distance-weighted variant, nearer neighbors contribute more to the vote than those further away). A small value of k gives the most flexible fit, with low bias but high variance, because the prediction can hinge on a single observation; conversely, a larger k gives a smoother, less variable fit but may introduce bias by masking some of the structure in the data.
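A from-scratch sketch of the unweighted majority vote on toy points (Python for illustration; the report's KNN models were fit in R on the standardized charity predictors):

```python
# Unweighted k-nearest-neighbour classification by majority vote.

def knn_predict(train, labels, x, k):
    # indices of training points sorted by squared Euclidean distance to x
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)  # majority class among k nearest

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train, labels, (4.5, 5.0), k=3))  # 1
```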

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|-----------|---------------|-------------|----------------------------|----------------------------|----------|
| KNN k=1 | 1,037 | \$11,306.00 | 847 / 67 | 172 / 932 | 88.16% |
| KNN k=2 | 1,114 | \$11,155.50 | 828 / 76 | 191 / 923 | 86.77% |
| KNN k=3 | 1,101 | \$11,573.00 | 868 / 49 | 151 / 950 | 90.09% |
| KNN k=4 | 1,101 | \$11,573.50 | 837 / 44 | 182 / 955 | 88.80% |
| KNN k=5 | 1,130 | \$11,718.00 | 853 / 35 | 166 / 964 | 90.04% |
| KNN k=6 | 1,142 | \$11,737.50 | 844 / 32 | 175 / 967 | 89.74% |
| KNN k=7 | 1,142 | \$11,813.50 | 857 / 28 | 162 / 971 | 90.58% |
| KNN k=9 | 1,150 | \$11,852.00 | 845 / 23 | 174 / 976 | 90.24% |
| KNN k=10 | 1,157 | \$11,765.50 | 833 / 28 | 186 / 971 | 89.40% |

## Generalized Additive Model (GAM)

GAM models go beyond linear models by allowing a non-linear function of each variable while remaining additive, meaning the effect of a change in one predictor on the response does not depend on the values of the other predictors. One implicit assumption is therefore that marketing activities do not interact to create synergies; if a mailing campaign were combined with radio, television or other marketing activities, an interaction term would need to be considered. The GAM here uses smoothing splines fit via backfitting, which accommodates multiple predictors by repeatedly updating the fit for each predictor in turn.

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|-------|---------------|-------------|----------------------------|----------------------------|----------|
| GAM 1 | 1,036 | \$10,818.50 | 880 / 112 | 139 / 887 | 87.56% |

*Figure: Decision Tree Nodes*

Variables: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag

## Decision Trees

Decision trees can be applied to both regression and classification problems and are commonly used for their ease of interpretation. Starting at the top of the tree, each split is chosen as the best split at that particular junction, without looking ahead to determine whether a different split would produce a better tree further down; this is known as a ‘greedy’ approach. At each split, the predictor space is divided to achieve the greatest possible reduction in the RSS. Because growing a tree on the training set alone is likely to produce an overly complex tree that overfits the data, one alternative is to continue splitting only while the decrease in the RSS exceeds a threshold.
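The greedy split search described above can be sketched for a single numeric predictor (an illustrative Python sketch, not the report's R tree code): try every cut point and keep the one with the greatest RSS reduction.

```python
# Greedy search for the single best split of one numeric predictor x
# against a numeric response y, maximizing the reduction in RSS.

def best_split(x, y):
    def rss(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    base = rss(y)             # RSS with no split at all
    best_cut, best_gain = None, 0.0
    for cut in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        gain = base - rss(left) - rss(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# toy data: low gifts for x <= 3, high gifts for x >= 10
cut, gain = best_split([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
print(cut)  # 3 -- the cut separating the two groups
```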

Various decision trees were also modeled, with 15, 9 and 5 terminal nodes. The results are as follows:

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|--------------------|---------------|-------------|----------------------------|----------------------------|----------|
| Decision tree – 15 | 1,168 | \$11,149.00 | 783 / 70 | 236 / 929 | 84.84% |
| Decision tree – 9 | 962 | \$10,038.50 | 882 / 174 | 137 / 825 | 84.59% |
| Decision tree – 5 | 1,078 | \$10,212.50 | 794 / 146 | 225 / 853 | 81.62% |

*Figure: Decision Trees*

## Bagging

Bagging focuses on reducing the variance of a statistical learning method such as decision trees. Because individual trees have high variance but low bias, bagging combines hundreds or thousands of them into one procedure: repeated bootstrap samples are taken from the training set, a model is fit on each, and all of the predictions are averaged.
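The bootstrap-and-average idea can be sketched in a few lines (Python for illustration; to stay self-contained, the "model" here is simply the sample mean rather than a tree):

```python
import random

# Bagging sketch: average the predictions of models fit on bootstrap
# resamples of the data. Averaging many resample fits reduces variance.

def bagged_mean(y, n_boot=500, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_boot):
        sample = [rng.choice(y) for _ in y]   # resample with replacement
        preds.append(sum(sample) / len(sample))  # fit the "model" (a mean)
    return sum(preds) / len(preds)            # average all predictions

est = bagged_mean([10, 12, 14, 15, 20, 9])   # close to the plain mean ~13.3
```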

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|---------|---------------|-------------|----------------------------|----------------------------|----------|
| mtry=20 | 1,037 | \$11,063.00 | 888 / 93 | 131 / 906 | 88.90% |
| mtry=10 | 1,050 | \$11,167.50 | 884 / 84 | 135 / 915 | 89.15% |
| mtry=6 | 1,063 | \$11,301.00 | 882 / 74 | 137 / 925 | 89.54% |
| mtry=5 | 1,055 | \$11,143.00 | 879 / 86 | 140 / 913 | 88.80% |

## Random Forest

Random forests improve on bagging by tweaking how the trees are split: at each split, a fresh random sample of predictors is drawn from the full set, and only those candidates are considered. This decorrelates the trees, making their average less variable and more reliable.

Additionally, the `importance()` function in R provides a list of the most important variables. As shown below, this list guided the choice of variables in each of the random forest models, as reflected in the `mtry` selection. For instance, mtry=2 uses the two best variables, chld and home.

*Figure: Random Forest variable importance*

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|--------|---------------|-------------|----------------------------|----------------------------|----------|
| mtry=5 | 1,059 | \$11,149.50 | 878 / 85 | 141 / 914 | 88.80% |
| mtry=3 | 1,065 | \$11,152.00 | 873 / 84 | 146 / 915 | 88.60% |
| mtry=2 | 1,101 | \$11,341.00 | 852 / 65 | 167 / 934 | 88.50% |

## Neural Network

A neural network mimics the learning pattern of biological neural networks. It begins with a single perceptron that receives inputs, applies weights, and passes the result through an activation function to produce an output; perceptrons are then layered to create a network. Hidden layers are the layers between the input and output layers, whose inputs and outputs are not directly observed (Portilla, Jose. 2016. A Beginner’s Guide to Neural Networks with R! Retrieved from kdnuggets.com).
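The single-perceptron building block can be sketched as follows (Python for illustration; the weights and inputs below are arbitrary, not fitted values from the report's models):

```python
import math

# One perceptron: weighted inputs plus a bias, passed through a
# sigmoid activation to produce an output in (0, 1).

def perceptron(inputs, weights, bias):
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

out = perceptron([1.0, 0.0, 3.0], weights=[0.4, -0.2, 0.1], bias=-0.5)
```

Stacking layers of such units, with the outputs of one layer feeding the next, gives the hidden-layer architectures (e.g. 10-10 and 10-10-50) used below.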

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|------------------------------------|---------------|-------------|----------------------------|----------------------------|----------|
| Neural Network (layers 10, 10, 50) | 1,047 | \$10,767.50 | 862 / 157 | 114 / 885 | 86.57% |
| Neural Network (layers 10, 10) | 1,054 | \$10,855.00 | 859 / 160 | 105 / 894 | 86.87% |

*Figure: Neural Network (10, 10)*

*Figure: Neural Network (10, 10, 50)*

## Support Vector Machine

Support vector machines are designed for binary classification problems such as donr, including cases with non-linear class boundaries, which they address by enlarging the feature space using kernels (for example, polynomial functions of the predictors). A kernel is a function that quantifies the similarity of two observations. A radial kernel is influenced mainly by nearby training observations; other kernel types include linear, polynomial and sigmoid.
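The radial (RBF) kernel mentioned above can be written directly (an illustrative Python sketch; `gamma` controls how local the influence of each training point is):

```python
import math

# Radial (RBF) kernel: similarity of two observations, decaying toward 0
# as the points move apart, so only nearby points have much influence.

def rbf_kernel(u, v, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

near = rbf_kernel((0, 0), (0.1, 0.1))  # close pair -> similarity near 1
far = rbf_kernel((0, 0), (5, 5))       # distant pair -> similarity near 0
```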

| Model | Number mailed | Profit | Actual 0 (Pred 0 / Pred 1) | Actual 1 (Pred 0 / Pred 1) | Accuracy |
|---------------------------------|---------------|-------------|----------------------------|----------------------------|----------|
| Tune.out (kernel = linear) | 816 | \$8,750.00 | 919 / 283 | 100 / 716 | 81.02% |
| Tune.out2 (kernel = linear) | 1,060 | \$10,422.50 | 824 / 134 | 195 / 865 | 83.70% |
| Tune.out3 (kernel = linear) | 1,056 | \$10,401.50 | 826 / 136 | 193 / 863 | 83.70% |
| Tune.out4 (kernel = polynomial) | 1,068 | \$9,957.00 | 785 / 165 | 234 / 834 | 80.23% |

# Prediction Modeling

Goal:  Minimize the prediction error on the donation amount

The aim is to quantify how close the predicted response value for a given observation is to its true response value. This is measured using the mean squared error (MSE), which is small when the predicted responses are close to the true responses.
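The MSE is a one-line computation (an illustrative Python sketch with made-up donation amounts; the report's prediction errors were computed in R):

```python
# Mean squared error: the average squared gap between the actual and
# predicted donation amounts.

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mse([14.5, 10.0, 20.0], [14.0, 11.0, 18.0]))  # (0.25 + 1 + 4) / 3 = 1.75
```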

## Least Squares Regression

Least squares regression measures closeness by choosing coefficients that minimize the sum of squared residuals.

Three least squares models were fit: LS1 using all of the variables; LS2 using all except genf and wrat but adding the transformed I(hinc^2); and LS3 using the most significant factors as determined by the random forest model.

| Model | Mean Prediction Error | Standard Error |
|-------|-----------------------|----------------|
| LS1 | 1.867523003 | 0.1696615221 |
| LS2 | 1.973121015 | 0.1720099842 |
| LS3 | 1.857983465 | 0.1699141406 |

Variables in LS3: chld, hinc, reg2, home, wrat, tdon, tgif, incm, npro, tlag, avhv, inca, plow, agif, lgif, rgif, reg1, reg4, reg3.

## Principal Component Regression

Principal component regression reflects the trade-off in choosing the number of components that explain most of the variability in the data; the selected components are then fit using least squares.

The training variance explained is shown below as the percentage of variance explained by each number of components; for example, using half of the components (10) explains 77.67% of the variance in X. The plot also reflects ‘elbows’ where the gain in explained variance levels off:

| Components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| X | 16.09 | 27.91 | 36.73 | 45.01 | 51.11 | 56.80 | 62.35 | 67.59 | 72.71 | 77.67 |
| damt | 0.03 | 28.46 | 28.54 | 36.52 | 46.58 | 47.08 | 48.92 | 49.23 | 49.36 | 49.57 |

| Components | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| X | 82.46 | 87.10 | 90.46 | 92.70 | 94.80 | 96.27 | 97.61 | 98.64 | 99.57 | 100.00 |
| damt | 49.77 | 51.17 | 51.98 | 51.98 | 56.49 | 56.50 | 56.58 | 56.58 | 57.18 | 57.22 |

| Model (# components) | Mean Prediction Error | Standard Error |
|----------------------|-----------------------|----------------|
| PCR – 2 | 2.953948616 | 0.2225891948 |
| PCR – 3 | 2.946779495 | 0.2230407325 |
| PCR – 5 | 2.15543888 | 0.1864263297 |
| PCR – 10 | 2.079405939 | 0.1839018125 |
| PCR – 15 | 1.865496754 | 0.1698902112 |
| PCR – 20 | 1.867523003 | 0.1696615221 |

*Figure: Principal Components Regression*

The table above reflects standardized predictors with a ten-fold cross-validation error for each possible number of components.

Based on the MPE, a PCR with 15 components results in the lowest MPE.

## Partial Least Squares (PLS) Regression

Partial Least Squares is a dimension reduction method that first identifies a new set of features that are linear combinations of the original features, and then fits a linear model by least squares. In constructing the new features, it favors the predictor directions most strongly related to the response variable.

| PLS – # of components | Mean Prediction Error | Standard Error |
|-----------------------|-----------------------|----------------|
| 6 | 1.868145 | 0.1700303 |
| 3 | 1.815055 | 0.1701451 |
| 2 | 1.852769 | 0.1712954 |

## Best Subset Selection with K-fold Cross-Validation

Best subset selection fits a separate least squares regression to each possible combination of the predictors and chooses the best model based on the smallest RSS.
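The exhaustive search can be sketched as follows (an illustrative Python sketch; a toy scoring function stands in for the RSS of a least squares fit, and the variable names below are simply examples from the dataset):

```python
from itertools import combinations

# Best subset selection sketch: score every combination of predictors
# and keep the one with the smallest score (in practice, the RSS or a
# cross-validated error of a least squares fit on that subset).

def best_subset(predictors, score):
    best, best_score = None, float("inf")
    for k in range(1, len(predictors) + 1):
        for combo in combinations(predictors, k):
            s = score(combo)
            if s < best_score:
                best, best_score = combo, s
    return best

# toy score: pretend 'chld' and 'hinc' together minimize the error
toy = lambda c: 0.0 if set(c) == {"chld", "hinc"} else len(c) + 1.0
print(best_subset(["chld", "hinc", "genf"], toy))  # ('chld', 'hinc')
```

The combinatorial cost is 2^p − 1 subsets for p predictors, which is why k-fold cross-validation is then used to pick among the best models of each size rather than refitting everything repeatedly.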

This results in a ten-variable model consisting of: reg3, reg4, home, chld, hinc, incm, plow, npro, rgif, agif.

| Model – Best Subset w/ k-Fold Cross-Validation | Mean Prediction Error | Standard Error |
|------------------------------------------------|-----------------------|----------------|
| 10 variables | 1.812687159 | 0.1686750592 |

The best subset selection plot, shown below, marks along its top row the variables included in the optimal model. The optimal models are chosen according to four statistical criteria: R-squared, adjusted R-squared, Mallows’ Cp and the Bayesian Information Criterion (BIC).

*Figure: Best Subset Selection*

## Ridge Regression

The grid of λ values was expanded to cover a whole range of scenarios, and the variables are standardized by default. The best value of λ is then selected using cross-validation.

| Ridge Regression Model | Mean Prediction Error | Standard Error |
|------------------------|-----------------------|----------------|
| Best λ = 0.1107589 | 1.873351 | 0.1711236 |

## The Lasso

As ridge regression will always build a model with all of the predictors, the lasso overcomes this by performing variable selection, finding the coefficients leading to the smallest RSS subject to a set constraint. Thus, for every value of λ there is a lasso coefficient estimate. It is closely related to best subset selection.
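The reason the lasso can zero out coefficients can be sketched with the soft-thresholding operator, which is the lasso solution in the special case of orthogonal predictors (an illustrative Python sketch; `lam` plays the role of the tuning parameter λ):

```python
# Soft-thresholding: shrink a least squares coefficient toward zero by
# lam, and set it exactly to zero when its magnitude is below lam.
# This is how the lasso performs variable selection.

def soft_threshold(beta, lam):
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

coefs = [soft_threshold(b, lam=0.5) for b in (1.2, 0.3, -0.9)]
print(coefs)  # approximately [0.7, 0.0, -0.4]; the small coefficient drops out
```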

| Lasso Model | Mean Prediction Error | Standard Error |
|----------------------|-----------------------|----------------|
| Best λ = 0.00877745 | 1.86133 | 0.1694185 |

Fitting a lasso model, we can see that depending on the choice of the tuning parameter, some of the coefficients will be exactly zero.

*Figure: Lasso*

# Results

The best models for classification and prediction are addressed in two separate parts.

### Classification

As reflected above in the ten classification models (Logistic Regression, LDA, QDA, KNN, GAM, Decision Trees, Bagging, Random Forest, Neural Network and Support Vector Machine), the results on the training and validation data provided different answers to the question of how to maximize the gross margin on the marketing campaign.

Based on accuracy in determining which targets are most likely to donate, the top three models are:

• K-Nearest Neighbors with k=7 resulting in 90.58% accuracy on the test data
• Bagging using 6 variables at each split resulting in 89.54% accuracy on the test data
• Random Forest with 5 variables at each split resulting in 88.80% accuracy on the test data

The best classification models based on Gross Margin are as follows:

• K-Nearest Neighbors with k= 9 resulting in \$11,852.00
• Logistic Regression (Log1) resulting in \$11,642.50
• Linear Discriminant Analysis resulting in \$11,624.50

For future data collection, some variables have proven to be more important than others, which may assist in reducing costs and/or replacing them with other variables more beneficial to the marketing efforts. For example, the random forest determined that the number of children a donor has, and whether they are a homeowner, were important in determining whether they would donate. Additionally, while most of the regions did not play a large role in determining whether someone would donate, region 2 was more important than the other three regional variables in the dataset.

### Prediction

As reflected above in the six prediction models (Least Squares, Principal Component Regression, Partial Least Squares, Best Subset, Ridge and Lasso), the Mean Prediction Error (MPE) and Standard Error (SE) on the different datasets provided different results in predicting the amount of the donation. The top three prediction models were:

• Best Subset with a MPE of 1.8126 and a SE of .168675
• Partial Least Squares with 3 with a MPE of 1.8150 and a SE of .1701451
• Lasso with a MPE of 1.86133 and a SE of .1694185

In order to maximize the value of the list of previous donors, two questions must be answered: a) who will donate, and b) how much. Various models were applied to each question; classification performance is reflected in a model’s gross margin, and the donation amount in its mean prediction error. Based on the above, the best models to use are a) K-Nearest Neighbors with k=7 to determine who will donate, and b) Best Subset Selection based on ten variables for the donation amount, which also has the lowest standard error — a measure of the average size of the deviations from the actual values.

# Machine Learning:  Charity Donor Analysis Conclusions

Utilizing the training, validation and test datasets has provided insight into the type of data collected by Charity Inc. and how it can be used to improve the organization’s marketing campaigns. The exploratory data analysis described the different types of data in the datasets, along with summary statistics and plots for ease of reference.

Next, a variety of machine learning and conventional statistical models were produced for both classification and prediction. Most of the models were run in several iterations, with the number of mailouts, profit, confusion matrix and accuracy reported for each. While numerous other models, iterations and variable transformations could also be included, this version aims to remain easy to interpret.

Based on the analysis, K-Nearest Neighbors is the best model for determining which people on the mailing list are most likely to donate, while the Best Subset Selection model is best for predicting the donation amount. Based on the test data, the marketing campaign should focus on 354 donors who will on average donate \$14.50, for a total predicted donation amount of \$5,132. Through these modelling efforts, Charity Inc. can be more confident in determining who should be targeted to maximize revenues while minimizing costs.

In future iterations, segmenting donors based on their projected contributions could help customize mailouts and mailout materials (and the related costs) to further maximize the charity’s marketing efforts and resulting donations.