Machine Learning: Charity Donor Analysis
Introduction
A charitable organization wishes to develop a machine learning model to improve the cost effectiveness of its direct marketing campaigns to previous donors. Recent mailing records reflect an overall 10% response rate with an average donation of $14.50; each mailing costs $2.00 to produce and send. The data consist of 3,984 training observations, 2,018 validation observations and 2,007 test observations. Weighted sampling that overrepresents responders has been used, so the training and validation samples contain approximately equal numbers of donors and nondonors. In addition, a second model should be built, using the records of donors only, to predict expected gift amounts.
Analysis
Exploratory Data Analysis
Based on R’s describe function, the initial dataset contains 22 variables and 8,009 observations with no missing data. The observations reflect prior mailing data, including whether the mailing was successful and a donation was received, the amount of that donation, and numerous other attributes that may or may not help determine who should receive future donation requests in order to maximize donations. Categorical and numerical variables, along with brief descriptions and statistical summaries, are shown below.
Categorical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| reg1 | Geographic region | 0.2 | 0.4 | 0 | 0 | 1 | 1.5 | 1.44e-01 | 1.38e-02 | 10.462 | <2e-16 *** |
| reg2 | Geographic region | 0.32 | 0.47 | 0 | 0 | 1 | 0.78 | 2.92e-01 | 1.23e-02 | 23.793 | <2e-16 *** |
| reg3 | Geographic region | 0.13 | 0.34 | 0 | 0 | 1 | 2.15 | 4.74e-03 | 1.59e-02 | 0.299 | 0.76505 |
| reg4 | Geographic region | 0.14 | 0.35 | 0 | 0 | 1 | 2.08 | 2.36e-02 | 1.54e-02 | 1.53 | 0.12609 |
| home | Homeowner; 1 = homeowner, 0 = not | 0.87 | 0.34 | 1 | 0 | 1 | 2.16 | 3.23e-01 | 1.37e-02 | 23.583 | <2e-16 *** |
| wrat | Wealth rating 0 to 9; 9 highest | 6.91 | 2.43 | 8 | 0 | 9 | 1.35 | 3.50e-02 | 1.87e-03 | 18.651 | <2e-16 *** |
| genf | Gender (0/1 indicator) | 0.61 | 0.49 | 1 | 0 | 1 | 0.43 | 1.54e-03 | 8.90e-03 | 0.173 | 0.86246 |
| donr | Donor; 1 = donor, 0 = nondonor | 0.5 | 0.5 | 0 | 0 | 1 | 0 | | | | |
Numerical Variables:
| Var | Description | Mean | SD | Median | Min | Max | Skew | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chld | Number of children | 1.72 | 1.4 | 2 | 0 | 5 | 0.27 | 1.65e-01 | 3.11e-03 | 52.918 | <2e-16 *** |
| hinc | Household income (7 categories) | 3.91 | 1.47 | 4 | 1 | 7 | 0.01 | 3.42e-01 | 1.24e-02 | 27.476 | <2e-16 *** |
| I(hinc^2) | | | | | | | | 4.24e-02 | 1.53e-03 | 27.643 | <2e-16 *** |
| avhv | Average home value in $000s | 5.14 | 0.37 | 5.13 | 3.87 | 6.57 | 0.14 | 1.35e-02 | 2.23e-02 | 0.604 | 0.5459 |
| incm | Median family income in $000s | 43.47 | 24.71 | 38 | 3 | 287 | 2.05 | 1.59e-03 | 3.74e-04 | 4.262 | 2.06e-05 *** |
| inca | Average family income in $000s | 56.43 | 24.82 | 51 | 12 | 305 | 1.94 | 3.27e-05 | 4.32e-04 | 0.076 | 0.93969 |
| plow | % categorized as “low income” | 14.23 | 13.41 | 10 | 0 | 87 | 1.36 | 1.40e-03 | 5.08e-04 | 2.76 | 0.00581 ** |
| npro | Lifetime # of promotions received to date | 60.03 | 30.35 | 58 | 2 | 164 | 0.31 | 1.46e-03 | 2.06e-04 | 7.103 | 1.36e-12 *** |
| tgif | $ value of lifetime gifts to date | 113.07 | 85.48 | 89 | 23 | 2057 | 6.55 | 1.31e-04 | 7.31e-05 | 1.786 | 0.07422 . |
| lgif | $ value of largest gift to date | 22.94 | 29.95 | 16 | 3 | 681 | 7.81 | 5.88e-05 | 2.22e-04 | 0.265 | 0.79094 |
| rgif | $ value of most recent gift | 15.66 | 12.43 | 12 | 1 | 173 | 2.63 | 3.46e-04 | 5.65e-04 | 0.612 | 0.54062 |
| tdon | # months since last donation | 18.86 | 5.78 | 18 | 5 | 40 | 1.1 | 5.27e-03 | 7.77e-04 | 6.78 | 1.32e-11 *** |
| tlag | # months between 1st and 2nd gift | 6.36 | 3.7 | 5 | 1 | 34 | 2.42 | 1.23e-02 | 1.20e-03 | 10.29 | <2e-16 *** |
| agif | Average $ value of gifts to date | 11.68 | 6.57 | 10.23 | 1.29 | 72.27 | 1.78 | 1.10e-03 | 9.78e-04 | 1.12 | 0.26285 |
| damt | Donation amount in $ | 7.21 | 7.36 | 0 | 0 | 27 | 0.12 | | | | |
Correlations
Reviewing the correlations between the variables and the donation amount (damt), some of the larger correlations are as follows:
| | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat |
|---|---|---|---|---|---|---|---|---|---|---|
| tdon | 0.0693 | 0.0180 | 0.0168 | 0.0157 | 0.0222 | 0.0162 | 0.0602 | 0.0147 | 0.0033 | 0.0215 |
| donr | 0.0308 | 0.0565 | 0.2471 | 0.1043 | 0.1263 | 0.2890 | 0.5308 | 0.0277 | 0.0173 | 0.2493 |
| damt | 0.0277 | 0.0469 | 0.2115 | 0.0837 | 0.0887 | 0.2877 | 0.5513 | 0.0453 | 0.0190 | 0.2429 |

| | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag |
|---|---|---|---|---|---|---|---|---|---|---|
| avhv | 1.0000 | 0.6939 | 0.8072 | 0.7118 | 0.0030 | 0.0178 | 0.0167 | 0.0052 | 0.0064 | 0.0061 |
| incm | 0.6939 | 1.0000 | 0.8729 | 0.6555 | 0.0171 | 0.0417 | 0.0067 | 0.0027 | 0.0139 | 0.0213 |
| inca | 0.8072 | 0.8729 | 1.0000 | 0.6346 | 0.0183 | 0.0354 | 0.0075 | 0.0054 | 0.0163 | 0.0169 |
| tgif | 0.0178 | 0.0417 | 0.0354 | 0.0176 | 0.7266 | 1.0000 | 0.1734 | 0.0736 | 0.0113 | 0.0117 |
| lgif | 0.0167 | 0.0067 | 0.0075 | 0.0059 | 0.0013 | 0.1734 | 1.0000 | 0.6961 | 0.0036 | 0.0168 |
| rgif | 0.0052 | 0.0027 | 0.0054 | 0.0136 | 0.0125 | 0.0736 | 0.6961 | 1.0000 | 0.0063 | 0.0122 |
| agif | 0.0000 | 0.0103 | 0.0002 | 0.0137 | 0.0022 | 0.0558 | 0.6096 | 0.7053 | 0.0080 | 0.0299 |

| | agif | donr | damt |
|---|---|---|---|
| chld | 0.0149 | 0.5308 | 0.5513 |
| rgif | 0.7053 | 0.0149 | 0.0851 |
| donr | 0.0095 | 1.0000 | 0.9826 |
Logistic Regression
Some of the significant variables are shown below; additional insight into which variables matter most for each model is provided in the discussion of that model.
| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.5184 | 0.0661 | 7.845 | 4.34e-15 *** |
| reg1 | 0.6449 | 0.0716 | 9.012 | <2e-16 *** |
| reg2 | 1.4807 | 0.0842 | 17.577 | <2e-16 *** |
| home | 1.3657 | 0.0836 | 16.342 | <2e-16 *** |
| chld | 2.3711 | 0.0854 | 27.779 | <2e-16 *** |
| I(hinc^2) | 1.0750 | 0.0530 | 20.283 | <2e-16 *** |
| wrat | 0.9726 | 0.0667 | 14.589 | <2e-16 *** |
| npro | 0.4552 | 0.0786 | 5.790 | 7.05e-09 *** |
| tdon | 0.2945 | 0.0603 | 4.883 | 1.04e-06 *** |
| tlag | 0.5543 | 0.0602 | 9.207 | <2e-16 *** |
Classification Modeling
Goal: Maximize the gross margin (expected revenue less the costs) on the marketing campaign
Utilizing the data, various machine learning modeling techniques were applied to determine which households should receive the marketing materials, i.e. those most likely to be donors. Each technique produces a confusion matrix that is used to determine the percentage of mailers that are successful versus those that do not ultimately result in a donation. For example, Quadratic Discriminant Analysis (QDA) produces the following confusion matrix:
| | Actual nondonor | Actual donor | Total |
|---|---|---|---|
| Predicted nondonor (not mailed) | 526 | 23 | 549 |
| Predicted donor (mailed) | 493 | 976 | 1,469 |
The matrix reflects which campaign targets are predicted to donate. The predictions will not always be correct, so some funds will be spent without a return: targeting the 1,469 households predicted to be donors yields only 976 donations. On the flip side, a cost saving of 549 mailouts is achieved by not sending the campaign to targets predicted not to donate. Accuracy measures how often the model is correct, counting both those correctly predicted not to donate and those correctly predicted to donate, and is calculated as (526 + 976) / (549 + 1,469). Thus, our accuracy is 74.43%.
The second part of the analysis deals with the size of the donation, but for now we use past experience, which reflects an average donation of $14.50 against a cost of $2.00 per mailing. The margin is therefore ($14.50 × 976) − ($2.00 × 1,469) = $11,214. It is easy to see why it pays to mail only to those most likely to donate.
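As a quick sketch of this gross-margin arithmetic (in Python rather than the R used for the analysis), profit and accuracy can be computed directly from the confusion matrix counts:

```python
# Expected campaign profit and accuracy from a 2x2 confusion matrix.
# Figures from the report: $14.50 average donation, $2.00 cost per mailing.
AVG_DONATION = 14.50
COST_PER_MAIL = 2.00

def campaign_metrics(tn, fn, fp, tp):
    """tn/fn: households not mailed; fp/tp: households mailed (predicted donors)."""
    mailed = fp + tp
    profit = AVG_DONATION * tp - COST_PER_MAIL * mailed
    accuracy = (tn + tp) / (tn + fn + fp + tp)
    return profit, accuracy

# QDA confusion matrix above: 526 correct non-mailings, 23 missed donors,
# 493 wasted mailings, 976 successful mailings.
profit, acc = campaign_metrics(526, 23, 493, 976)
print(round(profit, 2), round(acc * 100, 2))  # 11214.0 74.43
```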
A summary by each machine learning method is as follows:
Logistic Regression
Logistic regression models the probability of a binary response, for instance whether or not someone will donate to the charity, based on one or more predictor variables.
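As a minimal illustration of the logistic link (a Python sketch; the coefficient values here are hypothetical, not taken from the fitted model):

```python
import math

def logistic_prob(x, coefs, intercept):
    """P(response = 1) via the logistic function: 1 / (1 + e^(-eta))."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative only: two standardized predictors with made-up coefficients.
p = logistic_prob([1.0, 0.5], coefs=[0.64, 1.48], intercept=0.52)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```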
Two logistic models were created based on the earlier analysis: the first (Log1) uses all 21 variables, while the second (Log2) uses 9 variables selected for their significance. The accompanying plot (not reproduced here) showed profit for the logistic models.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Log1 | 1,291 | $11,642.50 | 83.74628% |
| Log2 | 1,389 | $11,533.50 | 79.48464% |
Variables Log2: chld + home + reg2 + I(hinc^2) + wrat + tdon + incm + tlag + tgif
Linear Discriminant Analysis (LDA)
LDA uses a Bayes classifier and a threshold on the posterior probability to assign observations to a class, plugging in estimates for several parameters, including a weighted average of the sample variances for each class. The first LDA model uses 21 variables, including a transformation of hinc; LDA2 uses a subset of the most significant variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| LDA1 | 1,329 | $11,624.50 | 82.25966% |
| LDA2 | 1,348 | $11,557.50 | 81.11992% |
Variables in LDA2: reg1, reg2, reg3, chld, home, I(hinc^2), wrat, npro, tdon, tlag
Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations in each class are drawn from a Gaussian distribution and forms predictions by plugging parameter estimates into Bayes’ theorem. However, QDA fits a quadratic decision boundary and estimates a separate covariance matrix for each class, making it more flexible than LDA, with lower bias at the cost of higher variance.
Each QDA model used a different set of variables: QDA1 uses nine variables, QDA2 all 21 variables, and QDA3 seven variables.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| QDA1 | 1,469 | $11,214.00 | 83.25074% |
| QDA2 | 1,372 | $11,219.50 | 77.94846% |
| QDA3 | 1,402 | $11,232.00 | 76.95738% |
Variables: QDA1: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag; QDA3: chld, home, wrat, hinc, reg2, tdon, incm
K-Nearest Neighbors (KNN)
KNN classifies an observation based on its k nearest neighbors; in weighted variants, nearer neighbors contribute more to the vote than observations further away. A small value of k gives the most flexible fit, with low bias but high variance because the prediction relies on very few observations; conversely, a larger k gives a smoother, less variable fit but may introduce bias by masking some of the structure.
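The voting logic can be sketched in a few lines (a toy Python version using the unweighted majority vote; the report's models were fit in R):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; unweighted vote)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled 0 (nondonor) and 1 (donor).
X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [5, 6], k=3))  # 1
```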
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| KNN k=1 | 1,037 | $11,306.00 | 88.15659% |
| KNN k=2 | 1,114 | $11,155.50 | 86.76908% |
| KNN k=3 | 1,101 | $11,573.00 | 90.0892% |
| KNN k=4 | 1,101 | $11,573.50 | 88.80079% |
| KNN k=5 | 1,130 | $11,718.00 | 90.03964% |
| KNN k=6 | 1,142 | $11,737.50 | 89.74232% |
| KNN k=7 | 1,142 | $11,813.50 | 90.58474% |
| KNN k=9 | 1,150 | $11,852.00 | 84.8696% |
| KNN k=10 | 1,157 | $11,765.50 | 89.39544% |
Generalized Additive Models (GAM)
GAM models go beyond linear models by allowing nonlinear functions of each variable while remaining additive, meaning that the effect of a change in one predictor on the response is independent of the values of the other predictors. One implicit assumption, then, is that interactions among marketing activities do not create synergies; otherwise an interaction term would be needed, for example when a mailing campaign is combined with radio, television or other marketing activities. GAM fits smoothing splines via backfitting, which accommodates multiple predictors by repeatedly updating the fit for each predictor in turn.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| GAM1 | 1,036 | $10,818.50 | 87.56194% |

Variables: reg1, reg2, home, chld, I(hinc^2), wrat, npro, tdon, tlag
Decision Trees
Decision trees can be applied to both regression and classification problems and are commonly used for their ease of interpretability. Starting at the top of the tree, the data are split into branches based on the best split at that particular junction, without looking ahead to determine whether a different split would pay off later; this is known as a ‘greedy’ approach. At each split, the predictor space is divided to achieve the greatest possible reduction in the RSS. Because growing a tree fully on the training set is likely to produce an overly complex tree that overfits the data, one alternative is to grow the tree only as long as the decrease in the RSS exceeds a threshold.
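The greedy, RSS-minimizing split described above can be sketched as follows (a toy Python version for a single numeric predictor):

```python
def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Greedy single-predictor split: pick the cutpoint that most reduces RSS."""
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Toy example with a clean break between low and high responses.
cut, split_rss = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
print(cut, split_rss)  # 10 0.0
```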
Several decision trees were also modeled, with 15, 9 and 5 terminal nodes. The results are as follows:
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Decision tree – 15 | 1,168 | $11,149.00 | 84.83647% |
| Decision tree – 9 | 962 | $10,038.50 | 84.5887% |
| Decision tree – 5 | 1,078 | $10,212.50 | 81.61546% |

Bagging
Bagging focuses on reducing the variance of a statistical learning method such as decision trees. It does so by taking repeated bootstrap samples from the training set, fitting a tree to each (often hundreds or thousands of trees), and averaging all of the predictions, which lowers the high variance of individual trees while preserving their low bias.
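The two ingredients of bagging, bootstrap resampling and averaging (majority-voting) the resulting predictions, can be sketched as follows (Python, illustrative only):

```python
import random

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from an n-row training set."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Bagging for classification: average the trees by majority vote."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(len(sample), majority_vote([1, 1, 0]))  # 10 1
```

In practice each bootstrap sample would be used to grow a full tree; here only the resampling and voting mechanics are shown.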
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=20 | 1,037 | $11,063.00 | 88.8999% |
| mtry=10 | 1,050 | $11,167.50 | 89.14767% |
| mtry=6 | 1,063 | $11,301.00 | 89.5441% |
| mtry=5 | 1,055 | $11,143.00 | 88.80079% |
Random Forest
Random forests improve on bagging by tweaking how the splits in each decision tree are chosen: at each split, a fresh random sample of predictors is drawn from the full set, and only those candidates are considered. This decorrelates the trees, making the average of the resulting trees less variable and more reliable.
Additionally, the importance function in R provides a list of the most important variables. As shown below, this list informed the choice of the ‘mtry’ setting, which controls how many candidate variables are considered at each split, in each of the random forest models. The two most important variables were chld and home.
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| mtry=5 | 1,059 | $11,149.50 | 87.0998% |
| mtry=3 | 1,065 | $11,152.00 | 88.60258% |
| mtry=2 | 1,101 | $11,341.00 | 88.50347% |

Neural Network
A neural network mimics the learning pattern of biological neural networks. It begins with a single perceptron that receives inputs, applies weights and passes the result through an activation function to produce an output; perceptrons are then layered to create a network. Hidden layers are the layers between the input and output layers, whose intermediate values are not directly observed (Portilla, Jose. 2016. A Beginner’s Guide to Neural Networks with R! Retrieved from kdnuggets.com).
| Model | Number mailed | Profit | Accuracy |
|---|---|---|---|
| Neural network – 10 10 50 | 1,047 | $10,767.50 | 86.57086% |
| Neural network – 10 10 | 1,054 | $10,855.00 | 86.86819% |

Support Vector Machine
Support vector machines are designed for binary classification problems such as donr, including cases with nonlinear class boundaries, which are addressed by enlarging the feature space using kernels (for example, polynomial functions of the predictors). A kernel is a function that quantifies the similarity of two observations. A radial kernel is influenced primarily by nearby training observations; other kernel types include linear, polynomial and sigmoid.
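A radial (RBF) kernel can be written in a few lines (Python sketch; gamma is the usual tuning parameter controlling how quickly similarity decays with distance):

```python
import math

def radial_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2): near 1 for close
    observations, decaying toward 0 as they move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(radial_kernel([0, 0], [0, 0]))            # 1.0 (identical points)
print(round(radial_kernel([0, 0], [3, 4]), 6))  # 0.0 (distant points)
```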
| Model | Kernel | Number mailed | Profit | Accuracy |
|---|---|---|---|---|
| tune.out | linear | 816 | $8,750.00 | 81.0208127% |
| tune.out2 | linear | 1,060 | $10,422.50 | 83.6967294% |
| tune.out3 | linear | 1,056 | $10,401.50 | 83.6967294% |
| tune.out4 | polynomial | 1,068 | $9,957.00 | 80.227948% |

Prediction Modeling
Goal: Minimize the prediction error based on the donation amount
The aim is to quantify the extent to which the predicted response for a given observation is close to its true response. This is measured using the mean squared error (MSE), which is small when the predicted responses are close to the true responses.
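As a concrete sketch (Python, with made-up donation amounts):

```python
def mse(actual, predicted):
    """Mean squared error: average squared gap between truth and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy donation amounts ($) versus hypothetical model predictions.
print(mse([14.5, 10.0, 20.0], [14.0, 11.0, 19.0]))  # (0.25 + 1 + 1) / 3 = 0.75
```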
Least Squares Regression
Least squares regression measures closeness by choosing the coefficients that minimize the sum of squared residuals.
Three least squares models were fit: LS1 uses all of the variables; LS2 uses all except genf and wrat but adds the transformed hinc^2; LS3 uses the most significant factors as determined by the random forest model.
| Model | Mean Prediction Error | Standard Error |
|---|---|---|
| LS1 | 1.867523003 | 0.1696615221 |
| LS2 | 1.973121015 | 0.1720099842 |
| LS3 | 1.857983465 | 0.1699141406 |
Variables in LS3: chld, hinc, reg2, home, wrat, tdon, tgif, incm, npro, tlag, avhv, inca, plow, agif, lgif, rgif, reg1, reg4, reg3.
Principal Component Regression
Principal component regression reflects the trade-off in choosing the number of components that explain most of the variability in the data; the selected components are then fit using least squares.
The training variance explained by each number of components is shown below; for example, using half of the components (10 of 20) explains 77.67% of the variance in the predictors. The associated scree plot (not reproduced here) also reflects ‘elbows’ where the incremental variance explained drops off.
| # Components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 16.09 | 27.91 | 36.73 | 45.01 | 51.11 | 56.8 | 62.35 | 67.59 | 72.71 | 77.67 |
| damt | 0.03 | 28.46 | 28.54 | 36.52 | 46.58 | 47.08 | 48.92 | 49.23 | 49.36 | 49.57 |

| # Components | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 82.46 | 87.1 | 90.46 | 92.7 | 94.8 | 96.27 | 97.61 | 98.64 | 99.57 | 100 |
| damt | 49.77 | 51.17 | 51.98 | 51.98 | 56.49 | 56.5 | 56.58 | 56.58 | 57.18 | 57.22 |
| Model – # Components | Mean Prediction Error | Standard Error |
|---|---|---|
| PCR – 2 | 2.953948616 | 0.2225891948 |
| PCR – 3 | 2.946779495 | 0.2230407325 |
| PCR – 5 | 2.15543888 | 0.1864263297 |
| PCR – 10 | 2.079405939 | 0.1839018125 |
| PCR – 15 | 1.865496754 | 0.1698902112 |
| PCR – 20 | 1.867523003 | 0.1696615221 |
The table above reflects standardized predictors, with tenfold cross-validation error computed for each possible number of components. Based on the MPE, PCR with 15 components gives the lowest error.
Partial Least Squares (PLS) Regression
Partial Least Squares is a dimension reduction method that first identifies a new set of features that are linear combinations of the original features and then fits a linear model by least squares. Unlike PCR, it uses the response when constructing the new features, favoring directions most strongly related to the response variable.
| PLS – # of Components | Mean Prediction Error | Standard Error |
|---|---|---|
| 6 | 1.868145 | 0.1700303 |
| 3 | 1.815055 | 0.1701451 |
| 2 | 1.852769 | 0.1712954 |
Best Subset Selection with Kfold CrossValidation
Best subset selection fits a separate least squares regression to each possible combination of the predictors and, for each model size, keeps the combination with the smallest RSS; the final model size is then chosen by cross-validation.
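One reason best subset selection becomes infeasible for large numbers of predictors: with p predictors there are 2^p candidate models to fit. A quick Python check:

```python
from itertools import combinations

def subset_count(p):
    """Best subset selection fits one model per predictor combination: 2^p total."""
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(subset_count(10))  # 1024 candidate models for 10 predictors
```

At p = 20, as in this dataset, that is already over a million candidate models, which is why branch-and-bound or stepwise shortcuts are common.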
The procedure results in a ten-variable model consisting of: reg3, reg4, home, chld, hinc, incm, plow, npro, rgif, agif.
| Model – Best Subset w/ k-Fold Cross-Validation | Mean Prediction Error | Standard Error |
|---|---|---|
| 10 Variables | 1.812687159 | 0.1686750592 |
The best subset selection plot (not reproduced here) shows, in its top row, the variables included in the optimal model. Optimal models are identified using four different statistics: R-squared, adjusted R-squared, Mallows’ Cp and the Bayesian Information Criterion (BIC).
Ridge Regression
The grid of lambda values was expanded to cover a wide range of scenarios, and the variables are standardized by default. The best value of lambda is selected using cross-validation.
| Ridge Regression Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.1107589 | 1.873351 | 0.1711236 |
The Lasso
Whereas ridge regression always retains all of the predictors, the lasso overcomes this by performing variable selection: it seeks the coefficients with the smallest RSS subject to a constraint on their total absolute size. Thus, for every value of the tuning parameter lambda there is a corresponding set of lasso coefficient estimates, and the method is closely related to best subset selection.
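The variable-selection behavior comes from the soft-thresholding operator at the heart of the lasso's coordinate-wise updates; a Python sketch (the lambda value is illustrative):

```python
def soft_threshold(b, lam):
    """Lasso's coordinate update: shrink b toward 0, and set it exactly to 0
    when |b| <= lambda -- this is how the lasso performs variable selection."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

print(soft_threshold(0.5, 0.2), soft_threshold(0.1, 0.2))  # 0.3 0.0
```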
| Lasso Model | Mean Prediction Error | Standard Error |
|---|---|---|
| Best lambda = 0.00877745 | 1.86133 | 0.1694185 |
Fitting a lasso model, we can see that, depending on the choice of the tuning parameter, some of the coefficients are shrunk to exactly zero.
Results
The best models for classification and prediction are addressed in two separate parts.
Classification
As reflected above, the ten classification approaches (logistic regression, GAM, LDA, decision trees, QDA, bagging, random forest, KNN, neural network and support vector machine) produced different results on the training and validation data in determining how to maximize the gross margin of the marketing campaign.
Based on accuracy in determining which recipients are most likely to donate, the top three models are:
- K-Nearest Neighbors with k=7, resulting in 90.58% accuracy on the test data
- Bagging using 6 variables at each split, resulting in 89.54% accuracy on the test data
- Random forest with 3 variables at each split, resulting in 88.60% accuracy on the test data
The best classification models based on Gross Margin are as follows:
- K-Nearest Neighbors with k=9, resulting in $11,852.00
- Logistic regression (Log1), resulting in $11,642.50
- Linear Discriminant Analysis (LDA1), resulting in $11,624.50
For future data collection, some variables have proven more important than others, which may help reduce costs and/or allow replacement with other variables more beneficial to the marketing efforts. For example, the random forest determined that the number of children a donor has and whether they are a homeowner were important in determining whether they would donate. Additionally, while most of the regions did not play a large role in determining whether a donor would donate, region 2 was more important than the other three regional variables in the dataset.
Prediction
As reflected above, the six prediction models (least squares, partial least squares, principal component regression, best subset, ridge and lasso) produced different Mean Prediction Errors (MPE) and Standard Errors (SE) in predicting the amount of the donation. The top three prediction models were:
- Best subset selection, with an MPE of 1.8126 and an SE of 0.168675
- Partial least squares with 3 components, with an MPE of 1.8150 and an SE of 0.1701451
- Lasso, with an MPE of 1.86133 and an SE of 0.1694185
To maximize the value of the list of previous donors, we need to answer two questions: a) who will donate, and b) how much. Classification models, evaluated by the gross margin they produce, address the first question; prediction models, evaluated by mean prediction error, address the second. Based on the above, the best choices are a) K-Nearest Neighbors with k=7 to determine who will donate, and b) best subset selection with ten variables for the donation amount, which also has the lowest standard error, a measure of the average spread of the predictions around the actual values.
Conclusions
Utilizing the training, validation and test datasets has provided insight into the type of data collected by Charity Inc. and how it can be used to benefit the company’s marketing campaign. The exploratory data analysis described the different types of data contained within the datasets, along with their statistical summaries, for ease of reference.
Next, a variety of machine learning and regression models were produced for both classification and prediction. Most of the models were evaluated across several iterations, reporting the number of mailouts, profit, confusion matrix and accuracy. While numerous other models could also be included, along with additional iterations and transformations of variables, this version has tried to remain easily interpretable.
Based on the analysis, K-Nearest Neighbors is the best model for determining which people on the mailing list are most likely to donate, while the best subset selection model is best for predicting the donation amount. Based on the test data, the marketing campaign should focus on 354 donors who will on average donate $14.50, for a total predicted donation amount of $5,132. Through the modeling efforts contained within, Charity Inc. can be more confident in determining who should be targeted to maximize revenues while minimizing costs.
In future iterations, segmenting donors based on their projected contributions could help customize mailouts and mailout materials (and their related costs) to further maximize the charity’s marketing efforts and resulting donations.