Clothing store
2018
03/10
 
  Participants: 7  Submissions: 139  
 

Your task is to predict which customers are most likely to respond to a direct mail marketing promotion (classification problem). Company managers require that models be compared by cost/benefit analysis, so the performance score measures the effect of the classification model on the business’s profit. The cost/benefit table for this clothing store marketing promotion is the following:

True negative (predicted: Nonresponse, actual: Nonresponse) $0 (No contact, no lost profit)
True positive (predicted: Response, actual: Response) +$26.4 (Estimated profit minus cost of mailing)
False negative (predicted: Nonresponse, actual: Response) -$28.4 (Lost profit)
False positive (predicted: Response, actual: Nonresponse) -$2 (Mailing cost)

Given the cost/benefit table above, we define

SCORE = OVERALL PROFIT / NUMBER OF CUSTOMERS.

For example, the model “send a marketing promotion to everyone” yields, on the test set of 5376 subjects (4509 nonrespondents and 867 respondents), 0 TN, 867 TP, 0 FN and 4509 FP, for a SCORE of (867*26.4 - 4509*2)/5376 = $2.580134 per customer.

This is the benchmark profit that any predictive model should outperform.
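The SCORE computation can be sketched in R as follows. This is a minimal illustration, not the competition's official scoring code: the helper name `profit_score` and the 0/1 encoding of the labels (1 = response) are assumptions made here. The last lines reproduce the benchmark from the class counts quoted above.

```r
# Payoffs from the cost/benefit table (dollars per customer).
payoff <- c(TN = 0, TP = 26.4, FN = -28.4, FP = -2)

# profit_score() is a hypothetical helper name, not part of the competition.
# actual, predicted: vectors of 0/1 labels (assumed encoding: 1 = response).
profit_score <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  tp <- sum(predicted == 1 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  profit <- tp * payoff[["TP"]] + tn * payoff[["TN"]] +
            fp * payoff[["FP"]] + fn * payoff[["FN"]]
  profit / length(actual)   # overall profit / number of customers
}

# Benchmark: 4509 nonrespondents, 867 respondents, everyone is mailed.
actual <- c(rep(0, 4509), rep(1, 867))
benchmark <- profit_score(actual, rep(1, length(actual)))
print(round(benchmark, 6))
```

Passing a model's 0/1 predictions as `predicted` gives a score that can be compared directly with this benchmark.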

During the competition, the leaderboard displays your partial score, which is the score for 2688 (random) subjects of the test set. At the end of the contest, the leaderboard will display the final score, which is the score for the remaining 2688 subjects of the test set. The final score will determine the final winner. This method prevents users from overfitting to the leaderboard.
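The partial/final mechanism can be illustrated with a small simulation. The sketch below assumes only the class counts quoted above; the ordering of the labels, the seed, and the helper name `score` are made up for the illustration, and the organisers' actual split is not known.

```r
set.seed(1)  # arbitrary seed, used only to make this illustration reproducible

payoff <- c(TN = 0, TP = 26.4, FN = -28.4, FP = -2)

# score() is a hypothetical helper: average profit per customer.
score <- function(actual, predicted) {
  gain <- ifelse(predicted == 1,
                 ifelse(actual == 1, payoff[["TP"]], payoff[["FP"]]),
                 ifelse(actual == 1, payoff[["FN"]], payoff[["TN"]]))
  sum(gain) / length(actual)
}

# Labels with the quoted class counts; their order is invented.
actual    <- c(rep(0, 4509), rep(1, 867))
predicted <- rep(1, length(actual))   # the "mail everyone" benchmark

# Random split into two halves of 2688 subjects each.
partial_idx <- sample(length(actual), 2688)
partial <- score(actual[partial_idx],  predicted[partial_idx])
final   <- score(actual[-partial_idx], predicted[-partial_idx])
overall <- score(actual, predicted)

cat("partial:", partial, " final:", final, " overall:", overall, "\n")
```

Because the two halves have equal size, the overall score is the average of the partial and final scores, but each half on its own deviates from it by chance. A model tuned to maximise the partial score can therefore do worse on the final one.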

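# Read the training and test sets.
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Fit a logistic regression of RESP on all other variables.
fit <- glm(RESP ~ ., data = train, family = "binomial")
# Predicted response probabilities for the test set.
phat <- predict(fit, newdata = test, type = "response")
# Classify as response (1) when the predicted probability exceeds 0.5.
yhat <- ifelse(phat > 0.5, 1, 0)

# Write one prediction per line, without row or column names.
write.table(yhat, file = "mySubmission.txt", row.names = FALSE, col.names = FALSE)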
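The 0.5 cutoff in the example above ignores the cost/benefit table. One alternative, sketched here on the assumption that the predicted probabilities are reasonably well calibrated, is to mail a customer whenever the expected profit of mailing exceeds that of not mailing: p*TP + (1-p)*FP > p*FN + (1-p)*TN. Solving for p gives a cutoff of 2/56.8, roughly 0.035. The vector `phat_demo` is invented for illustration and stands in for the `phat` produced by `predict()`.

```r
payoff <- c(TN = 0, TP = 26.4, FN = -28.4, FP = -2)

# Mail when p*TP + (1-p)*FP > p*FN + (1-p)*TN, i.e. when p exceeds:
threshold <- (payoff[["TN"]] - payoff[["FP"]]) /
  (payoff[["TP"]] - payoff[["FP"]] - payoff[["FN"]] + payoff[["TN"]])
print(threshold)

# phat_demo is a made-up vector of predicted response probabilities.
phat_demo <- c(0.01, 0.03, 0.04, 0.20, 0.60)
yhat_cost <- ifelse(phat_demo > threshold, 1, 0)   # cost-based rule
yhat_half <- ifelse(phat_demo > 0.5, 1, 0)         # the 0.5 rule
print(rbind(yhat_cost, yhat_half))
```

The cost-based cutoff is low because a missed respondent costs $28.4 while a wasted mailing costs only $2. Whether it actually improves the SCORE depends on how good the model's probabilities are, so it is worth checking on held-out training data.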

The clothing-store data set represents actual data provided by a clothing store chain in New England. Data were collected on 51 variables for 28799 customers. For this dataset we have a partition of approximately 75% training and 25% test.

The clothing-store data set contains information about the following 51 variables:

  • HHKEY Customer ID: unique, encrypted customer identification
  • ZIP_CODE Zip code
  • FRE Number of purchase visits
  • MON Total net sales
  • AVRG Average amount spent per visit
  • AMSPEND, PSSPEND, CCSPEND, AXSPEND Amount spent at each of four different franchises (four variables)
  • OMONSPEND, TMONSPEND, SMONSPEND Amount spent in the past month, the past three months, and the past six months
  • STORELOY Amount spent the same period last year
  • GMP Gross margin percentage
  • PROMOS Number of marketing promotions on file
  • DAYS Number of days the customer has been on file
  • FREDAYS Number of days between purchases
  • MARKDOWN Markdown percentage on customer purchases
  • CLASSES Number of different product classes purchased
  • COUPONS Number of coupons used by the customer
  • STYLES Total number of individual items purchased by the customer
  • STORES Number of stores the customer shopped at
  • MAILED Number of promotions mailed in the past year
  • RESPONDED Number of promotions responded to in the past year
  • RESPONSERATE Promotion response rate for the past year
  • HI Product uniformity (low score = diverse spending patterns)
  • LTFREDAY Lifetime average time between visits
  • CLUSTYPE Microvision lifestyle cluster type
  • PERCRET Percent of returns
  • CC_CARD Flag: credit card user
  • VALPHON Flag: valid phone number on file
  • WEB Flag: Web shopper
  • PSWEATERS, PKNIT_TOPS, PKNIT_DRES, PHATS, PJACKETS, PCAR_PNTS, PCAS_PNTS, PSHIRTS, PDRESSES, PSUITS, POUTERWEAR, PJEWELRY, PFASHION, PLEGWEAR, PCOLLSPND, PC_CALC20: 15 variables providing the percentages spent by the customer on specific classes of clothing, including sweaters, knit tops, knit dresses, blouses, jackets, career pants, casual pants, shirts, dresses, suits, outerwear, jewelry, fashion, legwear, and the collectibles line; also a variable showing the brand of choice (encrypted).
  • RESP Target variable: response to promotion
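Some of the variables listed above are identifiers or category codes rather than quantities, which matters when fitting a formula such as `RESP ~ .`. The sketch below shows one possible preparation step on a toy data frame. The rows are invented, `prepare` is a hypothetical helper name, and the column types in the real train.csv are an assumption.

```r
# Toy rows standing in for train.csv; all values are invented.
toy <- data.frame(
  HHKEY    = c("a1", "b2", "c3", "d4"),
  ZIP_CODE = c(1001, 2139, 6511, 3301),
  FRE      = c(2, 5, 1, 8),
  CLUSTYPE = c(10, 1, 10, 4),
  RESP     = c(0, 1, 0, 1)
)

# prepare() is a hypothetical helper name.
prepare <- function(df) {
  df$HHKEY    <- NULL                 # customer identifier, not a predictor
  df$ZIP_CODE <- NULL                 # code with many levels; dropped here for simplicity
  df$CLUSTYPE <- factor(df$CLUSTYPE)  # cluster type is a category, not a quantity
  df
}

toy_clean <- prepare(toy)
str(toy_clean)
```

The same step would have to be applied to the test set, with the factor levels of CLUSTYPE kept consistent between the two, before calling `predict()`.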

One of the variables, the Microvision lifestyle cluster type, contains the market segmentation category for each customer as defined by Claritas Demographics. The six most common lifestyle cluster types in our data set are:

1. Cluster 10: Home Sweet Home: families, medium-high income and education, managers/professionals, technical/sales
2. Cluster 1: Upper Crust: metropolitan families, very high income and education, homeowners, managers/professionals
3. Cluster 4: Midlife Success: families, very high education, high income, managers/professionals, technical/sales
4. Cluster 16: Country Home Families: large families, rural areas, medium education, medium income, precision/crafts
5. Cluster 8: Movers and Shakers: singles, couples, students, and recent graduates, high education and income, managers/professionals, technical/sales
6. Cluster 15: Great Beginnings: young, singles and couples, medium-high education, medium income, some renters, managers/professionals, technical/sales




Train: train.csv (3 MB)
Test: test.csv (1000 KB)
#  Name            Score       Attempts  Last attempt
1  agila5          FINAL 2.96  6         05.10.2018 12:25
2  k.olobatuyi     FINAL 2.93  87        05.10.2018 15:12
3  p.maranzano     FINAL 2.88  32        05.10.2018 12:01
4  anna.comotti91  FINAL 2.53  8         05.10.2018 15:52
5  solari.aldo     FINAL 2.53  3         25.09.2018 19:00
6  benchmark       FINAL 2.53  3         25.09.2018 11:24