Starbucks Capstone Challenge

Image taken from https://de.wikipedia.org/wiki/Starbucks#/media/Datei:Starbucks_Logo_ab_2011.svg

I fought my way through the Udacity Data Scientist Nanodegree Program, trying to become a better version of myself. I got frustrated, I got motivated, I made it halfway through, and this is my post about the Starbucks Capstone Challenge.

Project Definition

Project Overview

This project is about members of the Starbucks application, who receive offers of different types, with different difficulty levels, delivered via different media channels.

The goal of this project is to use everything I have learned to combine transaction, demographic and offer data of a simplified version of the real Starbucks app. I tried to implement an algorithm that can predict whether an offer will be completed, and another algorithm that can determine the offer type for the given variables.

The approach and the results are documented here.

The results can be split into three subcategories:

  1. Results via EDA:
    The conclusion is that the demographic group most responsive to offers consists of female members who are slightly older, loyal and earn more than the average. The offer type has no real impact.
  2. Results for predicting whether an offer will be successful:
    The prediction of whether an offer will be successful worked quite well.
  3. Results for predicting the offer type:
    The prediction of the offer type was not as good.

The project description is available here:
https://classroom.udacity.com/nanodegrees/nd025/parts/84260e1f-2926-4127-895f-cc4432b05059/modules/78dd932d-67a7-4039-9907-f8e6211e4590/lessons/d6285247-6bc0-4783-b118-6f41981b9469/concepts/480e9dc2-4726-4582-81d7-3b8e6a863450

The finished project is available here:
https://github.com/erenaltun91/starbucks_capstone_challenge

Problem Statement

The problem of this project is that the datasets have to be preprocessed. This preprocessing is tricky and needs attention.

Customer behavior on the Starbucks rewards mobile app is tracked, and the data requires a great deal of wrangling, cleaning and transformation before I can use it to make assumptions, predictions or heuristics.

First, the three datasets have to be cleaned and then merged to obtain information about customers and their underlying demographic characteristics.

After obtaining the cleaned dataset, the biggest problem was preparing it for the different algorithms and finding out why my classifier suffers from overfitting.

Metrics

accuracy — To measure the performance of the models I used accuracy. Accuracy measures how many of the given real target values match the predicted values. The higher the accuracy, the better the model, but an accuracy of 100 % is a sign of overfitting.
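As a quick illustration (with made-up labels purely for this example), accuracy can be computed with scikit-learn's `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Accuracy = number of matching predictions / total number of predictions.
acc = accuracy_score(y_true, y_pred)
print(acc)  # 6 of 8 labels match -> 0.75
```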

Analysis

Data Exploration and Visualization

The data provided by Udacity was simplified for this project and split into three datasets:

  1. The profile dataset —
    Contains demographic data like the gender, age, income and the id of the customer.
  2. The portfolio dataset —
    Contains data about the offers, the type of the offers, the duration of the offer, the reward for the offer, the difficulty, the channels and the id of the offer.
  3. The transcript dataset —
    Contains data about the person, the information about what happened to the offer (received, viewed, completed etc.) and a value column, which contains an offer_id, an “amount” if a transaction was made, or a “reward” when an offer was completed.

This dataset was really messy. I needed to clean it and transform a few columns to extract information from them. A lot of dummy variables were created. The age column was problematic too, because the age “118” was an encoding for missing data. The hardest part was extracting information from the “value” column. After all the cleaning I merged the three dataframes into one and extracted the offers from the transactions. In the end I extracted the completed offers to make assumptions about the demographic characteristics.

Now we can ask our questions of interest and begin the exploratory analysis by looking at one variable at a time.

Profile data:

  1. What is the age distribution of the participants of the Starbucks application?
Age distribution

We can see that most customers are between 50 and 70 years old, which is older than expected. The mean age is 54 years and the most frequent age is 58 years. 50 % of the customers are younger than 55 years.

2. What is the income distribution of the participants of the Starbucks application?

Income distribution

Most customers have an income of about 70,000 USD, and the number of customers decreases with increasing income. The mean income is about 65,000 USD and the most frequent income is 73,000 USD. 50 % of the customers have an income below 64,000 USD.

3. What is the gender distribution of the participants of the Starbucks application?

Gender distribution

The proportion of male participants (about 57 %) is roughly 15 percentage points higher than that of female participants (about 41 %).
Customers of other genders make up about 1.5 % of the total.

4. What is the distribution of the membership?

A clear increase in members can be seen in the distribution over the years from 2013 to 2018. From 2013 to 2017 there was an almost two-fold increase every year. From 2017 to 2018 there was a decrease of about 50 %.

Portfolio data:

1. How many different offers exist and what are the proportions of the offer types?

There are 10 different offers, with equal proportions of the “discount” and “bogo” offer types (40 % each). The remaining 20 % is made up by the informational offer type.

Transcript data:

  1. What is the distribution of the offer events?
Event distribution

The most frequent event is the transaction, whereas “offer completed” is the least frequent one. Transactions make up about half of all events. Almost a quarter of the events are received offers, and only 19 % of the events are viewed offers.

In the multivariate analysis we dig deeper into the data and try to combine different variables to answer specific questions like:

  1. How is income distributed between the genders?
Income distribution between genders

Most female members have an income of about 80,000 USD, but the numbers of females with low and high incomes are well balanced. Male members peak at 60,000 USD, and their number decreases after 70,000 USD and keeps falling with higher income. Members who chose the “Other” gender also peak at about 60,000 USD but maintain a better balance between low and high incomes.

2. How is age distributed between the genders?

Age distribution between genders

Female members peak between the ages of 50 and 70. Males peak in the same range, but more of the young members (ages 20 to 40) are male.
Members of other genders show the same age distribution as the female members.

3. Was there a year of membership that was dominated by a specific gender?

Membership distribution between genders

From 2013 to 2015, more and more male members started to use the app. In 2016 the number of women becoming members exceeded the number of men for the first time. In 2017 this trend reversed again. In 2018 the number of new female members dropped back to the 2016 level, whereas the number of new male members held up better.

Now we can look into the offer data and examine the variables that affect the completion of an offer and the choice of an offer type.

1. How does the gender influence the completion of an offer?

Event distribution between genders

We can see that the total number of completed offers is slightly higher for males. Relatively, however, the numbers of received and viewed offers are the same for male and female members, and the proportion of completed offers is higher for the female members.

2. Does membership duration have an impact on the response to offers?

Members from the years 2013 to 2016 viewed more of the offers they received and thus completed more of them. The newer members received more offers but viewed and completed fewer of them.

3. Are there age groups that lead to better offer completion?

Event distribution in age groups

We can see that people between the ages of 40 and 60 receive the most offers. People older than 60 complete offers best, while the youngest group completes offers worst.

4. Is there an income cluster that completes offers better?

Event distribution in income groups

The low-income groups complete fewer offers than the higher-income groups, and the best completion rate lies in the income range from 80,000 to 120,000 USD.

5. What are the general statistics of a completed offer?

Statistics of the completed offer

6. Does income influence the completed offer types?

Almost all income groups completed the bogo and discount offer types similarly. The income clusters from 40,000 to 80,000 USD preferred the discount over the bogo.

7. Does the age affect the completed offer types?

All age groups preferred the discount offer type over the bogo offer type.

8. Is there a specific offer type for a specific gender?

All genders prefer the discount offer.

9. Do long-term members prefer different kinds of offers?

All members prefer the discount offer slightly more.

Methodology

Data Preprocessing

First I created dummy variables from the offer type and the channels columns. For consistency I also renamed the id column to avoid problems with merging later.

Cleaned portfolio data
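A minimal sketch of this step, with made-up portfolio rows (the column names follow the dataset, the values here are invented):

```python
import pandas as pd

# Toy portfolio rows standing in for the real dataset (hypothetical values).
portfolio = pd.DataFrame({
    "id": ["o1", "o2", "o3"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "web"], ["email", "mobile"], ["web"]],
})

# One-hot encode the offer type ...
portfolio = pd.concat([portfolio, pd.get_dummies(portfolio["offer_type"])], axis=1)

# ... and explode the channel lists into indicator columns.
for channel in ("email", "mobile", "social", "web"):
    portfolio[channel] = portfolio["channels"].apply(lambda c: int(channel in c))

# Rename the id column so later merges have a consistent key name.
portfolio = portfolio.rename(columns={"id": "offer_id"})
print(portfolio[["offer_id", "bogo", "discount", "informational", "email", "web"]])
```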

Then I discovered that rows with the age 118, which was described as a coded missing value, also had missing values in the other columns. Because of that I removed all rows containing the age 118. Furthermore, I reformatted the membership date column and again renamed the id column for consistency.

Cleaned profile data
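The profile cleaning can be sketched like this (toy rows; as described above, the age-118 rows in the real data also have missing gender and income):

```python
import pandas as pd

# Toy profile rows; age 118 encodes a missing value.
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "gender": ["F", None, "M"],
    "age": [55, 118, 30],
    "became_member_on": [20170315, 20160801, 20180102],
    "income": [72000.0, None, 48000.0],
})

# Rows with age 118 also carry missing gender/income, so drop them outright.
profile = profile[profile["age"] != 118].copy()

# Parse the integer membership date into a real datetime.
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d")

# Consistent key name for merging.
profile = profile.rename(columns={"id": "customer_id"})
print(profile)
```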

The last dataframe was a bit tricky. I began by creating dummy variables from the events. The tricky part was the creation of an “offer_id” and an “amount” column out of the “value” column, which consists of dictionaries. After the computation ran for what felt like an eternity, I only had to rename the “person” column to “customer_id” for consistency and for the merging step.

Cleaned transcript data
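A sketch of the “value” extraction on toy rows; the real column holds dictionaries whose keys vary by event, including the inconsistent “offer id” / “offer_id” spelling:

```python
import pandas as pd

# Toy transcript rows; the "value" column holds dictionaries with varying keys.
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c2"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [
        {"offer id": "o1"},
        {"amount": 12.5},
        {"offer_id": "o1", "reward": 5},
    ],
})

# Dummy variables for the event column.
transcript = pd.concat([transcript, pd.get_dummies(transcript["event"])], axis=1)

# Pull "offer id" / "offer_id" and "amount" out of the dictionaries;
# the reward is ignored because another dataset already carries it.
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer id", v.get("offer_id")))
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))

# Consistent key name for merging.
transcript = transcript.rename(columns={"person": "customer_id"})
print(transcript[["customer_id", "event", "offer_id", "amount"]])
```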

I merged the profile and transcript datasets via the customer_id, and then merged the result with the portfolio data via the offer_id to obtain the full merged dataset.

Merged data

After merging the datasets I realized that assigning the offer types (bogo, discount, informational) to a certain demographic group can be hard, because the “transaction” event had no offer type assigned. Therefore I split the merged dataframe into an “offers” dataset (with offer types assigned) and a “transactions” dataset.

Offer data
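The merge-and-split step might look like this on minimal stand-in frames (values invented):

```python
import pandas as pd

# Minimal stand-ins for the three cleaned frames (hypothetical values).
profile = pd.DataFrame({"customer_id": ["c1"], "gender": ["F"], "income": [72000.0]})
portfolio = pd.DataFrame({"offer_id": ["o1"], "offer_type": ["bogo"], "difficulty": [10]})
transcript = pd.DataFrame({
    "customer_id": ["c1", "c1"],
    "event": ["offer received", "transaction"],
    "offer_id": ["o1", None],
})

# Join transcript to demographics on customer_id, then attach the offer
# attributes on offer_id; a left join keeps transactions without an offer_id.
merged = (transcript
          .merge(profile, on="customer_id", how="left")
          .merge(portfolio, on="offer_id", how="left"))

# Split offer events (with an offer type) from pure transactions.
offers = merged[merged["offer_id"].notna()]
transactions = merged[merged["offer_id"].isna()]
print(len(offers), len(transactions))
```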

To identify the demographic variables associated with better responses, I had to cluster the age and income variables for better visualisation.

As a last step I extracted the completed offers from the offers dataset and built a customer table in which all statistics of the customers can be seen. I used this customer dataset to decide which offer type should be given to which person.

After this exploratory data analysis I wanted to predict whether an offer will be completed, and furthermore I wanted to predict the offer type.

For the prediction of the completion of an offer I had to create an offer table with the offers per customer and a value indicating whether the offer was completed. I generated a lot of dummy variables. An offer counts as completed if the customer has seen it and then completed it.

For the prediction of the offer type I had to create the same kind of table, but with different variables, which had to be categorical.

Implementation

Functions:
I implemented the extraction of information from the “value” column in the transcript dataset. The “offer id / offer_id” and the “amount” had to be extracted from the value column. The “reward” information could be ignored, because it was contained in another dataset. This took so long that I ran it only once and saved all cleaned and transformed datasets to pickles, which I then loaded to visualize the data.

Furthermore, I implemented two functions for statistics. One is “build_customer_statistics()”, which returns a dataframe (table) with the summed-up information per customer. Addable variables were summed up, while categorical variables (gender, etc.) or non-summable variables (income, etc.) were simply extracted.

The “generate_general_metrics()” function generates general metrics about customers and the offers.

The function “create_offer_table_pred_completion()” processes the data further and prepares it for the modeling process. Here I used one-hot encoding for the categorical variables and split them into dummy variables. The main point of this function is the creation of the target variable “offer_viewed_completed”, which indicates whether an offer was viewed by the customer and then completed. This is an indicator of success.
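A minimal sketch of what such a function could do; the viewed/completed column names here are hypothetical stand-ins for the real ones:

```python
import pandas as pd

# Hypothetical per-offer rows: whether the customer viewed and completed each offer.
offers = pd.DataFrame({
    "gender": ["F", "M", "F", "O"],
    "offer_type": ["bogo", "discount", "discount", "bogo"],
    "offer viewed": [1, 1, 0, 1],
    "offer completed": [1, 0, 1, 1],
})

# One-hot encode the categorical columns into dummy variables.
offers = pd.get_dummies(offers, columns=["gender", "offer_type"])

# The target: an offer only counts as a success if it was viewed AND completed.
offers["offer_viewed_completed"] = (
    (offers["offer viewed"] == 1) & (offers["offer completed"] == 1)
).astype(int)
print(offers["offer_viewed_completed"].tolist())  # [1, 0, 0, 1]
```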

For the prediction of the offer type I processed the offer data with the function “create_offer_table_pred_offer_type()”, which handles the categorical data differently. It uses the LabelEncoder() class (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode all categorical variables. The target variable was the offer type.
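The label encoding can be sketched as follows (toy data; LabelEncoder assigns integer codes in sorted order of the categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with categorical columns; the offer type is the target.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "O"],
    "offer_type": ["bogo", "discount", "discount", "bogo"],
})

# LabelEncoder maps each category to an integer code (sorted alphabetically).
encoders = {}
for col in data.columns:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

print(data["offer_type"].tolist())  # bogo -> 0, discount -> 1: [0, 1, 1, 0]
```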

Models:
To predict the completion of an offer I split the data into training and test sets, with “offer_viewed_completed” as the target. Here I used the “RandomForestClassifier()” (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), with different parameters. The random forest algorithm is a supervised machine learning algorithm that uses a large number of decision trees (a large number of trees = a forest) to estimate an outcome. It aggregates the predictions of all decision trees to increase the accuracy.
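A minimal version of this training step, using synthetic data in place of the prepared offer table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared offer table.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A forest aggregates many decision trees; max_depth and n_estimators
# are the knobs tuned in the refinement step.
forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_train, y_train)

acc = accuracy_score(y_test, forest.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```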

To predict the offer type I used the “KNeighborsClassifier()” (https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html) and a Support Vector Machine (https://scikit-learn.org/stable/modules/svm.html#svm). The KNeighborsClassifier uses the distances between points in different dimensions. The Support Vector Machine tries to find the largest margin between two classes.
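Both classifiers can be sketched on synthetic three-class data (standing in for bogo / discount / informational):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic multiclass data standing in for the encoded offer table
# (three classes, like bogo / discount / informational).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN classifies by the majority label among the nearest points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# An SVM looks for the maximum-margin boundary; the kernel is swappable.
svm = SVC(kernel="rbf").fit(X_train, y_train)

print("knn:", knn.score(X_test, y_test))
print("svm:", svm.score(X_test, y_test))
```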

Refinement

In the refinement process for the models, I tried a lot of different parameters for the different models.

For the “RandomForestClassifier()” I tweaked max_depth and the number of estimators. The higher the number of estimators and the higher the max depth, the better the accuracy, but it also takes more time to fit the data.

I tried different numbers of neighbors in the “KNeighborsClassifier()” to improve the classification result.

Different kernels were used with the SVM to get better models to predict the offer type.
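A parameter sweep of this kind can be sketched as follows for the random forest (synthetic data; cross-validation is used here instead of a single split to make the scores more robust):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Sweep the two knobs mentioned above and report cross-validated accuracy.
for max_depth in (None, 5, 10):
    for n_estimators in (100, 1000):
        clf = RandomForestClassifier(max_depth=max_depth,
                                     n_estimators=n_estimators,
                                     random_state=1)
        score = cross_val_score(clf, X, y, cv=3).mean()
        print(f"max_depth={max_depth}, n_estimators={n_estimators}: {score:.3f}")
```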

Results

The question of interest was which demographic characteristics best describe the response to offers. The analytical results show that the most responsive customer is female, older than 50 years, a loyal member, and has an income of more than 80,000 USD.

Distribution of different demographic variables for events

Almost all income groups completed the bogo and discount offer types similarly. The income clusters from 40,000 to 80,000 USD preferred the discount over the bogo. All age groups preferred the discount offer type over the bogo, and all genders and membership cohorts prefer the discount offer. The informational offer does not even appear in the completed-offer dataset.

Distribution of different demographic variables for offer types

The models for predicting whether an offer will be completed were trained with the “RandomForestClassifier()” and had an accuracy between 73.88 % and 76.13 %. The accuracy changed with the following parameter combinations:

- max depth: None (nodes are expanded until all leaves are pure OR all leaves contain fewer than 2 samples)
- n_estimators: 100
- accuracy: 74.42 %

- max depth: 5
- n_estimators: 100
- accuracy: 73.88 %

- max depth: 10
- n_estimators: 100
- accuracy: 76.13 %

- max depth: 10
- n_estimators: 1000
- accuracy: 75.99 %

With a higher maximum depth, the accuracy increased. Increasing the number of estimators resulted in a slight decrease in accuracy.

The accuracy for the offer type classification with the KNeighborsClassifier() was always 1, which is a sign of overfitting and of problems with the small amount of training data.

The classification model with the SVM had an accuracy of about 39 %, which is not really good. Even with different kernels, the result stayed the same.

Justification

I chose the visuals to get a hang of the data.

To predict the offer completion I chose decision trees, because they can be used for both classification and regression.

For the prediction of the offer type I tried different classification algorithms.
I thought that the KNeighborsClassifier would be a good algorithm, because it calculates the distance between points, but with an accuracy of 100 % it will not be of any help, due to overfitting. The Support Vector Machine was chosen because it should deal with real-life data well, but the algorithm was disappointing due to its low accuracy.

Conclusion

Reflection

This project started with three datasets and a lot of cleaning, wrangling and transformation work to do. This was the hardest part in my opinion. The cleaning was easy, but the wrangling and extraction of the data was hard and took a lot of time. Finding the right transformations was also not easy. Another hard part was finding the right visualizations to answer the question, and whenever a new visualization became necessary, new transformations had to be made. After fighting through all that, the visualization part was kind of fun and gave insights into the whole project and the world of data. I found out that the most responsive members are loyal female members with an above-average income.
After the visualizations I had to find a way to predict the completion of an offer, and it was not easy to transform the data so that I could get a model out of it. Choosing the right algorithm and transforming the data again to match the algorithm's requirements was very hard.
The “RandomForestClassifier()” performed quite well, whereas the “KNeighborsClassifier()” always resulted in an accuracy of 100 %, which could be due to the curse of dimensionality and the small amount of data. The most disappointing algorithm was the SVM, because it achieved only 39 % accuracy. So, predicting the completion of an offer could work well, whereas classifying the offer type probably will not.

Improvement

The transformation took quite a lot of time, so I think the for loop could be replaced by vectorized operations.
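One vectorized alternative, assuming the “value” column holds plain dictionaries as in the transcript data, is `pd.json_normalize`, which flattens all dictionaries in a single pass instead of looping row by row:

```python
import pandas as pd

# Toy "value" column as in the transcript data (hypothetical rows).
transcript = pd.DataFrame({
    "value": [{"offer id": "o1"}, {"amount": 12.5}, {"offer_id": "o2", "reward": 5}],
})

# Flatten every dictionary in one vectorized pass.
flat = pd.json_normalize(transcript["value"].tolist())

# Collapse the two spellings of the offer id into one column.
flat["offer_id"] = flat["offer id"].fillna(flat["offer_id"])
print(flat[["offer_id", "amount"]])
```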

Other models for prediction and classification could be used, maybe some that can handle a small amount of training data better.

More different models could be used to see the differences between them.

A clear definition of what the clean data should look like should be established at the beginning, not during the process.

This was my project. Feel free to give me Feedback!

References

  1. https://classroom.udacity.com/nanodegrees/nd025/parts/84260e1f-2926-4127-895f-cc4432b05059/modules/78dd932d-67a7-4039-9907-f8e6211e4590/lessons/d6285247-6bc0-4783-b118-6f41981b9469/concepts/480e9dc2-4726-4582-81d7-3b8e6a863450
  2. https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html
  3. https://www.javatpoint.com/machine-learning-random-forest-algorithm
  4. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier

Hey y’all, my name is Eren. I am 29 years old and a Bioinformatician from Berlin. I love to dig into data and get cool insights or get some ML algos to work :)