Customer Segment Analysis with k-Means Clustering

I took part in a group project with five others in which we analysed the use of unsupervised learning for customer segmentation. We imported a dataset from GitHub, cleaned it, engineered features and chose an unsupervised learning model to group the customers. We chose k-means clustering and grouped the customers on the basis of RFM analysis. We then outlined our findings and the limitations of the model. We carried out the analysis in a Python IDE.

Data Preprocessing

We loaded our dataset into a pandas DataFrame and removed impossible values. We then used z-scores and the interquartile range to identify and remove outliers, so that extreme values would not distort the clusters. We also removed duplicate entries, as these could reduce clustering accuracy. Finally, we created a correlation matrix to help avoid misinterpreting our data.
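The cleaning steps above could be sketched roughly as follows. This is an illustration rather than the project's actual code: the function name, the column names in the usage note, the thresholds (|z| <= 3 and 1.5 x IQR) and the reading of "impossible values" as non-positive numbers are all assumptions.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame, cols: list,
                       z_max: float = 3.0, iqr_k: float = 1.5) -> pd.DataFrame:
    """Drop duplicates, impossible values and outliers in the numeric `cols`."""
    # Exact duplicate rows would otherwise be counted twice.
    out = df.drop_duplicates()

    # Assumption: "impossible" means a non-positive quantity or price.
    out = out[(out[cols] > 0).all(axis=1)]

    # z-score filter: keep rows within z_max standard deviations of the mean.
    z = (out[cols] - out[cols].mean()) / out[cols].std(ddof=0)
    out = out[(z.abs() <= z_max).all(axis=1)]

    # IQR filter: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column.
    q1 = out[cols].quantile(0.25)
    q3 = out[cols].quantile(0.75)
    iqr = q3 - q1
    within = (out[cols] >= q1 - iqr_k * iqr) & (out[cols] <= q3 + iqr_k * iqr)
    return out[within.all(axis=1)]
```

With hypothetical columns named Quantity and UnitPrice, this would be called as clean_transactions(df, ["Quantity", "UnitPrice"]). Running both filters is stricter than running either alone, so the thresholds may need loosening on heavily skewed data.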

Feature Engineering

Once the data had been cleaned, we moved on to feature engineering using Recency, Frequency and Monetary (RFM) analysis. For each customer ID we created three features: the time since the last transaction (Recency), the total number of transactions (Frequency) and the total amount spent (Monetary).
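As a minimal sketch of how the three RFM features can be derived with pandas: the column names (CustomerID, InvoiceNo, InvoiceDate, Quantity, UnitPrice) and the snapshot date are assumptions for illustration and may not match the dataset we used.

```python
import pandas as pd


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate a transaction table into one RFM row per customer."""
    # Line total for each transaction row (hypothetical column names).
    tx = tx.assign(Amount=tx["Quantity"] * tx["UnitPrice"])
    return tx.groupby("CustomerID").agg(
        # Days between the snapshot date and the customer's latest purchase.
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        # Number of distinct invoices, i.e. separate transactions.
        Frequency=("InvoiceNo", "nunique"),
        # Total amount spent across all transactions.
        Monetary=("Amount", "sum"),
    )
```

The snapshot date is usually taken as the day after the last transaction in the data, so that the most recent customer has a Recency of one day rather than zero.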

Unsupervised Machine Learning Model: K-Means Clustering

We used customer behavioural data to group customers based on similarities. First, we needed to find the optimal number of clusters. We used the elbow method for this, which compares the within-cluster sum of squares across candidate cluster counts and looks for the point where the improvement levels off, and identified 3 as optimal. With the number of clusters chosen, we fitted the k-means model to our standardised dataset. This assigned each customer to one of the 3 clusters. We then added these cluster labels to the RFM table to begin visualising the clusters.
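The elbow search and the final fit might look something like the sketch below, assuming scikit-learn is used. The helper names, the fixed random seed and n_init=10 are illustrative choices, not a record of our exact settings.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

RFM_COLS = ["Recency", "Frequency", "Monetary"]


def standardise(rfm: pd.DataFrame):
    """Scale each RFM feature to zero mean and unit variance."""
    return StandardScaler().fit_transform(rfm[RFM_COLS])


def elbow_inertias(X, k_values, seed: int = 42) -> dict:
    """Within-cluster sum of squares (inertia) for each candidate k."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


def assign_clusters(rfm: pd.DataFrame, k: int = 3, seed: int = 42) -> pd.DataFrame:
    """Fit k-means on the standardised features and attach the labels."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(standardise(rfm))
    return rfm.assign(Cluster=model.labels_)
```

Plotting the values returned by elbow_inertias(standardise(rfm), range(1, 11)) against k gives the elbow curve. Note that the label numbers k-means returns are arbitrary, so which cluster ends up as 0, 1 or 2 can change between runs unless the seed is fixed.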

Findings

We used boxplots of Recency, Frequency and Monetary by cluster to analyse our RFM groupings. This allowed us to identify three groups of consumers: once-off buyers (0), customers (1) and very loyal customers (2). This is seen most clearly in the Frequency and Monetary cluster graphs, where cluster 2 spends the most money and visits the site most frequently (Appendix 1). We recommend that the business focus its efforts on customers in the very loyal category in order to maintain their support. We also recommend that it consider discounts for group purchases to encourage once-off buyers to purchase more.
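A numeric companion to the boxplots is a per-cluster summary of the medians, which is the centre line each box draws. The sketch below assumes an RFM table with a Cluster column; the function and output column names are hypothetical.

```python
import pandas as pd


def cluster_profile(rfm: pd.DataFrame) -> pd.DataFrame:
    """Customer count and median R, F and M for each cluster label."""
    return rfm.groupby("Cluster").agg(
        Customers=("Recency", "size"),
        MedianRecency=("Recency", "median"),
        MedianFrequency=("Frequency", "median"),
        MedianMonetary=("Monetary", "median"),
    )
```

Reading the table row by row makes it easier to attach names such as "once-off buyers" or "very loyal customers" to the numeric labels.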

Limitations

  • One of the primary limitations of k-means clustering is its sensitivity to outliers in the dataset. To mitigate this, we implemented code to remove outliers using z-scores and interquartile ranges.
  • K-means clustering assigns each data point to exactly one cluster. In the real world, some data points may belong to multiple clusters.
  • By fixing the number of clusters at three using the elbow method, we potentially limit the outcome of our algorithm, as we cannot be certain that three is the right number of clusters.

Click here for the full report, dataset and code.