"Churn Rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It's a critical figure in many businesses, as it's often the case that acquiring new customers is a lot more costly than retaining existing ones (in some cases, 5 to 20 times more expensive).
Understanding what keeps customers engaged, therefore, is incredibly valuable, as it is a logical foundation from which to develop retention strategies and roll out operational practices aimed at keeping customers from walking out the door. Consequently, there's growing interest among companies in developing better churn-detection techniques, leading many to look to data mining and machine learning for new and creative approaches.
Predicting churn is particularly important for businesses with subscription models such as cell phone, cable, or merchant credit card processing plans. But modeling churn has wide-reaching applications in many domains. For example, casinos have used predictive models to work out ideal room conditions for keeping patrons at the blackjack table and when to reward unlucky gamblers with front-row seats to Celine Dion. Similarly, airlines may offer first-class upgrades to complaining customers. The list goes on.
This is a post about modeling customer churn using Python.
One motivation for modeling this is to see what churn looks like in a telecoms setting.
from __future__ import division
import pandas as pd
import numpy as np
churn_df = pd.read_csv('data/churn.csv')
col_names = churn_df.columns.tolist()
print("Column names:")
print(col_names)
to_show = col_names[:6] + col_names[-6:]
print("\nSample data:")
churn_df[to_show].head(6)
Survival Analysis
One interesting approach is survival analysis. The basic idea is to estimate, using some fancy statistics, the 'survival curve': the probability that a subject is still around as a function of time. In telecommunications the question is when a subscriber will leave the service. In HR the question might be 'what characteristics do employees who leave our company have?'. There are many interesting applications of survival analysis.
#Firstly let us look at the column headings.
churn_df.columns.tolist()
#We noted above that Churn had string values. The lifelines library that we'll use doesn't like that and prefers 1 and 0.
# So we transform this.
d = {'True.':1, 'False.':0}
churn_df['Churn?'] = churn_df['Churn?'].map(d)
#Let us look again at the head of the data.
churn_df.head()
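We can see that some subscribers in this list have not churned, so their lifetimes are censored: we only know that they have lasted at least as long as their account length so far.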
Before we dig in to the data, I first need to introduce the first mathematical creature in survival analysis: the survival function!
Survival function
Define a subscriber's lifetime, the time between when they first purchased the subscription and when they churned, as capital $T$. Let small $t$ represent the number of days since they first subscribed, that is, since their "birth". Then the survival function, $S(t)$, is defined as:
$$ S(t) = P(T > t) $$ In words: what is the probability that a randomly chosen individual from the population lasts longer than $t$?
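To make the definition concrete, here is a minimal sketch that assumes every lifetime is fully observed (no censoring). The `lifetimes` values are invented toy data and `empirical_survival` is a hypothetical helper, not part of any library.

```python
def empirical_survival(lifetimes, t):
    """Fraction of lifetimes strictly greater than t, an estimate of P(T > t).

    Only valid when every lifetime is fully observed (no censoring).
    """
    return sum(1 for life in lifetimes if life > t) / len(lifetimes)


# Toy lifetimes in days (invented for illustration).
lifetimes = [30, 45, 45, 60, 90, 120, 150, 200]

for t in (0, 45, 100, 200):
    print(t, empirical_survival(lifetimes, t))
```

With real churn data some customers have not left yet, so this simple fraction is not enough on its own.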
The survival curve actually gives us a complete description of the lifespans of a population. But it's never given to us; we need to estimate it using the data at hand.
Kaplan-Meier estimate
IMO, the best way to estimate the survival function is using the Kaplan-Meier estimate. It's nonparametric, which means we don't assume the data follows any particular form:
$$\hat{S}(t) = \prod_{i=0}^{t} \left(1 - \frac{d_i}{n_i}\right), \;\; \text{for all } t$$
where $d_i$ is the number of deaths at time $i$, and $n_i$ is the number of individuals in the population who are at risk of dying at time $i$. Note that the above formula is for a specific $t$: if we compute this estimate over all $t$, then we get a curve - we'll see this later.
This formula can be derived from the following logic:
$$P( T = 0 ) \approx \frac{d_0}{n_0}$$
$$ \Rightarrow P( T > 0 ) \approx \left(1 - \frac{d_0}{n_0} \right) $$
$$\begin{aligned} P( T > 1 ) &= P( T > 1 \;|\; T > 0 )\,P( T > 0 ) \\ &\approx \left(1 - \frac{d_1}{n_1}\right)\left(1 - \frac{d_0}{n_0}\right) \end{aligned}$$
and so on...
How are censored individuals dealt with in the Kaplan-Meier estimate? They remain part of the denominator up to the time they are censored (as they are at risk of dying until then), but they never count in the numerator (as technically they don't die).
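The product formula and the treatment of censored individuals can be sketched in a few lines of plain Python. This is an illustrative sketch, not how lifelines implements it: `kaplan_meier` is a hypothetical helper and the data are invented. Time points with no deaths contribute a factor of 1, so the loop only visits times at which an event was observed.

```python
def kaplan_meier(durations, observed):
    """Return a list of (time, survival) pairs, one per distinct event time.

    durations: how long each subject was followed.
    observed:  1 if the event (churn) was seen at that time, 0 if censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # n_i: subjects still being followed at time t
        # (censored subjects count here until their own censoring time).
        at_risk = sum(1 for d in durations if d >= t)
        # d_i: events observed exactly at time t (censored subjects never count).
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data (invented): five subscribers, two of them censored.
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 4))
```

In this toy data the subscriber censored at time 3 is still in the denominator at time 3 but has dropped out of it by time 5, which is exactly the behaviour described above.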
Estimating the Survival Curve
I am inspired here by Cam Davidson-Pilon and his work on lifelines. In the language of survival analysis, the event of interest (the 'death') is churn, recorded in the 'Churn?' column; customers who have not churned by the end of the observation period are censored. Churn can be defined in different ways for different industries, but let us assume we have a well-defined definition. For telecommunications data it is probably 'stopped paying for the service'. In other industries, such as cloud services, it might be harder, but you can define churn as, say, 'has not used the service for 30 days'.
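As a rough sketch of how an inactivity-based definition could be turned into the two columns survival analysis needs (a duration and an event flag), consider the following. The function name, the dates, and the exact rule are assumptions made for illustration; the churn data set used in this post already comes with these columns.

```python
from datetime import date

INACTIVITY_DAYS = 30  # assumed churn threshold, following the 30-day example


def duration_and_event(signup, last_active, observed_on):
    """Return (duration_in_days, churned) for one customer.

    A customer counts as churned if they have been inactive for at least
    INACTIVITY_DAYS on the observation date. Otherwise they are censored
    and their duration runs up to the observation date.
    """
    idle_days = (observed_on - last_active).days
    if idle_days >= INACTIVITY_DAYS:
        return (last_active - signup).days, 1
    return (observed_on - signup).days, 0


today = date(2024, 6, 30)  # hypothetical observation date
print(duration_and_event(date(2024, 1, 1), date(2024, 3, 1), today))   # churned
print(duration_and_event(date(2024, 1, 1), date(2024, 6, 20), today))  # censored
```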
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
T = churn_df["Account Length"]
C = churn_df["Churn?"]
kmf.fit(T, event_observed=C )
C
kmf.survival_function_
So we can say something like: after 200 days about 50% of our subscribers have churned, and after 243 days about 75% have churned. I don't have any experience with the telecoms industry, but this is consistent with my preconceptions.
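Statements like these amount to finding the first time at which the survival estimate drops to a given level (0.5 for "50% have churned", 0.25 for "75% have churned"). A minimal sketch, using a made-up curve rather than the one fitted above; `time_to_reach` is a hypothetical helper:

```python
def time_to_reach(curve, level):
    """First time at which the survival estimate is at or below `level`.

    curve: list of (time, survival) pairs sorted by time.
    Returns None if the curve never drops that far (e.g. heavy censoring).
    """
    for t, s in curve:
        if s <= level:
            return t
    return None


# Hypothetical survival curve, not taken from the churn data.
curve = [(50, 0.95), (100, 0.80), (150, 0.60), (200, 0.45), (250, 0.20)]

print(time_to_reach(curve, 0.50))  # time by which half have churned (the median)
print(time_to_reach(curve, 0.25))  # time by which 75% have churned
print(time_to_reach(curve, 0.10))  # never reached in this toy curve
```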
Plots
The next interesting thing that the survival analysis library can do is plot the survival curve, so we'll do that.
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
kmf.survival_function_.plot()
plt.title('Survival function of Telecommunication Customers');
kmf.plot()
Are New Yorkers fickle?
As our final piece of analysis, we can ask the question: 'do customers in different states have different churn behaviour?'. This could be used as part of a customer sales strategy!
ax = plt.subplot(111)
NJ = (churn_df["State"] == "NJ")
kmf.fit(T[NJ], event_observed=C[NJ], label="New Jersey")
kmf.plot(ax=ax, ci_force_lines=True)
KS = (churn_df["State"] == "KS")
kmf.fit(T[KS], event_observed=C[KS], label="Kansas")
kmf.plot(ax=ax, ci_force_lines=True)
NY = (churn_df["State"] == "NY")
kmf.fit(T[NY], event_observed=C[NY], label="New York")
kmf.plot(ax=ax, ci_force_lines=True)
plt.ylim(0,1);
plt.title("Lifespans of customers in different states");
kmf.median_survival_time_  # called median_ in older lifelines releases
Note that `kmf` was last fitted on the New York subset, so this median of 168 days describes New York customers rather than the whole data set. It seems realistic.