"Churn Rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It's a critical figure in many businesses, as it's often the case that acquiring new customers is a lot more costly than retaining existing ones (in some cases, 5 to 20 times more expensive).

Understanding what keeps customers engaged, therefore, is incredibly valuable, as it is a logical foundation from which to develop retention strategies and roll out operational practices aimed to keep customers from walking out the door. Consequently, there's growing interest among companies to develop better churn-detection techniques, leading many to look to data mining and machine learning for new and creative approaches.

Predicting churn is particularly important for businesses w/ subscription models such as cell phone, cable, or merchant credit card processing plans. But modeling churn has wide reaching applications in many domains. For example, casinos have used predictive models to predict ideal room conditions for keeping patrons at the blackjack table and when to reward unlucky gamblers with front row seats to Celine Dion. Similarly, airlines may offer first class upgrades to complaining customers. The list goes on.

This is a post about modeling customer churn using Python.

One of the motivations for modelling this is to consider what would happen in a telecoms setting.

from __future__ import division
import pandas as pd
import numpy as np

churn_df = pd.read_csv('data/churn.csv', )
col_names = churn_df.columns.tolist()

print("Column names:")
print(col_names)

to_show = col_names[:6] + col_names[-6:]

print("\nSample data:")
churn_df[to_show].head(6)

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Sample data:

Survival Analysis

One interesting way is to use Survival analysis. The basic idea of 'survival analysis' is to estimate using some fancy statistics the 'survival curve'. In the case of telecommunications this is when a subscriber will leave the service. In the case of HR - the question might be 'what characteristics do employees who leave our company have?'. There are many interesting applications of survival analysis

#Firstly let us look at the column headings.
churn_df.columns.tolist

<bound method Index.tolist of Index(['State', 'Account Length', 'Area Code', 'Phone', 'Int'l Plan', 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?'], dtype='object')>

#We noted above that Churn had string values. The lifelines library that we'll use doesn't like that and prefers 1 and 0.
# So we transform this.
d = {'True.':1, 'False.':0}
churn_df['Churn?'] = churn_df[['Churn?']].applymap(lambda x: d[x])

#Let us look again at the head of the data. 
churn_df.head()

We can see that some subscribers in this list have not churned -- censorship!

Before we dig in to the data, I first need to introduce the first mathematical creature in survival analysis: the survival function!

Survival function

Define an subsribers lifetime, defined as the time between when they first purchase the subsription and when they churned, as capital $T$. Let small $t$ represent number of days from when they were first subsribe that is, since there "birth". Then the survival function, $S(t)$, is defined as:

$$ S(t) = P(T > t ) $$ What is the probability that a randomly chosen individual from the population lasts longer than small t?

The survival curve actually gives as a perfect description of the lifespans of a population. But, its never given to us, we need to estimate it using the data at hand.

Kaplan-Meier estimate

IMO, the best way to estimate the survival function is using the Kaplan-Meier estimate. It's nonparametric, which means we don't assume the data follows any particular form:

$$\hat{S(t)} = \prod_{i=0}^t \left(1 - \frac{d_i}{n_i}\right), \;\; \text{for all $t$}$$

where $d_i$ are number of deaths at time $i$, and $n_i$ are the number of individuals in the population who are at risk of dieing. Note that the above formula is for a specific $t$: if we compute this estimate over all $t$, then we get a curve - we'll see this later.

This formula can be derived from the following logic:

$$P( T = 0 ) \approx \frac{d_0}{n_0}$$

$$ \Rightarrow P( T > 0 ) \approx \left(1 - \frac{d_0}{n_0} \right) $$

$$ P( T > 1 ) = P( T > 1 \;|\; T > 0 )P( T > 0 ) \\ \approx \left(1 - \frac{d_1}{n_1}\right)\left(1 - \frac{d_0}{n_0}\right)$$

and so on...

How are censored individuals dealt with in the Kaplan Meier estimate? They are still part of denominator (as they are at risk of dieing), but don't count into the numerator (as technically they don't die.)

Estimating the Survival Curve

I am inspired here by Cam Davidson-Pilon and his work on Lifelines. We define in the language of Survival analysis the 'censoring' event to be when 'Churn?' is observed. This can be defined in different ways for different industries. But let us assume we have a well-defined definition of Churn. Which in the case of Telcommunication data is probably 'stopping paying for the service'. In other industries like for example Cloud services - it might be harder, but then you can define Churn as 'to stop using a service for 30 days'.

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()

T = churn_df["Account Length"]
C = churn_df["Churn?"]
kmf.fit(T, event_observed=C )

<lifelines.KaplanMeierFitter: fitted with 3333 observations, 2850 censored>

C

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    0
13    0
14    0
...
3318    0
3319    0
3320    1
3321    0
3322    1
3323    1
3324    0
3325    0
3326    0
3327    0
3328    0
3329    0
3330    0
3331    0
3332    0
Name: Churn?, Length: 3333, dtype: int64

kmf.survival_function_

So we can say something like after 200 days about 50% of our distribution have churned. And after 243 days about 75% have churned. I don't have any experience with the telecoms industry but this would corrobrate my preconceptions.

Plots

The next interesting thing that the survival analysis library can do is to plot a graph. So we'll do that

kmf.survival_function_.plot()
from matplotlib import pyplot as plt
import seaborn as sns
plt.title('Survival function of Telecommunication Customers');

%matplotlib inline
kmf.plot()

<matplotlib.axes._subplots.AxesSubplot at 0xaf0d25ac>

Are New Yorkers fickle?

As our final piece of analysis. We can ask the question - 'do customers in different states have different churn behaviour?'. This could be used as part of a customer sales strategy!

ax = plt.subplot(111)

NJ = (churn_df["State"] == "NJ")
kmf.fit(T[NJ], event_observed=C[NJ], label="New Jersey")
kmf.plot(ax=ax, ci_force_lines=True)
KS = (churn_df["State"] == "KS")
kmf.fit(T[KS], event_observed=C[KS], label="Kansas")
NY = (churn_df["State"] == "NY")
kmf.fit(T[NY], event_observed=C[NY], label="New York")
kmf.plot(ax=ax, ci_force_lines=True)

plt.ylim(0,1);
plt.title("Lifespans of customers in different states");

kmf.median_

168.0

So we see that the median user stays 168 days which is realistic

	State	Account Length	Area Code	Phone	Int'l Plan	VMail Plan	Night Charge	Intl Mins	Intl Calls	Intl Charge	CustServ Calls	Churn?
0	KS	128	415	382-4657	no	yes	11.01	10.0	3	2.70	1	False.
1	OH	107	415	371-7191	no	yes	11.45	13.7	3	3.70	1	False.
2	NJ	137	415	358-1921	no	no	7.32	12.2	5	3.29	0	False.
3	OH	84	408	375-9999	yes	no	8.86	6.6	7	1.78	2	False.
4	OK	75	415	330-6626	yes	no	8.41	10.1	3	2.73	3	False.
5	AL	118	510	391-8027	yes	no	9.18	6.3	6	1.70	0	False.

	State	Account Length	Area Code	Phone	Int'l Plan	VMail Plan	VMail Message	Day Mins	Day Calls	Day Charge	...	Eve Calls	Eve Charge	Night Mins	Night Calls	Night Charge	Intl Mins	Intl Calls	Intl Charge	CustServ Calls
0	KS	128	415	382-4657	no	yes	25	265.1	110	45.07	...	99	16.78	244.7	91	11.01	10.0	3	2.70	1
1	OH	107	415	371-7191	no	yes	26	161.6	123	27.47	...	103	16.62	254.4	103	11.45	13.7	3	3.70	1
2	NJ	137	415	358-1921	no	no	0	243.4	114	41.38	...	110	10.30	162.6	104	7.32	12.2	5	3.29	0
3	OH	84	408	375-9999	yes	no	0	299.4	71	50.90	...	88	5.26	196.9	89	8.86	6.6	7	1.78	2
4	OK	75	415	330-6626	yes	no	0	166.7	113	28.34	...	122	12.61	186.9	121	8.41	10.1	3	2.73	3

	KM-estimate
timeline
0	1.000000
1	0.999700
2	0.999399
3	0.999399
4	0.999399
5	0.999399
6	0.999399
7	0.999399
8	0.999399
9	0.999399
10	0.999399
11	0.999399
12	0.999097
13	0.998794
15	0.998794
16	0.998490
17	0.997881
18	0.997881
19	0.997576
20	0.997576
21	0.997270
22	0.997270
23	0.996657
24	0.995736
25	0.995429
26	0.995429
27	0.995121
28	0.994812
29	0.994503
30	0.994503
...	...
184	0.555625
185	0.555625
186	0.555625
188	0.546365
189	0.546365
190	0.546365
191	0.546365
192	0.546365
193	0.533356
194	0.533356
195	0.533356
196	0.533356
197	0.516151
199	0.516151
200	0.516151
201	0.496299
202	0.496299
204	0.496299
205	0.496299
208	0.468727
209	0.441155
210	0.441155
212	0.404392
215	0.404392
217	0.404392
221	0.404392
224	0.336993
225	0.252745
232	0.252745
243	0.252745

blogroll

social