We can see that some subscribers in this list have not churned -- censoring!
Before we dig into the data, I need to introduce the first mathematical creature in survival analysis: the survival function!
Survival function
Define a subscriber's lifetime, the time between when they first purchased the subscription and when they churned, as capital $T$. Let small $t$ represent the number of days since they first subscribed, that is, since their "birth". Then the survival function, $S(t)$, is defined as:
$$ S(t) = P(T > t ) $$
In words: what is the probability that a randomly chosen individual from the population lasts longer than small $t$?
The survival curve actually gives us a perfect description of the lifespans of a population. But it's never given to us; we need to estimate it using the data at hand.
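To make the idea of estimating $S(t)$ from data concrete, here is a minimal sketch in Python. The lifetimes are made-up numbers, and it assumes every subscriber has already churned (no censoring), so the estimate is simply the fraction of lifetimes longer than $t$. The name `empirical_survival` is my own, not from any library.

```python
def empirical_survival(lifetimes, t):
    """Fraction of lifetimes strictly greater than t, an estimate of P(T > t).

    Assumes every lifetime is fully observed (no censoring).
    """
    return sum(1 for T in lifetimes if T > t) / len(lifetimes)


# Hypothetical lifetimes in days, for illustration only.
lifetimes = [5, 12, 12, 30, 45, 60, 60, 90]

for t in (0, 12, 60, 90):
    print(t, empirical_survival(lifetimes, t))
```

This naive estimate only works when all lifetimes are fully observed. Once some subscribers are censored, we don't know their true $T$, and a different estimator is needed.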
Kaplan-Meier estimate
IMO, the best way to estimate the survival function is using the Kaplan-Meier estimate. It's nonparametric, which means we don't assume the lifetimes follow any particular distribution:
$$\hat{S}(t) = \prod_{i=0}^t \left(1 - \frac{d_i}{n_i}\right), \;\; \text{for all $t$}$$
where $d_i$ is the number of deaths at time $i$, and $n_i$ is the number of individuals in the population who are at risk of dying at time $i$. Note that the above formula is for a specific $t$: if we compute this estimate over all $t$, then we get a curve. We'll see this later.
This formula can be derived from the following logic:
$$P( T = 0 ) \approx \frac{d_0}{n_0}$$
$$ \Rightarrow P( T > 0 ) \approx \left(1 - \frac{d_0}{n_0} \right) $$
$$ P( T > 1 ) = P( T > 1 \;|\; T > 0 )P( T > 0 ) \\ \approx \left(1 - \frac{d_1}{n_1}\right)\left(1 - \frac{d_0}{n_0}\right)$$
and so on...
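For an arbitrary day $t \ge 1$, the same conditioning trick gives the general step (a sketch that only repeats the argument above, with no new assumptions):

$$ P( T > t ) = P( T > t \;|\; T > t-1 )\,P( T > t-1 ) \approx \left(1 - \frac{d_t}{n_t}\right)\prod_{i=0}^{t-1} \left(1 - \frac{d_i}{n_i}\right) = \prod_{i=0}^{t} \left(1 - \frac{d_i}{n_i}\right) $$

Unrolling this recursion down to $t = 0$ recovers the product in the Kaplan-Meier formula.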
How are censored individuals dealt with in the Kaplan-Meier estimate? They are still part of the denominator up to the time they are censored (as they are at risk of dying), but they don't count toward the numerator (as technically they don't die).
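To see how censored individuals enter the denominator but not the numerator, here is a hand-rolled sketch of the estimate in plain Python. The function name `kaplan_meier` and the data are hypothetical, and I assume the usual convention that someone censored at time $t$ is still counted as at risk at time $t$. Days with no deaths contribute a factor of 1 to the product, so the loop skips them.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S_hat(t))] at each time where at least one death was observed.

    durations: how long each individual was followed (days).
    observed:  True if the individual died (churned) at that time,
               False if they were censored at that time.
    """
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaving the risk set at t, dead or censored

    n_at_risk = len(durations)
    s_hat = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Censored individuals still at risk are in n_at_risk, but not in d.
            s_hat *= 1 - d / n_at_risk
            curve.append((t, s_hat))
        # Both deaths and censored individuals leave the risk set after time t.
        n_at_risk -= exits[t]
    return curve


# Hypothetical data: six subscribers, two of them censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, True]

for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

With no censoring at all, this reduces to the plain fraction of individuals still alive after each time.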
Estimating the Survival Curve
- I am inspired here by Cam Davidson-Pilon and his work on Lifelines.
In the language of survival analysis, we define the 'death' event to be when 'Churn?' is observed; subscribers who have not churned by the end of the data are censored.
Churn can be defined in different ways for different industries, but let us assume we have a well-defined definition. For telecommunications data, it is probably 'stopped paying for the service'. In other industries, for example cloud services, it might be harder to pin down, but you can then define churn as 'stopped using the service for 30 days'.
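As an illustration of the inactivity-based definition, here is a small sketch that turns raw dates into the two things a survival model needs: a duration and a flag for whether churn was observed. The function name, its arguments, and the dates are hypothetical. Treating the last activity date as the moment of churn is my assumption; you could just as well count the lifetime up to the end of the 30-day window.

```python
from datetime import date


def label_subscriber(signup, last_active, today, inactivity_days=30):
    """Return (duration_in_days, churned) for one subscriber.

    churned is True when the subscriber has been inactive for at least
    `inactivity_days`; otherwise they are still alive, i.e. censored.
    """
    if (today - last_active).days >= inactivity_days:
        # Churn observed: the lifetime runs from signup to the last activity.
        return (last_active - signup).days, True
    # Censored: we only know they lasted at least until today.
    return (today - signup).days, False


today = date(2024, 6, 30)
print(label_subscriber(date(2024, 1, 1), date(2024, 3, 1), today))   # churned
print(label_subscriber(date(2024, 1, 1), date(2024, 6, 20), today))  # censored
```

The first subscriber went quiet months ago, so their churn is observed. The second was active ten days ago, so all we know is a lower bound on their lifetime.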