- Published on
Modeling Customer Retention and Churn in SQL
- Authors
- Name
- David Krevitt
- Link
- Data App Enthusiast
- Get in Touch
Welcome to Cacheworthy
This post is a vestige of this blog's previous life as CIFL (2015 - 2023), which covered building analytics pipelines for internal reporting. When migrating to Cacheworthy and a focus on data applications, we found this post too helpful to remove - hope you enjoy it!
Curious about data applications? Meet Cacheworthy.Retention and its sister metric, churn, seem tantalizing simple to calculate:
Retention = Returning customers this period / Customers last period
Churn = 100% - Retention
Easy right?
But when it comes time to roll up your retention + churn rates each monthend, you quickly realize that you have a problem:
Customers who didn't repeat purchase this month, by definition have no data.
Welcome to the core problem of calculating retention + churn...you must first generate data to fill the gaps of what each customer of your business didn't do in a given period.
Allow me to show you 3-step method for filling these gaps in SQL. This methodology is relevant for both calcualting retention for a SaaS business, and rebuy rates for Ecommerce brands.
This method demonstrates weekly retention rates / churn (useful for 'weekly active' retention in a SaaS product), but can easily adapted to monthly, quarterly or annual retention rates.
The code snippets below are written in BigQuery standard SQL, but will work with slight modification in any flavor of SQL.
- Step 0: Generate a Date Range
- Step 1: Generate Each Customer's Individual Time Series
- Step 2: Calculate Lifetime + Previous Period Behavior
- Step 3: Translate Behavior into Retention or Churn
- Putting it All Together
If you prefer video, here's a 10-minute walkthrough of the topic - code snippets mentioned in the video are listed down the page.
Step 0: Generate a Date Range
To fill in the gaps in each customer's history, first we need a time series to work from.
In BigQuery, we use the GENERATE_DATE_ARRAY
function - this statement will generate a series of dates from 1/1/2020 until the current date.
GENERATE_DATE_ARRAY(DATE('2020-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)
Taking this a step further, you can use this array to generate an array of week start dates (Sundays or Mondays) or month start / end dates.
These snippets generate these weekly / monthly date series - you can modify the hardcoded start date (currently '2019-01-01') to whatever your business requires:
Generate Week Start Date Array
SELECT
date_in_range,
day_number,
week_number,
min(date_in_range) over (partition by week_number) week_start,
max(date_in_range) over (partition by week_number) week_end
FROM (
SELECT
date_in_range,
date_diff(date_in_range, cast('2019-01-01' as date), DAY)+1 as day_number,
cast(trunc(date_diff(date_in_range, cast('2019-01-01' as date), DAY)/7)+1 as int64) as week_number
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2019-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date_in_range
)
Generate Monthend Date Array
SELECT
date_in_range,
date_in_range_bom,
date_in_range_bom_mom,
date_in_range_yoy
FROM
(
SELECT
date_in_range,
date_in_range_bom,
date_sub(date_in_range_bom, INTERVAL 1 MONTH) date_in_range_bom_mom,
date_sub(date_in_range, INTERVAL 1 YEAR) date_in_range_yoy,
first_value(date_in_range) over (partition by yyyymm order by date_in_range desc) monthend_date_in_range
FROM
(
SELECT
date_in_range,
date_trunc( date_in_range, MONTH) date_in_range_bom
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2020-01-31'), date_add(CURRENT_DATE(), INTERVAL 31 DAY), INTERVAL 1 DAY)
) AS date_in_range
)
)
WHERE date_in_range = monthend_date_in_range
Step 1: Generate Each Customer's Individual Time Series
Now that we've bootstrapped the date series that we want to fill in for each customer, the next step is to join them together with a straight JOIN.
This will generate a series of dates for each customer - you can calculate which transactions belong to each period with a simple CASE statement:
with date_spine as (
SELECT * FROM the_date_array_you_just_created
),
transactions as (
SELECT
user_id,
transaction_date,
transactions_created,
FROM your_transaction_table
)
SELECT
user_id,
week_start,
week_end,
week_number,
sum(CASE WHEN date >= week_start AND transaction_date <= week_end THEN transactions_created ELSE 0 END) as transactions_created
FROM date_spine
JOIN transactions
GROUP BY user_id, week_start, week_end, week_number
The CASE statement is really the key there - it only counts the customer's transactions if they belong to the relevant week (or month, quarter, year, etc) period.
There are many many ways to perform this join, but the important thing is to have date ranges on the left, so that each user_id has an entry for each relevant time period. We're calculating retention at the user
level here, but this can be whatever your appropriate entity is (user, organization, account, etc).
You could get fancy and restrict the join at this level to only dates following their first transaction, rather than later on - implementation is up to you! Recommend reading through this entire approach first, and then adapting it to your situation.
Step 2: Calculate Lifetime + Previous Period Behavior
To calculate retention, we need to know three things:
- Is this a buyer's first active period?
- Did they fulfill the desired behavior (buying, logging in, maintaining a subscription, etc) in the prior period?
- Did they "" in the current period?
We already know what they did in the current period (from the previous step of the query).
Calculating Lifetime Buyer Behavior
To answer if this is their first active period, we need to know their lifetime count of transactions. If this period's total = their lifetime total, then generally they'll be a new customer in this period - in which case we'll want to separate them out of the retention calculation.
We can calculate this by using a window (aka analytic) function, which allows us to quickly map out each buyer's lifetime behavior:
sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime
Notice the PARTITION BY parameter there is strictly user_id - this is analogous to GROUP BY in a regular query.
Calculating Previous Period Buyer Behavior
We can use a similar window function to pluck out that customer's previous period behavior:
lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev
The LAG and LEAD functions will allow you to iterate over the specified partition. This window generates a series of each user's behavior, ordered in ascending order.
So if we're lagging from the current row, we'll pull in their previous week's transactions.
Put it all together, and the query looks like so:
SELECT
user_id,
week_start,
week_end,
week_number,
transactions_created,
sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime,
lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev
FROM ( {{ step 1 query }} )
We can use lag
and lead
here, because we've already constructed a full time series for each user.
In the 3rd (and final) step of this query, we'll map out how this period / last period / lifetime behavior translates into retention rates.
Step 3: Translate Behavior into Retention or Churn
Now that we've got the table set, it's time to actually dig into calculating retention rates.
We do this using four CASE statements, to flag a buyer as either one of four buckets:
- New: this period = lifetime
- Churned: this period = 0, previous period > 0
- Retained: this period > 0, previous period > 0
- Reactivated: this period > 0, previous period = 0, this period < lifetime
Those status flags map out into CASE statements in a query like so:
SELECT
user_id,
week_start,
week_end,
week_number,
CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag,
CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag
FROM {{ step 2 query }}
You could combine these 4 CASE statements into one 'retention_status' CASE statement, but we prefer separating them out as individual flag fields.
This allows us to use simple calculated fields in your reporting tool of choice to calculate retention rates:
sum(retentions_flag) / (sum(retentions_flag) + sum(churns_flag))
You could also include reactivations in the denominator there - if more people reactivated vs churned, you could have net retention greater than 100% for the period.
And to the contrary, for our sister metric churn:
sum(churns_flag) / (sum(retentions_flag) + sum(churns_flag))
Separating out each user into one of these four buckets (new, retained, churned, reactivated) also makes generating cohort analysis charts very simple.
Putting it All Together
If you're ready to dive into a full retention rate query, here you go! This could easily be adapted to run with CTEs (common table expressions) rather than nested SQL, if that's your bag:
SELECT
user_id,
week_start,
week_end,
week_number,
CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag,
CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag
FROM (
SELECT
user_id,
week_start,
week_end,
week_number,
transactions_created,
sum(transactions_created) over weekly as transactions_created_lifetime,
lag(transactions_created) over weekly as transactions_created_prev
FROM (
SELECT
user_id,
week_start,
week_end,
week_number,
sum(CASE WHEN date >= week_start AND date <= week_end THEN transactions_created ELSE 0 END) as transactions_created,
FROM date_spine
JOIN transactions
ON date_spine.site = transactions.site
GROUP BY user_id,
week_start,
week_end,
week_number
)
WINDOW weekly as (PARTITION BY user_id ORDER BY week_start asc)
)
To sum it up, we calculate retention by writing a query in 3 layers:
- Generate a date series for each user
- Use window functions to calculate lifetime and previous period behavior
- Use CASE statements to translate this period / previous period / lifetime behavior into a retention status for that period
To modify these snippets for any time period, just swap out the weekly variables (week_start, week_end, etc) with the columns present in your monthly, quarterly or annual date date spine from Step 0 above.