- Published on
Advanced Shopify Analytics in your Data Warehouse
- Authors
- Name
- David Krevitt
- Link
- Data App Enthusiast
- Get in Touch
Welcome to Cacheworthy
This post is a vestige of this blog's previous life as CIFL (2015 - 2023), which covered building analytics pipelines for internal reporting. When migrating to Cacheworthy and a focus on data applications, we found this post too helpful to remove - hope you enjoy it!
Curious about data applications? Meet Cacheworthy.We love Shopify. But let's be honest, the standard reports are mediocre.
You can try exporting the data for deeper analysis, but if your brand is doing any sort of volume you'll need a more powerful solution than a spreadsheet or notebook.
That's where cloud data warehouses come in. Back when we were doing data consulting work, we were partial to Google Bigquery (because of its tight integration with the Google marketing stack), but warehouses provided by AWS (Redshift), Databricks, Snowflake, et al are also great.
Let's dive in!
- Our Top 5 Shopify Analysis Recipes
- One Big Table for Shopify Analytics
- 1. Cohort analysis
- 2. Buyer Segmentation (RFM Analysis)
- 3. Sales + Margin Forecasting
- 4. Basket Analysis
- 5. Customer Retention & Churn Modeling
- Standing up your BigQuery Warehouse
Our Top 5 Shopify Analysis Recipes
In the past, we helped clients setup dozens of custom reports in BigQuery - this post will detail the 5 of the most powerful Ecommerce analyses out there:
- Cohort analysis
- Buyer segmentation (aka RFM analysis)
- Sales + Margin Forecasting
- Basket analysis
- Retention & churn modeling
Before we dive into how you can run each of these in BigQuery, let's lay the foundation.
We like to use a "One Big Table" data modeling methodology to generate all of these analyses - that way, you can build a single foundational table, and stack the rest of your analytics on top of it.
One Big Table for Shopify Analytics
Say a buyer orders a pair of socks and a t-shirt.
Answering a few questions about that order tells you everything you need to know:
- What order sequence is this for the buyer (1, 2, 3...)?
- When was their first order?
- Have they bought these products before, or is this the first time?
- What marketing channel would you attribute this order to?
In SQL, you can tag each of your orders with these attributes, so that your base orders table contains all of the underlying info you need to perform any analysis.
We recommend building these types of analysis using the dbt data modeling framework, but any SQL authoring tool will do.
Let's dive into how the One Big Table translates into each of the 5 analysis recipes mentioned above:
1. Cohort analysis
Cohort analysis is just a fancy term for a pivot table, where you pivot based on a buyer's first order month (their "cohort").
To facilitate this, we tag each order with:
- The buyer's first purchase month
- Whether that buyer is New vs Existing
- The marketing channel that drove their first purchase, and this purchase
Having all of those attributes at your fingertips makes building cohort analysis charts very straightforward. There are a couple key concepts that'll allow you to get this done:
a. Joining Google Analytics (or other attribution data)
It's critical to know what marketing channel acquired this buyer, because this may result in wildly different lifetime behavior.
Buyers who first purchase from organic search may be stickier than those finding your store in Google shopping ads, or vice versa.
This lifetime behavior dynamic (aka cohort analysis) is key to knowing how your team's efforts are paying off, and what areas need attention.
Thankfully, BigQuery (or any cloud warehouse) makes it easy to pipe Google Analytics attribution data into it, and join them together based on transactionid
(GA) and order_number
(Shopify).
If you're looking to dive in, pick up this open source dbt project on GitHub, which performs this join of Google Analytics and Shopify data in BigQuery.
Once you've got those two datasets joined up, you'll be ready to tag your orders.
b. BigQuery window functions
Window (aka analytic) functions in BigQuery are the core of the power pack. They allow you to quickly tag an individual row (an order), with attributes from the rest of your dataset.
You can think about them like running an unlimited number of GROUP BYs from within the same query.
We use them to tag an order with a handful of attributes:
- First order number:
first_value(order_number) over (PARTITION BY customer_id ORDER BY order_number asc) as first_order_number
- First order date:
first_value(created_at) over (PARTITION BY customer_id ORDER BY order_number asc) as first_order_date
- First order channel:
first_value(channel) over (PARTITION BY customer_id ORDER BY order_number asc) as first_order_channel
- First order platform:
first_value(platform) over (PARTITION BY customer_id ORDER BY order_number asc) as first_order_platform
Having these attributes accessible across your entire order history makes generating any cohort analysis pivot table or chart extremely straightforward.
If you're curious to go a bit deeper technically on these fields and why they're important, check out this walkthrough:
And if you're interested in diving deeper into window functions, check out our free BigQuery tutorial.
2. Buyer Segmentation (RFM Analysis)
Who are your best customers, and where did they come from? This is the fundamental question we seek to answer with buyer segmentation.
It helps to give these segments names, so that your team has a shorthand to work from when discussing customer personas:
- Whale: Purchases frequently, spends generously
- One-time: Purchases once and never again
- Loyal: Purchases consistently at a moderate spend level
Hat tip to Drew Sanocki and Michael Epstein (co-founders at PostPilot) for introducing us to these definitions.
To codify these segments, we calculate RFM for each buyer over a given period (usually annual):
- Recency: How recently (in days) did they purchase?
- Frequency: How many times in the year did they purchase?
- Monetary value: How much did they spend?
Once you've calculated RFM for each buyer, we generally define segments based on a combination of percentile distributions of RFM and hardcoded attributes:
- Whales are in the top 90th percentile of monetary value
- Loyal buyers are in the top 80th percentile of frequency
- New buyers have a frequency of 1 and recency of < 30 days
- One-time buyers have a frequency of 1
If you're looking to take a deeper dive into buyer segmentation, check out this open source dbt project on GitHub that contains buyer segmentation SQL models.
3. Sales + Margin Forecasting
Forecasting doesn't require a machine learning algorithm for it to be useful - in fact, using a black box algo for forecasting can actually be more harmful than helpful.
That's because your strategy changes over time, in terms of merchandising and marketing mix. It's important to have control to dial a forecast to these anticipated changes.
Before we forecast, we always join a few datasets together to calculate key forecast inputs:
- Shopify orders: To roll up average buyer AOV + LTV by channel
- Google Analytics: For backfilling conversion rates by channel
- Ad spend data: To calculate average CPA by channel
We recommend joining these datasets together in BigQuery, but pulling data back down into Sheets to forecast.
Calculating projected revenues + margins
Ultimately, your revenue equals traffic * conversion rate * AOV
.
You can further break it down into what we call marketing margin: traffic * conversion rate * ( AOV - CPA )
.
This will tell you how much space you have to increase ad spend, or go after more expensive paid channels.
So once we've calculated these on a backward-looking basis, how do we forecast forward? It really depends on the business, but generally we'll allow for dialing each of those inputs:
- Organic traffic growth rate
- Paid monthly spend
- Paid CPC
- Conversion rate
- AOV
Outside of organic traffic growth rate (which needs to be multiplied by your current base traffic), the rest of these are static variables - you can dial them to your expectations, and see how those dials impact your revenue + margin picture going forward.
Generally we'll wire up an 'Assumptions' box within a Sheets forecast (or whatever your reporting tool is), that displays suggested assumptions based on the last 3-12 months of history.
Forecasting is so specific to your business + toolbelt, so we haven't yet built a general-purpose template for it - but always feel free to reach out if you're curious to build a forecast.
4. Basket Analysis
You know those product recommendations on Amazon? People who bought this also bought...
There's no reason you can't do that same analysis for your own shop in SQL.
These basket recommendations are extremely powerful when deployed in cross-sell and upsell messaging - whether that's on your site, in drip email campaigns, or as display ads to custom audiences.
How do we perform market basket analysis? It's ultimately pretty simple - we join every order line item for a buyer to every other order line item. The rough SQL looks like so:
SELECT
a.customer_id,
a.order_number,
a.product_id,
a.product_name,
a.final_price,
b.product_id complement_product_id,
b.product_name complement_product_name,
b.final_price complement_final_price
FROM agg_transaction_line_item_type a
LEFT JOIN agg_transaction_line_item_type b
ON (
a.store = b.store AND
a.order_number = b.order_number AND
a.product_id != b.product_id
)
Once you have order line items joined together, you can pivot out product * complement product
, to get a sense of which products are purchased together.
This is also really helpful when done at the product type level, or at the buyer lifetime rather than individual basket level.
You can also calculate lift at the product combination level (essentially the additional $ spent on the combination of two or more products, versus them being purchased alone).
What we've found with lift calculations though, that the outcomes aren't always relevant - even if a niche combination of products had a relatively high lift on a dollar basis, the sample size may be so small that it's not actionable in real life.
Beware of small sample sizes when performing basket analysis.
5. Customer Retention & Churn Modeling
Shopify outputs a list of orders, out of which you'll need to derive whether a buyer churned or repeated.
The date window (monthly, quarterly, annually) over which you measure retention will depend, of course, on the products you sell.
Retention modeling is a bit of an odd duck versus these other metrics. It requires bootstrapping a monthly (or weekly / annual, whatever your window is) time series of each of your buyers' history.
Rather than just count up the months in which they had orders, we need to also fill in the gaps of where they didn't order (to count them as churn):
- Month 1: 2 orders for $97
- Month 2: 0 orders
- Month 3: 0 orders
- Month 4: 1 orders for $45
The long and short of how this is done - join your orders data to a date series (we use GENERATE_DATE_ARRAY
in BigQuery for this, see here in the example dbt project).
This allows you to tag each month for that buyer with a retention / churn status:
- Month 1: New
- Month 2: Churned
- Month 3: Dormant
- Month 4: Reactivated
Substitute 'month' for quarter / year there, based on the cadence you expect buyers to need your product (see this example model in the dbt project for an example of year-over-year retention).
Again, generating this type of time series is very difficult to do in Sheets, but in BigQuery can be boiled down to a few lines of SQL, if applied correctly. For a deeper tutorial, check out this Intro to Calculating Customer Retention in SQL post, which dives into the math above in detail.
Standing up your BigQuery Warehouse
If you're looking to fire up these analysis recipes for your own Shopify store(s), the example dbt project we put together is the best place to start.
That repo includes a link to copy a data pipeline project plan, which walks through each step required for setup (from ingesting Shopify data into your warehouse, through to building reports in Google / Looker Data Studio).
Enjoy!