The hidden risk of using CTEs
More readable code, no strings attached?
Common table expressions (or CTEs for short) represent the result set of a query that can only live within the context of another query. They are often compared to materialized views and temporary tables; although these are all different SQL constructs, it just so happens that they can be used in similar contexts.
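For reference, a minimal CTE looks like this (the table and column names are made up for illustration):

with recent_orders as (
    select order_id, customer_id
    from orders
    where order_date >= '2023-01-01'
)
select
    customer_id,
    count(*) as order_count
from recent_orders
group by customer_id;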
The major benefit of CTEs is that they make SQL code more readable. In a world where DBT is a core technology, CTEs are more apparent than ever. Just look around the DBT docs and see how often CTEs are used. They really do fit the DBT model well. Oftentimes, refactoring a query into one that leverages CTEs is an intermediate step before identifying staging and intermediate models for fact tables.
They also have some useful features — for instance, you can make them recursive (something that one cannot do with a subquery). But we won’t go down that route in this post.
A database take
Before we get into DBT, let's step back and explore CTEs as we know them from the relational database world. We can think of them as optimization constructs. Imagine a query that references the same subquery multiple times. It is more efficient to calculate the result once, cache it and reuse it within the query. Sure, non-determinism, transactions and other complications can get in our way, but that's a reasonable general mental model. PostgreSQL, for instance, materializes CTEs that are referenced more than once (and, since version 12, inlines single-reference ones by default). This is especially useful with extensions like FDW, where the CTE can be calculated fully on the remote Postgres server.
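Since PostgreSQL 12 you can also steer this behaviour per CTE with the MATERIALIZED keyword; a minimal sketch, assuming a hypothetical payments table:

with customer_totals as materialized (
    -- forced to be computed once and cached, even with a single reference
    select customer_id, sum(amount) as total
    from payments
    group by customer_id
)
select * from customer_totals where total > 100;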
SQL Server, on the other hand, simply inlines them before query planning. That is not necessarily a bad strategy either; a modern query optimizer should still be able to select an optimal physical execution plan.
In general, their main purpose from a coder's perspective is to simplify queries and deduplicate logic. Selecting candidates for CTEs, materialized views and temporary tables, and naming them correctly, is more of an art than a science. With a lot of random CTEs it is difficult to keep track of column transformations, tables being joined and so on. A good rule of thumb is to start from a fact table and see which dimension tables you need in that query. It so happens that the same fact table may be present in many joins within a lengthy query.
Keep in mind that CTEs cannot be indexed. They can still use the indexes of the underlying tables, and you can be smart about that. If you really need an index, consider materializing the result into a table.
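That workaround looks something like this; a minimal sketch with made-up names (Postgres syntax):

-- materialize the would-be CTE into a temporary table...
create temporary table recent_orders as
select order_id, customer_id
from orders
where order_date >= '2023-01-01';

-- ...which, unlike a CTE, can carry its own index
create index idx_recent_orders_customer on recent_orders (customer_id);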
With data warehouses, we no longer define indexes. We leverage MPP capabilities and process huge amounts of data that oftentimes have to be partitioned optimally so that queries don't take ages to run. As a Data Engineer it is your job to know the warehouse you use and its capabilities (like clustering keys in Snowflake) to build performant and robust systems. We rely on the query planner, compute power and local data caching strategies to come up with an optimal execution plan and carry it out. Warehouses usually use a hybrid strategy, materializing and inlining CTEs as they see fit.
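For instance, defining a clustering key in Snowflake is a one-liner (the orders table and column are placeholders):

ALTER TABLE orders CLUSTER BY (order_date);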
An innocent example
Some things may not be optimized away completely though. Let's go to the DBT docs and look at one of the examples. I'll copy it over for you:
with import_orders as (
    select * from {{ ref('orders') }}
),

aggregate_orders as (
    select
        customer_id,
        count(order_id) as count_orders
    from import_orders
    where status not in ('returned', 'return pending')
    group by 1
),

segment_users as (
    select
        *,
        case
            when count_orders >= 3 then 'super_buyer'
            when count_orders < 3 and count_orders >= 2 then 'regular_buyer'
            else 'single_buyer'
        end as buyer_type
    from aggregate_orders
)

select * from segment_users
I really like the SQL decomposition used in the DBT examples. Sure, they are well thought out, the CTEs have descriptive names and they usually follow a linear flow (something that might not be that common in real 2000+ line SQL scripts that evolved over the years). Lengthy SQL with a lot of intertwined CTEs is hard to follow, but having well-named intermediate models can help you tackle this. In general, we try to follow certain rules while structuring queries in DBT:
- Declare import statements like import_orders. Note that there is no ref statement anywhere else in the query; by opening a SQL file you can immediately tell what the direct predecessors in the lineage graph are.
- Build local intermediate models based off the imported models.
- Create a final model using the intermediate CTEs, and end the query with select * from final (see the skeleton below).
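Put together, a model following these rules has roughly this shape (a sketch; the model and column names are placeholders):

with import_a as (
    select * from {{ ref('model_a') }}
),

import_b as (
    select * from {{ ref('model_b') }}
),

joined as (
    select
        a.id,
        a.value,
        b.category
    from import_a as a
    join import_b as b on a.id = b.id
),

final as (
    select
        category,
        count(*) as value_count
    from joined
    group by 1
)

select * from final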
Let's reconstruct this example and run it against Snowflake:
CREATE OR REPLACE TEMPORARY TABLE numbers_table AS
SELECT SEQ4() AS num
FROM TABLE(GENERATOR(ROWCOUNT => 1000000000));
WITH numbers AS (
    SELECT * FROM numbers_table
),
num_0 AS (
    SELECT * FROM numbers WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;
In this simplistic example, we import a numbers_table, create a few intermediate models and combine them into a final model. Executing this code takes 12 seconds for me on an XS warehouse. Now, let's try inlining the imports!
WITH num_0 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;
Surprise, surprise! This code executes in 6 seconds. One extra CTE and a 2x performance degradation. But why? The way the Snowflake optimizer tackles the first query is to gather the conditions from all the subqueries, combine them with the OR operator, apply the result to the imported table, and only then forward the rows to the intermediate CTEs. This essentially means that we need one extra scan over the data. And in case you are wondering whether the SELECT * FROM final pattern suffers from a similar problem, the answer is no :)
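To make that concrete, the first query is effectively executed as if it were written like this (a conceptual sketch of the plan shape, not actual Snowflake internals):

WITH numbers AS (
    -- the union of all downstream predicates is applied here first,
    -- which costs one extra scan over numbers_table
    SELECT * FROM numbers_table
    WHERE (num % 3 = 0 AND num % 10 = 0)
       OR (num % 3 = 1 AND num % 10 = 1)
       OR (num % 3 = 2 AND num % 10 = 2)
),
num_0 AS (
    SELECT * FROM numbers WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;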
If you find that you are susceptible to this, the simplest fix is to inline the references to your imported models. Another option is to wait until the Snowflake team fixes the problem.
Conclusion
We saw the many benefits CTEs bring us and understood that their usage is at the core of DBT. We explored the world of relational databases, the purpose of CTEs and how different engines may handle them, including their limitations and unique features (recursion). Finally, we touched on the data warehouse landscape and showed that sometimes a slight improvement at the readability level may yield a significant performance degradation.