The hidden risk of using CTEs
More readable code, no strings attached?
Common table expressions (or CTEs for short) represent the result set of a query that can only live within the context of another query. They are often compared to materialized views and temporary tables; although these are all different SQL constructs, it just so happens that they can be used in similar contexts.
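For reference, a minimal CTE looks like this (the table and column names are made up for illustration):

with recent_orders as (
    select order_id, customer_id
    from orders
    where order_date >= '2023-01-01'
)
select
    customer_id,
    count(*) as order_count
from recent_orders
group by customer_id;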
The major benefit of CTEs is that they make SQL code more readable. In a world where DBT is a core technology, CTEs are more apparent than ever. Just look around the DBT docs and see how often CTEs are used. They really do fit the DBT model well. Oftentimes, refactoring a query into one that leverages CTEs is an intermediate step before identifying staging and intermediate models for fact tables.
They also have some useful features — for instance, you can make them recursive (something that one cannot do with a subquery). But we won’t go down that route in this post.
A database take
Before we get into DBT, let's step back and explore CTEs as we know them from the relational database world. We can think of them as optimization constructs. Imagine a query that references the same subquery multiple times. It is more efficient to calculate the result once, cache it and reuse it within the query. Sure, non-determinism, transactions and other complications can get in our way, but that's a reasonable general mental model. PostgreSQL, for instance, materializes CTEs that are referenced more than once (and, since version 12, inlines single-reference ones by default). This is especially useful with extensions like FDW, where the CTE can be calculated fully on the remote Postgres server.
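Since PostgreSQL 12 you can also steer this behaviour per CTE with the MATERIALIZED keyword; a minimal sketch, assuming a hypothetical payments table:

with customer_totals as materialized (
    -- forced to be computed once and cached, even with a single reference
    select customer_id, sum(amount) as total
    from payments
    group by customer_id
)
select * from customer_totals where total > 100;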
SQL Server, on the other hand, simply inlines them before query planning. That is not necessarily a bad strategy either; a modern query optimizer should still be able to select an optimal physical execution plan.
In general, their main purpose from a coder's perspective is to simplify queries and deduplicate logic. Selecting candidates for CTEs, materialized views and temporary tables, and naming them correctly, is more of an art than a science. With a lot of random CTEs it is difficult to keep track of column transformations, tables being joined and so on. A good rule of thumb is to start from a fact table and see which dimension tables you need in that query. It so happens that the same fact table may be present in many joins within a lengthy query.
Keep in mind that CTEs cannot be indexed. They can still use the indexes of the underlying tables, and you can be smart about that. If you really need an index, consider materializing the result into a table.
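That workaround looks something like this; a minimal sketch with made-up names (Postgres syntax):

-- materialize the would-be CTE into a temporary table...
create temporary table recent_orders as
select order_id, customer_id
from orders
where order_date >= '2023-01-01';

-- ...which, unlike a CTE, can carry its own index
create index idx_recent_orders_customer on recent_orders (customer_id);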
With data warehouses, we no longer define indexes. We leverage MPP capabilities and process huge amounts of data that oftentimes have to be partitioned optimally so that queries don't take ages to run. As a Data Engineer it is your job to know the warehouse you use and its capabilities (like clustering keys in Snowflake) to build performant and robust systems. We rely on the query planner, compute power and local data caching strategies to come up with an optimal execution plan and carry it out. Warehouses usually use a hybrid strategy, materializing and inlining CTEs as they see fit.
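For instance, defining a clustering key in Snowflake is a one-liner (the orders table and column are placeholders):

ALTER TABLE orders CLUSTER BY (order_date);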
An innocent example
Some things may not be optimized away completely though. Let's go to the DBT docs and look at one of the examples. I'll copy it over for you:
with import_orders as (
    select * from {{ ref('orders') }}
),

aggregate_orders as (
    select
        customer_id,
        count(order_id) as count_orders
    from import_orders
    where status not in ('returned', 'return pending')
    group by 1
),

segment_users as (
    select
        *,
        case
            when count_orders >= 3 then 'super_buyer'
            when count_orders < 3 and count_orders >= 2 then 'regular_buyer'
            else 'single_buyer'
        end as buyer_type
    from aggregate_orders
)

select * from segment_users
I really like the SQL decomposition used in the DBT examples. Sure, they are well thought out, the CTEs have descriptive names and they usually follow a linear flow (something that might not be that common in real 2000+ line SQL scripts that evolved over the years). Lengthy SQL with a lot of intertwined CTEs is hard to follow, but having well-named intermediate models can help you tackle this. In general, we try to follow certain rules while structuring queries in DBT:
- Declare import statements like import_orders. Note that there is no ref statement anywhere else in the query; by opening a SQL file you can immediately tell what the direct predecessors in the lineage graph are.
- Build local intermediate models based off the imported models.
- Create a final model using the intermediate CTEs, and end the query with select * from final (see the skeleton below).
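Put together, a model following these rules has roughly this shape (a sketch; the model and column names are placeholders):

with import_a as (
    select * from {{ ref('model_a') }}
),

import_b as (
    select * from {{ ref('model_b') }}
),

joined as (
    select
        a.id,
        a.value,
        b.category
    from import_a as a
    join import_b as b on a.id = b.id
),

final as (
    select
        category,
        count(*) as value_count
    from joined
    group by 1
)

select * from final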
Let's reconstruct this example and run it against Snowflake:
CREATE OR REPLACE TEMPORARY TABLE numbers_table AS
SELECT SEQ4() AS num
FROM TABLE(GENERATOR(ROWCOUNT => 1000000000));
WITH numbers AS (
    SELECT * FROM numbers_table
),
num_0 AS (
    SELECT * FROM numbers WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;
In this simplistic example, we import a numbers_table, create a few intermediate models and combine them into a final model. Executing this code takes 12 seconds for me on an XS warehouse. Now, let's try inlining the imports!
WITH num_0 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers_table WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;
Surprise, surprise! This code executes in 6 seconds. One extra CTE and a 2x performance degradation. But why? The way the Snowflake optimizer tackles the first query is to gather the conditions from all the subqueries, combine them with the OR operator, apply the result to the imported table, and only then forward the rows to the intermediate CTEs. This essentially means that we need one extra scan over the data. And in case you are wondering whether the SELECT * FROM final pattern suffers from a similar problem, the answer is no :)
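To make that concrete, the first query is effectively executed as if it were written like this (a conceptual sketch of the plan shape, not actual Snowflake internals):

WITH numbers AS (
    -- the union of all downstream predicates is applied here first,
    -- which costs one extra scan over numbers_table
    SELECT * FROM numbers_table
    WHERE (num % 3 = 0 AND num % 10 = 0)
       OR (num % 3 = 1 AND num % 10 = 1)
       OR (num % 3 = 2 AND num % 10 = 2)
),
num_0 AS (
    SELECT * FROM numbers WHERE num % 3 = 0 AND num % 10 = 0
),
num_1 AS (
    SELECT * FROM numbers WHERE num % 3 = 1 AND num % 10 = 1
),
num_2 AS (
    SELECT * FROM numbers WHERE num % 3 = 2 AND num % 10 = 2
),
final AS (
    SELECT * FROM num_0
    UNION ALL
    SELECT * FROM num_1
    UNION ALL
    SELECT * FROM num_2
)
SELECT * FROM final;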
If you find that you are susceptible to this, the simplest fix is to inline the references to your imported models. Another option is to wait until the Snowflake team fixes the problem.
Conclusion
We saw the many benefits CTEs bring us and understood that their usage is at the core of DBT. We explored the world of relational databases, the purpose of CTEs and how different engines may handle them, including their limitations and unique features (recursion). Finally, we touched on the data warehouse landscape and showed that sometimes a slight improvement at the readability level may yield a significant performance degradation.