Photo by janilson furtado on Unsplash

The hidden risk of using CTEs

More readable code, no strings attached?

Adrian Bednarz
5 min readSep 28, 2022

--

Common table expressions (or CTEs for short) represent result set of a query that can only live within a context of another query. They are often compared to materialized views or temporary tables — although they all are different SQL constructs, it just so happens that they all can be used in similar contexts.

Major benefit of CTEs allow SQL code to be more readable. In the world where DBT is a core technology, CTEs are more apparent than ever. Just look around the DBT docs and see how often CTEs are used. They really do fit into DBT model well. Oftentimes, refactoring query to a query that leverages CTEs is an intermediate step before identifying staging and intermediate models for fact tables.

They also have some useful features — for instance, you can make them recursive (something that one cannot do with a subquery). But we won’t go down that route in this post.

A database take

Before we get into DBT, let’s step back and explore the world of CTEs as we know it from relational database world. We can think of them as optimization constructs. Imagine a query that references the same subquery multiple times. It is more efficient to calculate the result once, cache it and reuse it within a query. Sure, non-determinism, transactions and other crap can get into our way but that’s a general mental model that one might have. For instance, in Postgres 12 this is exactly how (deterministic) CTEs behave (this is especially useful with extensions like FDW — where the CTE can be calculated fully on remove Postgres server).

On the other hand SQL Server would just inline them before query planning. This is not necessarily a bad strategy too — modern query optimizers should be able to select an optimal physical execution plan too.

In general, their main purpose from coder perspective is to simplify query and deduplicate logic. Selecting candidates for CTEs, materialized views, temporary tables and naming them correctly is more an art than a science. With a lot of random CTEs it is difficult to keep track of column transformations, tables being joined etc. — a good rule of a thumb is to start from a fact table and see what dimension tables do you need in that query. It so happens that the same fact table may be present in many joins within a lengthy query.

Keep in mind that CTEs cannot be indexed. They usually use the indexes from underlying tables — you can be smart about it. If you need an index, consider materializing tables.

With data warehouses, we no longer define indexes. We leverage MPP capabilities and process huge amounts of data — that oftentimes have to be partitioned optimally so that the queries don’t take ages to run. As a Data Engineer it is your job to know the warehouse you use and its capabilities (like clustering keys in Snowflake) to build performant and robust systems. We rely on query planner, compute power and local data caching strategies to come up with an optimal execution plan and conduct it. Warehouses usually use a hybrid strategy — materializng and inlining CTEs as they see fit.

An innocent example

Some things may not be optimized out completely though. Let’s go to the DBT docs and look at one of the examples. I’ll copy it over for you

I really like SQL decomposition used in DBT examples. Sure, they are well thought, CTEs have descriptive names and they usually follow linear flow (something that might not be that common in real 2000+ lines of SQL scripts that evolved over the years). Lengthy SQLs with a lot of intertwined CTEs are hard to follow but having well-named intermediate models is something that can help you tackle this. In general, we usually try to follow certain rules while structuring queries in DBT

  1. Declare import statements like import_orders — note that there is no ref statement anywhere else in the query. By opening SQL file you can immediately tell what are the direct predecessors in the lineage graph.
  2. Build local intermediate models based off imported models.
  3. Create a final model — using intermediate CTEs.
  4. SELECT * FROM final

Let’s reconstruct this example and run it against Snowflake

In this simplistic example, we import a numbers_table , create a few intermediate models and combine them into a final model. Execution of this code takes 12s for me on XS warehouse. Now, lets try inlining imports!

Surprise, surprise! This code executes in 6 seconds. One CTE and 2x performance degradation. But why? The way Snowflake optimizer tackles this query is to gather all conditions from all the subqueries, join them with OR operator and apply to imported table and then forward the elements to intermediate CTEs. This essentially means that we need one extra scan over the data. If you have doubts if SELECT * FROM final have similar problems then the answer is no :)

If you found out that you are susceptible to this, the simplest fix is to inline references to your imported models. Another option is to wait until the Snowflake team fixes the problem.

Conclusion

We saw the many benefits CTEs bring us and we understood that their usage is at the core of DBT. We explored the world of the relational databases, the purpose of CTEs and how different engines may handle them. Including the limitations and unique features (recursion). Finally we touched the data warehouse landscape and have shown that sometimes the slight improvements at readability level may yield significant performance degradation.

--

--

Adrian Bednarz
Adrian Bednarz

Written by Adrian Bednarz

Staff Data Engineer — AI / Big Data / ML

No responses yet