Manage complexity in DBT projects
SQL code is just the beginning…
More and more companies are jumping on the DBT train. Just by looking at https://www.moderndatastack.xyz/company/dbt you can see that DBT is a real thing in today’s world. I am still in touch with lead engineers and CTOs of companies I consulted for in the past, and the majority of them love DBT. This has to mean something.
I am convinced that many more companies will use DBT in the near future (unless the recession turns out to be a real, real thing). But using something is not the end of the story. There are certainly a lot of benefits that come from using DBT, yet I started to wonder whether such a setup really scales, given the number of possible roadblocks along the way to building a robust and manageable data platform.
In this article, we are going to explore possible sources of complexity in DBT projects and try to propose solutions to them.
Who the f### wrote that code?
You see, I have worked with a lot of engineers. We spend a lot of time with technology, reading up on the latest trends, trying things out, and we still make silly mistakes. Even the 10x engineers do. We know the value of proper code testing: even though most test code is not ideal, does not test exactly the thing we want, and is not robust enough, we still advocate for it. If things fail miserably in production, we add regression tests.
With data engineers, you don’t have to try very hard to convince them that imposing a strong schema on your data is a good thing that will help you in the long run. Just as you don’t have to convince the majority of engineers that allowing remote code execution is a rather smelly idea.
And now we enter a world where engineers build platforms and we let other folks run their SQL freely. We chose SQL for a very valid reason: to give more people access to the platform. In the future (which may never come, but hey, the option is there) we may even introduce some hardcore optimizations to make the SQL code blazingly fast. After all, SQL is just a declarative language, it is well studied, and there is a lot of freedom in turning SQL code into actual machine code.
I don’t think we fully forgot, I think we just neglect the fact that writing any code comes with a lot of complexity. Projects grow over time. Software projects oftentimes get completely rewritten by experienced teams once the code becomes unmanageable. Good engineers are a scarce resource, and they often get bored or burned out. Some companies hire poor engineers on purpose, some are simply forced to.
DBT offers testing functionality, but if you are an engineer, do you think a YAML-based framework is robust enough? Of course, you can always write custom SQL to test your code, and you can use dbt-expectations and the like. Would your manager be happy to do that, too?
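For reference, this is roughly what the YAML-based testing looks like. A minimal sketch with illustrative model and column names, using two built-in tests plus one from the dbt-expectations package:

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_total
        tests:
          # from the dbt-expectations package
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```

This works well for column-level assertions; once you need to express real business invariants, you are back to writing custom SQL tests.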
How about unused code? Do you happily and confidently remove code from production? Or is it a mental burden? Think for a moment about the simplest solution to that issue, the one that comes to most people’s minds: just let the code sit there untouched (how do I know? Ask junior engineers what they would rather do).
There are many more things to consider: warehouse-specific optimizations (like Snowflake clustering keys), SQL optimizations, DBT best practices. It often looks easy in the eyes of an engineer, but be mindful that it isn’t easy for everyone.
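To give a taste of what I mean: with the dbt-snowflake adapter, a clustering key can be set straight from a model config. A minimal sketch (the model and column names are made up):

```sql
-- models/fct_events.sql -- illustrative model; cluster_by is a dbt-snowflake config
{{
  config(
    materialized = 'table',
    cluster_by = ['event_date']
  )
}}

select * from {{ ref('stg_events') }}
```

Knowing that the option exists is one thing; knowing when a clustering key actually helps (and when it just burns credits) is exactly the kind of judgment that doesn’t come for free.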
A solution to this? I see just one option: regular engineering audits of the code, for example in the form of code reviews. Run training sessions to raise awareness. Walk your Analytics Engineers, and whoever else writes DBT models, through your way of thinking about testing SQL and modularizing components. And lower your expectations, because they likely won’t care as much as you do. And if they do, get acquainted with them. Such people are gems, and they can bring awareness to other folks too.
Rising costs of cloud
Marketing materials often claim that we live in a great world where storage and compute are cheap. In many companies I worked with, there were instances of getting an unexpectedly large AWS bill due to human error. In the case of DBT and the tools often used alongside it, this doesn’t even have to be an accident.
Just consider a simple setup: Fivetran loading data from all the external systems your company uses, Snowflake as storage, and DBT to manage models. A pretty standard platform in 2022, I would say. We are about to hit the peak sales season in e-commerce. You are working for a data-driven company; they just moved their stack to the cloud and are excited about the upcoming days. Your marketers just sent out a campaign to millions of customers, and it generated a billion additional events in your system over a span of a few days. Suddenly you realize that
- your Fivetran bill skyrocketed, because they bill per row, and some imports arrive in an over-normalized form (Shopify, for example). Your Snowflake / AWS bills rose too,
- many models in your DBT project are not even used, yet they consume resources. You could recompute daily workflows in less than 24 hours, so who cared? Now you have stopped respecting your SLAs.
Your management is no longer as happy (systems can be expensive but they also have to be predictably expensive). Jeff is happy for sure.
The solution here is to be more explicit about your budget. Of course, costs are not easy to predict, but you should at least think about them. There is no doubt that such a campaign would raise costs; just be sure not to underestimate by how much. It should also make you ask whether you REALLY need to centralize all data within the company. After all, that might not be worth the effort, complexity and costs.
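If you are on Snowflake, one way to make the budget explicit is a resource monitor that warns you, and ultimately suspends warehouses, once a credit quota is spent. A sketch with made-up names and numbers (creating monitors requires the ACCOUNTADMIN role):

```sql
-- The monitor name, quota and warehouse name are illustrative.
create or replace resource monitor analytics_monthly_quota
  with credit_quota = 500
  frequency = monthly
  start_timestamp = immediately
  triggers
    on 75 percent do notify     -- early warning to account admins
    on 100 percent do suspend;  -- stop new queries once the quota is spent

alter warehouse transforming_wh set resource_monitor = analytics_monthly_quota;
```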
It takes more and more time to rebuild models…
On a similar note, as you add more and more models, they naturally take more time to compute. But what if something unexpected happens? A huge traffic spike. A DDoS attack. A bank run. A pandemic. dbt run could get pretty lengthy… would you be able to debug it easily?
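One mitigation, at least for the sheer runtime: node selection lets you avoid rebuilding the whole project after a failure. The model name and the artifacts path below are placeholders:

```bash
# Rerun a single failed model and everything downstream of it
dbt run --select my_failed_model+

# Or, comparing against artifacts from the last production run,
# rebuild only the models that actually changed
dbt run --select state:modified+ --state path/to/prod/artifacts
```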
There are some great visualizations, lineage graphs that are bright and colorful. But are they really that useful when you are under time pressure to get things fixed? DBT Cloud, Airflow or whatever orchestrator you use will show you exactly which SQL is problematic. But what comes next? How do you debug it quickly?
Fortunately we are working with SQL, so we put our SQL debugging hat on and start digging. Query profiling and operator statistics can reveal problems like exploding joins, inefficient pruning or a too-small warehouse. Still, such a query might take hours to run. A useful technique is to always have macros in place that reduce the amount of data processed against the dev target, as explained in the docs. Even so, I don’t think that bringing the platform back to a healthy state will end up being the burden of the people who created the problem in the first place…
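The pattern from the docs is simple: guard the expensive part of the model with a check on the current target, so a dev run only scans a recent slice of the data. Source and column names here are illustrative:

```sql
-- models/stg_events.sql
select * from {{ source('events', 'raw_events') }}

{% if target.name == 'dev' %}
-- in dev, only process the last 3 days of data
where created_at >= dateadd('day', -3, current_date)
{% endif %}
```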
Who is maintaining all of this?
After a model is productionized, a support team takes over and watches the model-building process. I will not get into whether existing monitoring solutions are good enough. But what should a poor support engineer do if the model is failing? For code, we usually produce runbooks, configuration flags, documentation.
There will likely be no runbook for a query. It will also likely not be configurable in any useful way. Fortunately, DBT encourages documenting things. There is a problem with any sort of documentation, though: it becomes stale really quickly. “Code is the best form of documentation” is unfortunately often painfully true.
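Keeping descriptions right next to the model definition at least gives them a fighting chance of moving together with the code. A minimal sketch (names and wording are illustrative):

```yaml
# models/schema.yml
version: 2

models:
  - name: fct_orders
    description: >
      One row per completed order. Rebuilt nightly.
      If this model fails, check the upstream stg_orders model first.
    columns:
      - name: order_id
        description: Primary key, unique per order.
```

dbt docs generate then renders these descriptions into the browsable documentation site, lineage graph included.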
Following general good practices when writing queries (there are a lot of articles from the DBT team, including the documentation), simply making SQL easier to understand, and close collaboration between teams can all make this process less burdensome.
The hidden complexity of EL
Finally, as the number of integrations grows, we start to realize that there is a hidden cost to EL. In some sense, I like to think that there is no such thing as ELT. We used to do ETL and now we do ETLT, but we have little control over the first T, and this can backfire.
Imagine that you were using snapshot tables, a common feature of DBT. Snapshots can be used to go back in time: these tables capture changes in slowly changing dimensions. You have to be aware that they have limited support for schema migrations. Can you spot the risk? You snapshot a table that is created by an EL process you have no control over. Whenever the schema changes in that process, your snapshots may no longer be valid.
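One small mitigation is to list the snapshot’s columns explicitly instead of using select *: an upstream schema change then fails loudly instead of silently reshaping your history. A sketch with illustrative names:

```sql
-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- explicit column list: a dropped or renamed upstream column
-- breaks the run instead of quietly corrupting the snapshot
select
    id,
    status,
    amount,
    updated_at
from {{ source('shop', 'orders') }}

{% endsnapshot %}
```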
This is not a big deal if you are working with a handful of sources, but as their number grows, it can become painful.
If you hit a wall in this area, maybe you should consider building a custom loading tool that actually enforces a fixed schema on your data? Using existing SaaS products is often fun and quick; it just might not be the best design choice after all.
Final words
With SQL, I think we are retracing a route we already took once: companies chose Python over more sophisticated languages to broaden their hiring horizons. After all, Python is easy for engineers to pick up, and it is also used by non-engineers. We weren’t far from letting non-engineers build backend systems. And we have to pay the price now. There is a lot of terrible code out there, and this doesn’t help with managing complexity either. I claim that we are heading down a similar route with the Modern Data Stack if we are not careful enough.
Apart from code complexity, there are a lot of other factors to consider; some are similar to problems from the coding world, some are brand new. And man, I haven’t even considered how the whole AI / Machine Learning story fits into this picture. After all, we usually prepare data for reporting or model training, and that forms just another layer of complexity on top, with its own set of problems… I wish you good luck in exploring this space further and building magnificent data platforms!