How would you test your data pipelines?
A look at the current state of testing in the data industry, exploring the need for a more robust approach to ensure reliability
DBT and modern data platforms are currently popular topics in the data industry (although some may find AI and ChatGPT more alluring). Personally, I wish people placed as much emphasis on testing and platform reliability as they do on exploring new technologies. As a software developer, I have worked on projects with very different testing methodologies, but tests were always a crucial part of the process.
When I began working with data teams, I was surprised to discover a completely different landscape: no source control and no testing. SQL was used for everything, and thousands of lines of SQL code per model were considered acceptable. On top of that, untested stored procedures were heavily used in every way possible. All of this was quite shocking to me.
Although DBT has brought some changes, I believe they are relatively minor. There are many other areas that companies tend to neglect, which require more attention, testing, and care. Let’s delve into why this might be the case.
The lack of frameworks
DBT comes with a built-in testing framework, but for serious data architects it may be too limited. You can write tests at the column or model level, but in my opinion these are not enough, even with plugins like dbt-expectations, because:
- A naive implementation runs assertions only against your production database, which does nothing to check the correctness of the model’s logic.
- These tests are usually run infrequently (e.g., every day or during model deployment), so if your data changes throughout the day, you are unaware of any issues.
- Data teams just starting with DBT may not realize how bad their data is when no quality or governance is in place. At the beginning of our work together, any test that fit the DBT testing model would simply fail. It would be useful to have tests that raise warnings rather than failures. DBT does support test severity, but many people are unaware of it and miss this potential solution (see the sketch after this list).
- I have not found a well-rounded testing approach for DBT that lets teams define tests, mock data, and maintain a separate set of live-data tests that run every few minutes, or with every change to the input data, to ensure the data stays sound.
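To make the severity point concrete, here is a minimal sketch of what a warning-level check can look like as a dbt singular test. The orders model, the column names, and the file name are made up for illustration.

```sql
-- tests/warn_orders_without_customer.sql (hypothetical singular test)
-- With severity set to 'warn', failing rows are reported but do not break the run,
-- which helps while a team is still cleaning up historical data.
{{ config(severity = 'warn') }}

select order_id
from {{ ref('orders') }}
where customer_id is null
```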
For these and many other reasons, the teams I have worked with usually end up building their own custom solutions to the testing problem. Some offload it to external vendors such as Talend. Building your own platform can be fun at the beginning, but people often fail to consider the long-term consequences: support, security, development, and patching. All of these usually require a team of developers, which is problematic for teams made up of data experts with no software engineers. Since many people change careers over time, you may well find someone with expertise in software, data, infrastructure, and even ML, but I would not bet on such a person staying with the company for long, or on the company easily finding a replacement. I therefore believe the best way to address this is to build a supportive testing community that works together on valuable plugins, which is precisely what DBT was envisioned to be.
The community is starting to fight back and propose solutions. Take a look at the method proposed in this article. A common piece of advice is to use a combination of macros and tags to mock out your real sources for testing purposes. Unfortunately, that approach has a major flaw: you can technically only run a single test case per model’s logic. You have just one set of mocked inputs and you expect a single output. You can argue that writing small models is beneficial for readability and maintainability, and that this approach forces you to write many small models. However, to properly test even a single CASE WHEN statement, you need at least two sets of inputs, which you cannot provide. And working at such a level of granularity would turn any SQL statement, even a not especially complicated one, into tens or hundreds of models.
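For reference, the macro-based approach usually looks roughly like the sketch below. The macro name, the unit_test target, and the seed naming convention are all illustrative rather than anything prescribed by DBT.

```sql
-- macros/source_or_mock.sql (illustrative helper macro)
{% macro source_or_mock(source_name, table_name) %}
    {% if target.name == 'unit_test' %}
        {# In the test target, read from a seed holding hand-written mock rows #}
        {{ return(ref('mock_' ~ source_name ~ '__' ~ table_name)) }}
    {% else %}
        {# Everywhere else, read from the real source #}
        {{ return(source(source_name, table_name)) }}
    {% endif %}
{% endmacro %}
```

A model then selects from this macro instead of calling source() directly, which is exactly the mixing of production and test code discussed below, and it still leaves you with only one set of mocked inputs per target.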
I have noticed some progress in this field lately. One promising solution is a DBT plugin I found, which could be a good starting point for addressing this issue. Keep in mind that to use the framework, you have to wire dedicated macros into your production models. I personally do not like mixing production and test code, but with DBT there does not seem to be an alternative at the moment.
Legacy SQL is too complicated
Make your models testable! So many projects are developed recklessly and evolve over time. What used to be a simple select statement has now become a monster that nobody understands anymore. In software engineering we have the concept of a test pyramid: for a given program there are unit, integration, and functional tests, each at a different level of granularity. SQL’s expressiveness is a huge benefit, but in this context it is also a drawback: it is easy to write a complex SQL query that is very difficult to test. It is like wanting to write only functional tests for a program because there is no smaller unit to test!
DBT is an excellent tool for managing the complexity of your models, because it lets you reuse smaller, more focused models, which in turn makes them more easily testable. However, it is important not to confuse this with model refinement, where data is transformed from bronze to silver and gold models; that level of granularity alone is rarely fine enough to make the models easily testable. To keep your SQL code understandable and extendable in the future, it is best to use a clean separation of concerns, CTEs, and intermediate models. This is typically an art rather than a science.
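As a hypothetical illustration of that separation of concerns, a revenue model might be broken into small, individually understandable steps like this (model, table, and column names are invented):

```sql
-- models/marts/fct_daily_revenue.sql (hypothetical model)
with orders as (
    -- keep source selection in one place so it is easy to swap or mock
    select * from {{ ref('stg_orders') }}
),

completed_orders as (
    -- the business rule lives in its own step: only completed, non-refunded orders count
    select *
    from orders
    where status = 'completed'
      and refunded_at is null
),

daily_revenue as (
    -- the actual aggregation, trivial to reason about once the inputs are clean
    select
        order_date,
        sum(amount) as revenue
    from completed_orders
    group by order_date
)

select * from daily_revenue
```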
Watch out, though! CTEs and intermediate views don’t come for free. Take a look at my article on why CTEs can have negative performance consequences for your models.
Convincing clients it is worth investing in testing
I believe the software engineering industry faced a similar challenge over twenty years ago. People wanted to deliver changes quickly while keeping them correct and reliable, but they did not fully understand how to get there. Writing tests can be expensive, especially with approaches like Test-Driven Development that put testing front and center, which makes tests look even more costly. Yet even if you do not use that method, I would argue that fixing poorly written tests later, something every company eventually has to do unless it is completely reckless, is more time-consuming still.
To give a concrete example, in the legacy projects I worked on, 80% of my time was spent writing or fixing poorly written tests rather than developing features. I think the data industry is even further from that state of affairs. If you wanted to introduce tests in a company that has neglected them since its inception, you would be better off building a new platform from the ground up. Yet even with such an investment, bad habits and company and industry culture mean you would eventually end up with a mess, unless you hire a team of strong, devoted engineers to maintain the platform, enforce strict policies, and make sure any corner-cutting is quickly rectified.
Models are not enough: complexity of various integrations
Keep in mind that models are not the only things that need to be tested in your platform. The project as a whole usually involves infrastructure, data pipelines, ingestion and integrations, schema definitions, visualizations, and orchestration. Typically, you cannot rely on just a production environment. With infrastructure as code, it should be relatively easy for you to spin up new environments.
Starting a new environment is just the first step, as you also need to ensure that the environment is properly populated with data. There are generally two options for this: generating artificial data or copying data from the production environment. If you choose to copy the data, you need to consider the right strategy, such as zero-copy clones or data sampling. Additionally, you must ensure that production data is properly masked to protect sensitive information. There are other considerations as well, such as development, staging, and acceptance environments, which must not be neglected if you want people to use them effectively.
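As a sketch of what those options can look like on a warehouse that supports them (Snowflake-style syntax; database, schema, and column names are invented):

```sql
-- 1. Zero-copy clone: an instant copy of production that adds no extra storage
create database analytics_dev clone analytics_prod;

-- 2. Or sample a large table so the development copy stays small and cheap
create table analytics_dev.raw.orders as
select *
from analytics_prod.raw.orders tablesample bernoulli (1);  -- roughly 1% of rows

-- 3. Mask sensitive columns before anyone points a notebook or BI tool at the copy
update analytics_dev.raw.customers
set email = sha2(email),
    phone = null;
```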
Copying data has drawbacks too, as it may not exercise all parts of your integrations, such as the Fivetran connector configuration. Having a development data source from the other party is ideal but not always possible, so you need a strategy for validating that your syncs are correct. And when using third-party integrations, ask yourself whether you can fully trust them without any validation.
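One cheap example of such a validation is a volume check on the synced table, shown below as a warning-level dbt test. The _fivetran_synced column is metadata Fivetran adds to synced tables; the source name and the threshold are illustrative.

```sql
-- tests/warn_orders_sync_volume.sql (hypothetical singular test, warning only)
{{ config(severity = 'warn') }}

-- Flag any of the last 7 days where suspiciously few rows were synced
select
    date_trunc('day', _fivetran_synced) as sync_day,
    count(*) as rows_synced
from {{ source('fivetran_shop', 'orders') }}
where _fivetran_synced >= dateadd('day', -7, current_timestamp)
group by 1
having count(*) < 100  -- illustrative minimum expected daily volume
```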
Having a strategy to identify and respond to any schema violations is also important. Although you may catch them in production, the earlier you detect any discrepancies, the faster you can react and minimize the impact.
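A lightweight way to catch such violations early is to assert the expected column set in a test. The exact syntax varies per warehouse, and the table and column names here are made up.

```sql
-- tests/assert_raw_orders_schema.sql (hypothetical singular test)
-- Returns a row (and therefore fails) whenever a column we depend on disappears
with expected as (
    select column_name
    from (values ('order_id'), ('customer_id'), ('amount'), ('status')) as t(column_name)
),

actual as (
    select lower(column_name) as column_name
    from information_schema.columns
    where lower(table_name) = 'orders'
)

select column_name from expected
except
select column_name from actual
```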
What about the orchestration layer? How do you ensure that your new job is configured correctly and scheduled properly? I’ve seen companies get caught off guard by the fact that orchestrators are typically based on UTC time, resulting in reports being sent at different times during daylight savings time. This may not be acceptable, but without proper means of verification, you may not realize it is happening.
Lastly, how would you go about testing your visualization layer? Some teams make the mistake of only testing the final model consumed by their BI tool, overlooking crucial factors such as loading performance and any custom transformations implemented inside the tool. To limit the impact of these issues, you can take advantage of the tool’s caching capabilities and promote a culture of keeping transformations out of the BI layer, applying them all in the transformation layer instead. Achieving this may require additional tooling for business users, such as data catalogs, so they can quickly find what they need and do not fall back into old habits of duplicating logic within the tool.
Conclusion
Some might argue that this was more of a rant than a regular post, but I had to share my thoughts with the broader community and see if my fellow engineers feel the same way. I pointed out several issues that are contributing to the current testing landscape around data, and I truly want it to change. However, without enough traction and stakeholders who understand the value of testing, the change will come slowly.