DBT repository — to split or not to split?
DBT encourages a monorepo model. Should you split yours? If so, what are the options?
If you are new to DBT, there are so many things to grasp that you might not even ask yourself that question. But as organizations migrate more and more workloads to DBT, they are likely to face some, often unexpected, problems with code organization, CI execution times, and the approval process. Starting with a monorepo is a wise thing to do; after all, it is the simplest model to work with, and it has the potential to greatly reduce the amount of duplicated code the teams produce. Onboarding new members is easy because they can see plenty of code created by other team members. Obviously, a lot of code duplication can be removed with macros, but there are many more benefits to having lots of code examples around, such as picking up naming conventions.
When working with code, we have to think about much more than just getting the SQL out. Companies want control over what gets pushed to production. Changes should be properly tested and reviewed, and processes should be in place so that the relevant people review and approve code that has a direct effect on their domain. Once the repository grows, such tasks get more and more nuanced.
From my experience, most successful DBT deployments have to evolve into a multi-repository model at some point. There is no definitive answer as to when this should happen; a good signal is growing operational overhead around the repository, the PR workflow, or CI jobs.
DBT packages
DBT packages are the answer. Just as software packages provide reusable pieces for engineers to work with (so that they don't reinvent the wheel over and over again), DBT supports similar functionality. Technically, these are regular DBT projects that can be imported into other projects to enhance their functionality. Among the most popular packages are dbt-utils and dbt-expectations.
To use a package, you reference it in the packages.yml file at the root of your DBT project and install it with the dbt deps command. Packages can be public (most often shared on the DBT Hub, though there are other options, such as referencing a public git repository) or private (e.g. imported via a reference to a private git repository; naturally, the machine that deploys the models needs access to that repo).
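A minimal packages.yml sketch combining both options (the private repository URL and the pinned versions below are hypothetical, for illustration only):

# packages.yml
packages:
  # public package from the DBT Hub
  - package: dbt-labs/dbt-utils
    version: 1.1.1
  # private package pulled from a git repository
  - git: "https://github.com/your-org/dbt-common.git"
    revision: "0.7.1"

Running dbt deps then downloads both packages into your project.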
Common repository
Packages can be used to extract common pieces out of multiple projects, which makes them a perfect candidate for introducing a common repository. Most often, you would include domain-agnostic models, source definitions, and macros, and then import such a package into all domain- or team-specific DBT projects. This works in favor of the DRY principle.
Since this is most often proprietary code, you can use the private git repository integration to include it in your projects while keeping the repositories separate.
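For example, a source definition that every team reuses could live in the common package; here is a sketch with hypothetical schema and table names:

# models/sources/raw_app.yml in the common package (names are hypothetical)
version: 2
sources:
  - name: raw_app
    schema: raw
    tables:
      - name: users
      - name: events

Downstream projects that import the package can then write {{ source('raw_app', 'users') }} without redefining the source.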
There is also an option to reorganize your monorepo in a way that makes it easier to decouple it into a few separate repositories later. DBT supports local packages; all you need to provide is a relative-path reference to the embedded DBT project:
# in packages.yml
packages:
  - local: subprojects/finance
This is the easiest way to get started with reorganizing your code. Whether you go with private git repos or local packages, you can use cross-project references. For instance, the ref function accepts an optional first argument naming the project:

ref('[subproject name]', '[model name]')
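In practice, a model in one subproject can build on top of another subproject's model; a sketch with hypothetical project, model, and column names:

-- models/marketing/campaign_revenue.sql (all names here are hypothetical)
select
    campaign_id,
    sum(amount) as revenue
from {{ ref('finance', 'fct_payments') }}
group by campaign_id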
Such an organization doesn't solve the problems with the PR management process and long CI times, yet it prepares you for a seamless transition into a multi-repository model with a common repo imported by all subprojects.
Semantic versioning of a common repository
Shared repositories should be versioned. If you are not coming from a software engineering background, you might not be aware of what semantic versioning is. In short, it is a unified way of keeping track of changes in packages. The version usually consists of three numbers separated by dots, like 0.7.1. In this example:
- 1 refers to the patch version. We bump it when we enhance or fix bugs in existing functionality. The next patch version in our example would be 0.7.2.
- 7 refers to the minor version. This gets bumped when we introduce new functionality without any breaking changes. The next version would be 0.8.0.
- 0 refers to the major version. It is bumped whenever a breaking change is introduced (functionality is removed, the API changes drastically, etc.: anything that may require users of the library to migrate their code in order to use the new version). Projects don't update their major versions too often. Here, it would be version 1.0.0.
In the case of DBT projects, the patch version would change on any bug fix in queries, the minor version on adding a new source, and the major version on, say, a Stitch-to-Fivetran migration where existing source tables need to be swapped out (with incompatible schema changes). This is just an example, though; your team may have a different policy on when to bump which version number.
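To make the versions actionable, downstream projects can pin a tagged release of the common package in packages.yml; a sketch, assuming the repository URL and tag are placeholders:

# packages.yml of a domain project (URL and tag are hypothetical)
packages:
  - git: "https://github.com/your-org/dbt-common.git"
    revision: "0.7.1"  # pin a semver git tag so upgrades are deliberate

Bumping the revision then becomes an explicit, reviewable change in each consuming repository.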
Multi-repo enhancements
Having a common project with domain- or team-level projects that inherit from it is just one option. More advanced organizations can arrange their code in a more layered structure, where each domain repository is split into smaller repositories with clear responsibilities such as data cleansing, populating the warehouse, backing BI reports, and more. Maybe it makes sense for your organization to have a single common repository responsible for all data cleansing and deduplication.
Once you understand this model and have it integrated with your tooling, you can tweak it to best suit your organization's needs.
Have processes in place
Code organization is one thing; leveraging the new structure is another matter entirely. Think about what other benefits your organization may get by splitting the code into modules. You have full flexibility to decide at the repository level on:
- the dev-to-QA and QA-to-production code promotion policy
- the data access policy
- the people responsible for the approval process
Having one large monorepo makes it harder to decide on such things. The majority of changes touch just a handful of views and impact only a small portion of the organization, yet this is not immediately obvious from looking at the PR. With more granular repositories, we can more directly estimate the potential impact and identify the people responsible for the necessary checks.
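One simple way to encode the approval policy per repository is a code owners file; a minimal sketch for GitHub, assuming hypothetical team names:

# .github/CODEOWNERS in the finance repository (team names are hypothetical)
models/        @your-org/finance-analytics
macros/        @your-org/analytics-platform
packages.yml   @your-org/analytics-platform

With this in place, the platform team is automatically requested to review dependency bumps, while the domain team owns its own models.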
Lineage and docs
One of the drawbacks of splitting a repository into multiple ones is the loss of the global lineage graph and common documentation. These are huge selling points of DBT, and it would be a shame if we could do nothing about it.
One of the simplest options is to create another project that imports all domain repositories just for documentation purposes. Running dbt docs generate against such a repository will produce documentation for all projects that you can explore. It also gives you the possibility to look at the documentation for a single project.
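A sketch of such a documentation-only aggregator project's packages.yml (the repository URLs and versions are hypothetical):

# packages.yml of the docs aggregator project
packages:
  - git: "https://github.com/your-org/dbt-common.git"
    revision: "0.7.1"
  - git: "https://github.com/your-org/dbt-finance.git"
    revision: "1.2.0"
  - git: "https://github.com/your-org/dbt-marketing.git"
    revision: "0.9.3"

After dbt deps, running dbt docs generate in this project produces a single documentation site covering every imported package.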
Some companies have their own lineage infrastructure based on OpenLineage and tools such as Marquez. OpenLineage offers native DBT integration, so in that case the common-repository workaround might not be necessary.
It is also important to point out that if you have multiple DBT projects running against different data warehouses (not a common use case, though), the common project approach won't work, and you will need to set up some external infrastructure anyway. A single DBT project can only be executed against one warehouse, unless you use tools like Trino on top.
Conclusion
At a certain scale, moving out of a monorepo seems like a smart step. The whole transition can be done gradually, and the repository relationship structure may evolve over time. By using local packages to decouple the monorepo first, users can then smoothly transition to external repositories.
You now understand the power of DBT packages and how they can be leveraged to reorganize your repository structure. Finally, we explored options for keeping the global documentation and lineage graph intact.