The reasons why I like data mesh
Democratize your data without a dependency on a centralized team
Data mesh has recently become one of the hottest topics in the data industry. Even though it is not a concrete technology or product, it has just as heavy an impact on the space. It is a data architecture pattern that promotes data decentralization and democratization, with a common governance body that supports data discovery and helps domain teams deliver their data products.
If you have a software engineering background, you can think of data mesh as a microservice architecture for data platforms. Just as microservices changed the software architecture space, data mesh is on its way to revolutionize data architecture.
Zhamak Dehghani identified four principles of data mesh:
- Domain ownership: each domain team is responsible for its own datasets. Central data engineering teams are no longer a bottleneck.
- Data as a product: data is not created as an accidental byproduct. It is a foundation. Teams expose datasets that can later be used by other teams.
- Self-serve data platform: domain teams are able to spin up a new data plane quickly and independently. This increases the pace at which they can deliver new data products.
- Federated governance: data products are standardized and adhere to internal rules and regulations. This includes shared data formats, a common data warehouse, etc.
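To make "data as a product" and "federated governance" a bit more tangible, here is a minimal Python sketch of a data product descriptor that is checked against shared rules. Everything in it is an assumption made for illustration: the `DataProduct` class, the `ALLOWED_FORMATS` and `REQUIRED_METADATA` rules and the example products are hypothetical and do not come from any particular data mesh tool.

```python
from dataclasses import dataclass, field

# Hypothetical rules agreed on by the governance body (illustrative only).
ALLOWED_FORMATS = {"parquet", "avro", "warehouse_table"}
REQUIRED_METADATA = ("owner_team", "description")


@dataclass
class DataProduct:
    """A dataset that a domain team publishes for other teams to consume."""
    name: str
    domain: str
    data_format: str
    metadata: dict = field(default_factory=dict)


def governance_violations(product: DataProduct) -> list[str]:
    """Return human-readable rule violations; an empty list means compliant."""
    violations = []
    if product.data_format not in ALLOWED_FORMATS:
        violations.append(
            f"format '{product.data_format}' is not an approved shared format"
        )
    for key in REQUIRED_METADATA:
        if not product.metadata.get(key):
            violations.append(f"missing required metadata field '{key}'")
    return violations


# A compliant product owned by a hypothetical sales domain.
orders = DataProduct(
    name="daily_orders",
    domain="sales",
    data_format="parquet",
    metadata={
        "owner_team": "sales-analytics",
        "description": "Orders aggregated per day",
    },
)
# A draft product that uses an unapproved format and has no metadata yet.
draft = DataProduct(name="raw_clicks", domain="marketing", data_format="csv")

print(governance_violations(orders))
print(governance_violations(draft))
```

A check like this could run in a domain team's own pipeline, so the shared rules are enforced without a central team reviewing every change.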
There is no fixed tech stack that delivers all this. Data mesh can serve as the target architecture of a system or as a general set of guidelines to aspire to with existing platforms. I believe that smaller teams especially should not aim at deploying full platforms, but should rather cherry-pick bits and pieces that suit their current needs and make their platform more robust and mature.
Usually a data mesh has some warehouse at its heart, be it Snowflake, BigQuery or similar. For the domain data plane itself you can spin up cloud native solutions or build transformations with DBT. The options are endless, and data mesh actually encourages this flexibility, even across domains in the same organization.
More than fabric, lake or warehouse
Data fabric is somewhat related to data mesh, and the distinction can be difficult to grasp with all those new buzzwords around. Data fabric refers to a centralized platform where all data lands, often built on top of a common data lake or data warehouse. With fabric, companies usually end up with a centralized team that everybody depends on. In a decentralized model, domain teams can independently develop new data products faster and with a better understanding of the business context.
Of course, in order to reuse datasets across domains you need some sort of standardization, which is why the governance body is an integral part of a data mesh. In practice this will usually mean that you share the data warehouse or object storage for your data products.
The hub and spoke model is another approach that mixes a centralized fabric with decentralized domain teams. The central team (hub) is responsible for pulling sources into a common platform, while domain teams (spokes) build rich products on top of that high quality data.
For me, data mesh is just an evolution of the data fabric concept. Many companies will still want to have a central team responsible for setting up infrastructure, monitoring pipelines and promoting best practices for domain teams. In such a model it doesn’t really matter whether we call it hub and spoke or data mesh, as long as it brings business value. Let’s now look into the benefits of decentralized data platforms in general.
The first benefit is scalability, in terms of performance, data volumes and delivery speed alike. With a cloud-based platform it is really cheap to spin up new, independent platforms. This has its benefits and drawbacks: some duplication across domains will happen no matter how hard we try. Usually that is something we actually want, because it simplifies the dependency chain. Trying to reuse everything is just as bad as always copying code around. These are opposite ends of the spectrum and neither is better.
The benefit for delivery scalability is twofold:
- domain teams can fix bugs and improve existing data products independently
- new domain teams can be built by hiring more people, and with a self-serve platform the effort required to build a plane for them is minimal
Domain teams should not worry about domain-agnostic infrastructure. Usually at the heart of a data-driven company there will be a single data warehouse with data products. Provisioning roles, databases and compute should be something that happens automatically. We don’t want domain teams to be responsible for shared infrastructure, as this could lead us down the path of a single domain dominating the platform. In such a model, other domains would be forced to build many workarounds for their problems, because they have little influence on the platform. This is exactly what usually happens with a centralized data platform and why data mesh is so tempting.
A domain's main responsibility is to take care of its data model, schemas, transformations, ETL / ELT and data product metadata management. A domain may also be responsible for certain parts of its own specific infrastructure. There is nothing stopping a team from spinning up its own Redis instance if it has a use case that justifies it. It is also possible that the central infrastructure team provisions and manages that instance. After all, we have a governance body to make such decisions on a case by case basis.
As for a concrete implementation, I’ve seen platforms where Terraform scripts, executed from GitHub pipelines by a GitHub App, provisioned the data plane for a domain team. Such a pipeline can provision all necessary resources in the shared warehouse, create a DBT project automatically, register labels for billing purposes, prepare monitoring and alerting infrastructure and much more. The infrastructure as code (IaC) approach really helps with such tasks. With all accounts and permissions in place, it is much easier for a domain team to define their own Fivetran pipes, spin up DBT models and have an instant impact on the business.
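The core idea behind such a pipeline is that every resource for a new domain is derived from one shared naming convention. The sketch below illustrates this in Python rather than Terraform. It is only an assumption of what such a convention could look like: `DataPlanePlan`, `plan_data_plane` and the `_DB`, `_ENGINEER` and `_WH` suffixes are hypothetical names, not part of Terraform, DBT or any warehouse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPlanePlan:
    """Names of the resources a pipeline would create for one domain."""
    database: str
    role: str
    compute: str
    dbt_project: str
    billing_labels: dict


def plan_data_plane(domain: str, environment: str = "prod") -> DataPlanePlan:
    """Derive all resource names from the domain name (hypothetical convention)."""
    slug = domain.strip().lower().replace(" ", "_").replace("-", "_")
    # Reject empty names and anything other than letters, digits and underscores.
    if not slug.replace("_", "").isalnum():
        raise ValueError(f"domain name {domain!r} contains unsupported characters")
    prefix = f"{environment}_{slug}".upper()
    return DataPlanePlan(
        database=f"{prefix}_DB",
        role=f"{prefix}_ENGINEER",
        compute=f"{prefix}_WH",
        dbt_project=f"dbt_{slug}",
        billing_labels={"domain": slug, "environment": environment},
    )


plan = plan_data_plane("Customer Success")
print(plan.database)     # PROD_CUSTOMER_SUCCESS_DB
print(plan.dbt_project)  # dbt_customer_success
```

In a real setup the resulting plan would be handed to the provisioning tool, so that a new domain only has to supply its name.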
Fewer bottlenecks, faster delivery
Decentralization brings faster delivery. No longer is the central data team a blocker for introducing new changes. I’ve seen companies where different teams could build pipelines in other domains on top of a centralized platform. Although this removed a bottleneck and resulted in faster delivery, the long term consequences were detrimental. All the data had to pass through a central data lake, which eventually became a data swamp. Pipelines were vastly different and used code-level hacks because their authors did not understand the bigger picture. Four different methods of dumping data to S3, all in one repository? Please, don’t do it.
With domain teams it is less likely that people from other domains will want to contribute to areas managed by your domain. Domain code uses domain-specific vocabulary, so it is easier for people to understand what is going on. All this adds up to less struggle when building new functionality or fixing what already exists.
Central governing body
Teams work in silos not only due to technological limitations. Another major factor is that they don’t talk to each other and they don’t share ideas. Even if they do, it happens on some abstract, artificial layer, like weekly tech presentations. Don’t get me wrong, this is definitely a move in the right direction. But I am a person who believes in direct action, and watching a presentation on technical nuances implemented by one of the teams has little impact on my daily efficiency.
With a governing body, teams establish common patterns that have an impact on all teams. They are free to discuss problems they face with domain-specific infrastructure and get advice from people across the company.
Shared parts of the system are well thought out and interfaces are established to suit everyone. If existing interfaces don’t work, we can get input from many engineers on the best way to move forward. The decisions may evolve over time. At some point the warehouse may not be enough and we will need a way to standardize things such as Kafka topics, S3 data formats and bucket structure, or the machine learning model export format. I encourage teams to discuss any format or infrastructure decision they make, even if it applies to an internal data product. If you are considering introducing a new platform or storage, chances are another domain has discussed this as well. This might be a good opportunity to convince the infrastructure team to provision yet another shared service for all domains to use if needed.
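As an illustration of what standardizing a bucket structure could mean in practice, here is a small Python sketch that builds and validates object keys against one agreed layout. The layout itself (`<domain>/<product>/v<version>/dt=<date>/<file>.parquet`) and the helpers `build_key` and `is_compliant` are assumptions made up for this example. A real governance body would pick its own convention.

```python
import re
from datetime import date

# Hypothetical shared layout:
#   <domain>/<product>/v<version>/dt=<YYYY-MM-DD>/<file>.parquet
KEY_PATTERN = re.compile(
    r"^(?P<domain>[a-z0-9_]+)/(?P<product>[a-z0-9_]+)/v(?P<version>\d+)/"
    r"dt=(?P<day>\d{4}-\d{2}-\d{2})/[^/]+\.parquet$"
)


def build_key(domain: str, product: str, version: int, day: date, filename: str) -> str:
    """Build an object key and make sure it follows the shared layout."""
    key = f"{domain}/{product}/v{version}/dt={day.isoformat()}/{filename}.parquet"
    if not KEY_PATTERN.match(key):
        raise ValueError(f"key {key!r} does not follow the shared layout")
    return key


def is_compliant(key: str) -> bool:
    """Check whether an existing object key follows the shared layout."""
    return KEY_PATTERN.match(key) is not None


key = build_key("sales", "daily_orders", 2, date(2024, 1, 15), "part-0000")
print(key)                             # sales/daily_orders/v2/dt=2024-01-15/part-0000.parquet
print(is_compliant("sales/dump.csv"))  # False
```

With a single helper like this shared across domains, the "four different methods of dumping data to S3" problem becomes much harder to recreate.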
We’ve seen many reasons why data mesh is superior to central data platforms, and I hope you get why I both encourage clients to introduce it and help them set it up. Like any tool or framework, it is not a definitive answer to every problem, but it is definitely a huge leap forward, made possible mostly by cloud and SQL. You should now understand that this is not a specific tool or platform, but rather an architecture or even a philosophy to follow in your data projects.
Even though microservice architecture is omnipresent, there are still monolithic projects out there. The most important rule is not to over-engineer things. If you have a small data team and just a few use cases, deploying data mesh is not for you. Of course, being inspired by it is not a bad thing. Going overboard is.