Is the star schema still a thing in 2024? A closer look at OBTs
What are OBTs, and why do they excel in streaming?
In the era of distributed data warehouses and affordable cloud storage, some companies tend to overlook the need for proper data modeling. With the emergence of ELT tools, the growing complexity of data, and the shift away from batch processing, the temptation to be flexible with schemas is stronger than ever. If you examine the cloud bill for your analytical workloads closely, you’ll notice that the majority of costs stem from compute resources. Denormalizing your models to save valuable time on joins becomes enticing. In the age of OBTs, is there still a need to care about models such as the star schema?
OBT = One Big Table
If you’re old enough, you might recall the days when analytical workloads ran directly against production OLTP databases. In that era, we constructed materialized views that pre-joined multiple tables, resulting in a single, wide, denormalized table. This approach made data consumption and understanding much simpler for business users, relieving them of the complexity of managing table relationships. Ideally, every query could be answered from a single view, often by fetching just a single row.
In theory, at least. In practice, maintaining these views in row-based databases was a difficult task: they were resource-intensive, challenging to update, and not particularly efficient. Fast forward to 2024, and we find ourselves in a different world: abundant compute power, cost-effective storage, columnar data formats, aggressive shard pruning, efficient push-down filters, and lower maintenance overhead thanks to SaaS warehouses.
One may argue that wide tables violate the DRY principle, and they wouldn’t be wrong. Advocates of OBTs counter that in modern cloud environments, the cost of joins outweighs the extra storage expense, and that querying wide tables is simply faster. This comes at a cost, though.
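To make the trade-off concrete, here is a minimal sketch using DuckDB as a stand-in warehouse; the table and column names (dim_product, fct_order, obt_orders) are purely illustrative. The star schema pays for the join on every read, while the OBT pays for it once, at build time.

```python
import duckdb

con = duckdb.connect()  # in-memory stand-in for a warehouse

# Hypothetical star schema: one fact table plus one dimension.
con.execute("CREATE TABLE dim_product (product_id INT, name TEXT, category TEXT)")
con.execute("CREATE TABLE fct_order (order_id INT, product_id INT, amount DECIMAL(10, 2))")

# Star schema read path: every query pays for the join.
star_query = """
    SELECT p.category, SUM(o.amount) AS revenue
    FROM fct_order o
    JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""

# OBT: the join is paid once, when the wide table is built...
con.execute("""
    CREATE TABLE obt_orders AS
    SELECT o.order_id, o.amount, p.product_id, p.name, p.category
    FROM fct_order o
    JOIN dim_product p USING (product_id)
""")

# ...and reads become single-table scans.
obt_query = "SELECT category, SUM(amount) AS revenue FROM obt_orders GROUP BY category"
```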
Managing change
One of the most challenging aspects of OBTs, as with any denormalized model, is dealing with change. In a star schema, an update to a dimension is a single write that is immediately reflected everywhere through the join. With an OBT, the same update requires modifying every affected row, and merely identifying those rows becomes problematic, especially at scale.
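To see the write amplification, here is a small sketch, again using DuckDB with illustrative table names: the star schema absorbs a dimension change with a single write, while the OBT has to find and rewrite every row that embedded the old value.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE dim_product (product_id INT, category TEXT)")
con.execute("CREATE TABLE obt_orders (order_id INT, product_id INT, amount DECIMAL(10, 2), category TEXT)")

# Star schema: recategorizing a product is a single-row write.
con.execute("UPDATE dim_product SET category = 'outdoor' WHERE product_id = 42")

# OBT: the same change must rewrite every order row that embedded the old
# value, and at scale even finding those rows is expensive.
con.execute("UPDATE obt_orders SET category = 'outdoor' WHERE product_id = 42")
```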
Schema changes pose similar difficulties. An OBT functions like an API layer or a data product: you have to maintain schema compatibility so that end users, such as services or BI tools, don’t get confused or break. While some schema changes, like adding a new column, are relatively easy with columnar storage, they are far from trivial.
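As a small illustration of why adding a column is the easy case, here is a hedged DuckDB sketch with made-up column names: the addition is cheap, but anything that renames or drops a column breaks every consumer of the OBT, which is exactly why its schema has to be managed like an API.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE obt_orders (order_id INT, amount DECIMAL(10, 2), category TEXT)")

# Adding a column is cheap in columnar storage: existing rows simply read NULL.
con.execute("ALTER TABLE obt_orders ADD COLUMN supplier_id INT")

# Renaming or dropping a column, on the other hand, is a breaking change for
# every dashboard and service that reads the OBT.
# con.execute("ALTER TABLE obt_orders DROP COLUMN category")
```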
Nonetheless, this challenge might not be as problematic as it initially appears. In batch ETL workloads, data is often somewhat stale anyway (sources are usually refreshed on a schedule), and business users accept these inaccuracies. Few companies have prioritized real-time updates to their analytical dashboards.
However, the landscape is changing. The emergence of new streaming technologies, coupled with initiatives from companies like Aiven, Databricks, or Confluent, is pushing for real-time updates in analytical workloads. In my view, OBTs excel in streaming use cases.
Source of truth in real-time systems
If you want to build a resilient, fault-tolerant, and accurate streaming platform, you must shift your perspective on what the source of truth for your data is. Merely depending on the state of tables in your data warehouse, which can quickly become outdated, is no longer sufficient. Instead, you must rely on the data streams themselves. Investing in robust Change Data Capture (CDC) infrastructure becomes as important as the Fivetran or Airbyte pipelines you once relied on to accurately move data from production databases to your warehouse. Depending on the state of an external system often undermines correctness, particularly in recovery scenarios.
Streaming engines typically operate in distributed environments, where unexpected issues can arise. Machines may fail, commits might not succeed, network connectivity between nodes could be disrupted, APIs used in enrichment jobs might become unavailable, bugs in your code or the streaming engine itself could emerge — the possibilities are endless.
Let’s consider using a star schema as a data model for a streaming workload, with two streaming pipelines:
- the first pipeline updates a dimension table in real time,
- the second pipeline updates your fact table in real time.
In a typical scenario, the dimension in the source system would be created well before, or at the same time as, the fact. This is essentially guaranteed, as the foreign key relationship in the source database enforces it. Ideally, records from both would arrive at the engine at roughly the same time, be processed, and trigger real-time updates to the warehouse. Seems straightforward, right?
Now, consider a situation where the processing node for the job handling dimension data goes down (alternatively, an event could be lost, or events might get reordered at the network level, among other potential issues). The fact table job would continue processing data as usual, populating the warehouse. This could lead to a couple of outcomes:
- If your warehouse doesn’t enforce foreign key relationships between tables, you might end up with a record that violates these constraints. This could be acceptable or problematic depending on your use case.
- If the warehouse does enforce these relationships, it’s likely that the fact table job would fail. Now imagine tens or hundreds of jobs relying on the same dimension table: the simultaneous failure of all of them could overwhelm your SRE team, diverting attention from the root cause of the issue.
There are numerous cases where relying on coordination between independent streaming jobs could lead to issues; the one mentioned is just one example. That’s why it’s crucial to leverage streams as the primary source of data in your streaming jobs whenever feasible.
Now, let’s consider building an OBT instead. In this scenario, we have a single streaming job that subscribes to all necessary topics, conducts the join operation, and saves the result back to the warehouse. In our diagram, it would be just a single job. However, you can imagine a scenario where multiple fact tables rely on the products dimension. In such a case, each job responsible for building that OBT would independently consume the products topic.
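To make this concrete, here is a minimal sketch of such a job using Flink’s Table API (PyFlink); the topic names, connector options, and the JDBC sink pointing at the warehouse are illustrative assumptions rather than a prescribed setup.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# One job: consume both topics, join them, and emit a single wide table.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        product_id BIGINT,
        amount DECIMAL(10, 2)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE products (
        product_id BIGINT,
        name STRING,
        category STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'products',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE obt_orders (
        order_id BIGINT,
        amount DECIMAL(10, 2),
        product_name STRING,
        category STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://warehouse:5432/analytics',
        'table-name' = 'obt_orders'
    )
""")

# The engine keeps both sides in state until matching rows arrive, so a late
# or missing product delays the wide record instead of emitting an orphan.
t_env.execute_sql("""
    INSERT INTO obt_orders
    SELECT o.order_id, o.amount, p.name, p.category
    FROM orders AS o
    JOIN products AS p ON o.product_id = p.product_id
""")
```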
In the event of a job failure, we would simply see no new records in the output at all. This significantly limits the blast radius, at the cost of redundant topic consumption, delays in case of failure, and increased storage in the warehouse.
Does this approach solve the update issue?
The painful truth is that it doesn’t. Updates to dimensions (and to other facts, if they are joined in) remain resource-intensive and expensive. Usually, it is a game of trade-offs:
- if you prioritize correctness at all times, you essentially need to keep all source records in the job’s state. Fortunately, RocksDB, the database engine commonly used for state management in major streaming engines, has proven capable of handling very large state (terabytes and beyond),
- you can help your streaming engine limit the amount of state it keeps by using interval or temporal joins instead of regular joins,
- alternatively, you can resort to lookup tables if you only care about the latest version of a dimension. This could take the form of external data enrichment against a fast, in-memory store like Redis (see the sketch after this list).
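Here is a minimal sketch of that lookup variant, using kafka-python and redis-py; the topic names, the product:<id> key layout, and the assumption that a separate CDC consumer keeps Redis up to date are all hypothetical.

```python
import json

import redis
from kafka import KafkaConsumer, KafkaProducer

# Keep only the latest product version in Redis instead of in join state.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
consumer = KafkaConsumer("orders", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    order = message.value
    # A separate CDC consumer is assumed to keep these hashes current.
    product = r.hgetall(f"product:{order['product_id']}")
    wide_record = {**order,
                   "product_name": product.get("name"),
                   "category": product.get("category")}
    producer.send("obt_orders", wide_record)
```

The trade-off is that event-time correctness is gone: whichever product version happens to sit in Redis at processing time wins.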
Schema changes are often tricky, but if you run an event-based or streaming architecture, you should already be thinking in these terms. With technologies such as Avro and schema registries being prevalent in the streaming world, it’s nearly impossible not to. If you’re unfamiliar with these concepts, I recommend exploring them on your own, as they’re somewhat beyond the scope of this article.
Building OBTs from a star schema
The good news is that if you already have a star schema in place, it’s relatively straightforward to construct an OBT from your data. If you’re using a tool like dbt or a modern data platform, it’s simply a matter of writing a new model, and you can likely accomplish it with your existing ETL platform. But why would you want to?
OBTs offer several advantages over a fully normalized data model. They are typically simpler to build, which makes them particularly appealing for small teams. Managing a complex, normalized data model with numerous data products and ad-hoc queries can become troublesome, especially when you get flooded with questions from business users about where to find the data they need. With an OBT, you can simply point them to a single table.
However, this doesn’t mean you can eliminate data modeling entirely. OBTs still require careful design. Here are some considerations:
- what columns should the OBT include? Even though storage is inexpensive, including every column and performing hundreds of joins to build them can quickly lead to a data swamp. It’s essential to strike a balance between one sprawling OBT and many smaller, specialized ones,
- how relevant are updates to the dimensions? You don’t want a system where each change to a dimension triggers millions of updates, resulting in significant write amplification. Streaming jobs would get stuck rebuilding wide records, and in the batch case your warehouse would constantly rewrite ever-changing rows, further increasing the duration of incremental builds,
- what is the physical data format? With larger data volumes, it’s essential to lay the data out for efficient querying. This could mean choosing the right partitioning scheme or distribution key (see the sketch after this list).
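As an example of the physical-layout point, here is a small sketch that writes an OBT as date-partitioned Parquet files with pandas; the column names and output path are illustrative.

```python
import pandas as pd

# A tiny, hypothetical slice of the wide table with a date column to partition on.
obt = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
    "category": ["garden", "kitchen", "garden"],
    "event_date": ["2024-03-01", "2024-03-01", "2024-03-02"],
})

# Partitioning by event_date lets engines prune whole directories when a
# query filters on date, instead of scanning the entire OBT.
obt.to_parquet("warehouse/obt_orders", partition_cols=["event_date"])
```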
OBTs can also become a handy resource while migrating from batch to streaming. Comparing results between an existing batch platform and a newly created streaming platform isn’t always straightforward; by building OBTs in both environments, you can compare apples to apples much more easily.
Is it wise to get rid of modeling completely?
Should you prioritize performance over modeling in your project? I can’t say for sure; every project is different, and the decision depends on your specific use case. When leading tech projects, I prefer to begin with an analysis of the use cases. People, especially less tech-savvy ones, may believe they need streaming or other sophisticated technology to solve their problems, but this isn’t always the case.
However, there’s no denying that an increasing number of companies genuinely need streaming systems to meet their use cases. They’re recognizing that real-time analytics and operations can generate revenue and provide a competitive advantage, while batch processes increasingly fail to deliver that edge in many domains. I anticipate this trend will only accelerate in the coming years, and OBTs serve as a straightforward entry point for many.