The growing importance of data governance

With more and more AI integrations coming, the importance of governance will only expand. Explore the current landscape and discover essential tools for effective implementation!

Adrian Bednarz
10 min read · Jun 29, 2024
Photo by Kaffeebart on Unsplash

Data governance evokes mixed reactions. Many vendors promote their governance solutions, constantly introducing new features. Executives tend to go along with this trend, frequently hiring external companies to integrate governance into their workflows — mostly out of fear of serious compliance fines. Meanwhile, developers and data engineers find themselves in a new reality with restricted access to production data, leading to frustrations with debugging and bug fixing.

It’s clear that making any software project robust and well-maintained requires both budget and manpower (at least until AI, like Devin, takes over our jobs). Many engineers in companies with strict governance policies find these policies challenging to work with, often nonsensical, and end up looking for ways to bypass the rules to access the data they need, despite strict security measures. This is obviously problematic for everyone involved. Yet it is hardly worse than a company with top-notch governance solutions that still shares passwords in plaintext, or one that cares about governance only in its cold storage while leaving other data sources (Kafka topics, NoSQL databases) full of unmasked PII, accessible to anyone.

Is all hope lost? I want to show you that governance is actually here to help, but it needs to be done correctly. I won’t pretend it’s easy. If you’ve been in the industry for a while, you’ll need to break a lot of old habits. Less experienced individuals face a different set of challenges: the complexity of modern data systems is higher than ever, and grasping all of it with governance layered on top can be overwhelming.

What is data governance?

Unless you’ve been living under a rock, you’ve probably heard about data governance and have a general idea of what it means. In short, it is about making the data secure, available, and of high quality. However, like any buzzword, it comes with many misconceptions. So, let’s establish some basics about what I mean by governance. Keep in mind, this definition can vary from person to person.

  1. As a software engineer, you would consider governance in terms of data security and data validation. For you, the APIs need to adhere to company-wide standards since they might expose some of the data stored in the warehouse. In a distributed environment, maintaining quality might involve thinking about distributed transactions and consistency. You also need to document code that handles sensitive information and establish access policies. One of your biggest challenges is figuring out how to access production data when debugging issues.
  2. As a data engineer, governance is primarily about data quality, which extends beyond schema compliance to include data semantics. You’re often held responsible if pipelines produce incorrect results, so you pay close attention to data lineage and metadata management to ensure business users can easily find the datasets they need. You’re also concerned with compliance with regulations like GDPR, HIPAA, or CCPA, particularly regarding who can access what data and whether raw PII can be stored in the cloud.
  3. As a DevOps engineer, you focus on defining the right roles for entities (users or jobs) that access data. You want this to be scalable, ideally with a self-serve platform for common use cases. Enabling the SRE team with visibility through metrics, logging, and dashboards is crucial. You understand that security and compliance are tightly connected with configuration management, making Infrastructure as Code (IaC) essential. Backup strategies are also vital, as things can go wrong at the least expected moments. Additionally, gathering audit logs and monitoring for potential breach risks are key components of your responsibilities.
  4. As a business analyst, your primary goal is to work with clean, accurate, well-documented, high-quality data without having to worry about which data you’re allowed to access. You need a streamlined process to quickly request missing access, although having all data readily available would be ideal. Data accessibility in your preferred reporting tool is essential. Understanding data lineage is desirable, although in my experience it isn’t commonly utilized by analysts. Collaboration with engineers to ensure easy access to underutilized data is also important to you.
  5. As an executive, you recognize the strategic value of data as a competitive asset that must be protected. Data governance plays a crucial role in preventing data breaches, ensuring compliance, and mitigating data loss. It involves establishing policies, hiring responsible stewards, and implementing frameworks across teams. You also consider the ROI of governance initiatives, weighing their financial impact and value. Auditing and managing access permissions are key concerns, ensuring that the right individuals have appropriate access. Additionally, you aim to maintain development pace without compromising on governance efforts.

Is governance all that necessary?

Even though every persona has their own perspective, it all boils down to a couple of important arguments:

  • Privacy and compliance with regulations like GDPR, HIPAA, and CCPA are crucial. Companies must avoid the risk of financial ruin due to non-compliance.
  • Working in silos leads to inefficiencies and duplicated efforts. Implementing effective governance promotes data reusability.
  • Human error is inevitable, and security breaches can occur. Governance helps mitigate the risk of data exposure in the event of a breach.

Governance isn’t just about risk avoidance or making engineers’ life difficult. A well-implemented governance strategy offers several benefits:

  • Datasets become more discoverable, ensuring high quality and enabling automatic and quick bug detection.
  • Onboarding new datasets from existing sources becomes easier, and access is limited to necessary data, reducing costs. Self-service platforms simplify the data access request process.
  • It enhances dataset documentation and promotes cross-team collaboration. Clean datasets simplify AI integrations.

What are some success measures to track?

Even if you recognize the importance of governance, estimating the associated costs can be challenging. Without clear KPIs, it is hard to run any project. Consider how many dedicated personnel most companies allocate to governance — if you work for an average company, the number is likely low. There’s a misconception that governance does not tangibly contribute to driving business growth.

You might have your own feelings about governance based on your experiences. Do you believe it simply slows you down? Let me ask you this: if you have a background in software engineering, do you think adding end-to-end tests to a project also impedes progress? While these policies may not directly expedite the development of data pipelines, they play a crucial role in the broader architecture.

People have to advocate for making governance a top priority in their projects; ignoring it or relegating it to the sidelines should not be an option. It’s crucial to establish the right measures. Here are some ideas to consider:

  • Number of decisions made per month based on data.
  • Cost savings, business opportunities, and revenue growth attributable to these decisions.
  • Sentiment analysis of stakeholder feedback.
  • Time spent in cross-functional meetings due to inefficient collaboration.
  • Data quality measures: including data errors, reported bugs, and data completeness fraction.
  • Number of security incidents.
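As an illustration, the data-completeness fraction mentioned above is straightforward to track in code. The sketch below is a minimal example; the record fields and the sample data are illustrative assumptions, not a reference implementation.

```python
# Sketch: tracking a data-completeness KPI over a batch of records.
# The field names and the sample records are illustrative assumptions.

def completeness_fraction(records, required_fields):
    """Fraction of records in which every required field is present and non-null."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    return complete / len(records)

records = [
    {"user_id": 1, "email": "a@example.com", "country": "PL"},
    {"user_id": 2, "email": None, "country": "DE"},
    {"user_id": 3, "email": "c@example.com", "country": None},
    {"user_id": 4, "email": "d@example.com", "country": "US"},
]

score = completeness_fraction(records, ["user_id", "email", "country"])
print(f"completeness: {score:.2f}")  # 2 of 4 records are complete
```

Emitting a number like this per dataset per day gives you a trend line to report against, which is usually more persuasive to stakeholders than anecdotes.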

Keep track of these numbers as you bring more and more datasets and use cases onto the governance journey.

How much governance is enough?

People vary in their opinions on how much investment a project should allocate to governance. The same diversity applies to the tools available in the market. Many vendors focus on specific aspects of the governance model: masking sensitive data, data catalogs, schema enforcement, or approval workflows for new data products. While these aspects are crucial, no single tool encompasses the full spectrum of governance needs.

Governance requires a meta layer above all the data sources in use, one that validates that access policies are applied properly. These tasks often prove challenging to automate, and there’s a lack of dedicated platforms to manage the variety of sources effectively. With the advent of AI, data quality and correct access patterns will become pivotal in any AI solution. Customers are starting to raise concerns about their data privacy. Without a robust audit framework for data access, particularly to sensitive data, you’re asking for trouble and for questions that are impossible to answer.

Modern platforms usually aren’t equally focused on all governance aspects. One area that comes to mind is proper data testing. While strict, backward-compatible schemas are essential for integration, they don’t prevent passing blatantly irrelevant data between systems that nonetheless conforms to the schema. Achieving high data quality requires smarter tests that go beyond basic DBT unit tests. Just as software requires rigorous testing before adopting true continuous deployment, we should be able to tell whether our data products can accurately answer the critical business questions they are meant to answer. Adding or modifying data products should involve careful consideration of the implications of these changes, something that tends to be overlooked.
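To make the schema-vs-semantics distinction concrete, here is a minimal sketch of such a test: every row below is schema-valid (a date string plus a float), yet the check still flags values that make no business sense. The column names and bounds are assumptions for illustration.

```python
# Sketch of a semantic data test: rows conform to the schema
# (date string, float revenue), but the check below still catches data
# that is blatantly wrong. Column names and bounds are assumptions.

def check_revenue_rows(rows, max_daily_revenue=1_000_000.0):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    for i, row in enumerate(rows):
        revenue = row["revenue"]
        if revenue < 0:
            violations.append(f"row {i}: negative revenue {revenue}")
        elif revenue > max_daily_revenue:
            violations.append(f"row {i}: implausibly large revenue {revenue}")
    return violations

rows = [
    {"date": "2024-06-01", "revenue": 1200.0},
    {"date": "2024-06-02", "revenue": -50.0},        # schema-valid, semantically wrong
    {"date": "2024-06-03", "revenue": 9_999_999.0},  # schema-valid, implausible
]
problems = check_revenue_rows(rows)
for p in problems:
    print(p)
```

Checks like this can run as a gate in the pipeline, failing a deployment the same way a broken end-to-end test would in software.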

What tools are available?

Each major data warehouse and lakehouse provider offers robust features to build your own governance framework. Snowflake provides data quality monitoring and metric functions. It also allows users to define column-level and row-level security policies using object tagging and tag-based policies. For data auditing, Snowflake offers the Trust Center. It integrates smoothly with DBT for additional checks as needed. While their cataloging and lineage are not among my favourites, Snowflake’s decision to open-source the Polaris catalog (formerly Iceberg catalog) indicates that they want to invest further in this area.

With Databricks, you benefit from their excellent Unity Catalog, which provides features for catalog, lineage and auditing. Data quality for DLTs can be assessed using the expectations framework. They support data masking through masking functions. As is often the case with Databricks, creative problem-solving is needed for more extensive testing and general data quality. I recommend exploring their notebooks and workflows for these purposes.

Both aim to establish themselves as comprehensive AI platforms. Databricks strives to bridge the usability gap, as it is still seen as a company specializing in custom data warehouses. Snowflake seems more focused on its standalone compute capabilities, including AI model training, being currently recognized primarily as a “database in the cloud”. As mentioned above, both companies have recently made significant investments in open-source data governance. The future seems really exciting!

Among the open-source warehouses, Apache Doris currently lags behind in the governance area. However, with increasing traction in open-source governance projects, it might actually benefit from that “war” and quickly narrow the gap.

Cloud-based warehouses are usually deployed on the major cloud providers’ infrastructure. These providers have their own set of services well suited to compete in the governance space. AWS offers the well-established AWS Glue Catalog along with services like AWS Glue Data Quality, Amazon Macie for PII protection, AWS CloudTrail for auditing, and AWS Glue DataBrew for data cleaning and testing. In addition, there is another governance solution that might suit your needs: Amazon DataZone. Additionally, DBT can be integrated with services like Athena or Redshift, enabling more advanced use cases.

On Google Cloud Platform (GCP), you would likely utilize Data Catalog, Dataplex, Sensitive Data Protection for safeguarding PII, and Cloud Audit Logs for, well, auditing. Similarly, DBT can be integrated with services like BigQuery to further enhance data transformation capabilities.

How about streaming vendors?

In the streaming space, governance solutions are a bit less mature, but they are catching up. Out-of-the-box offerings for data quality, PII detection and testing may be limited. Clients are usually interested in auditing capabilities, enforcing schemas via tools like Schema Registry and managing permissions with tools such as KLAW. Data catalogs specific to streaming exist with Confluent Stream Catalog and VVP Catalog being top contenders.
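The schema enforcement mentioned above typically means a registry rejecting incompatible schema changes before producers can break consumers. The sketch below shows the core idea of a backward-compatibility check in miniature; it uses plain dicts rather than a real Avro or Schema Registry API, and the schemas are hypothetical.

```python
# Minimal sketch of a backward-compatibility check in the spirit of a
# schema registry: a new schema can read old data only if every field it
# adds has a default. Schemas are plain dicts, not a real Avro API.

def is_backward_compatible(old_schema, new_schema):
    """True if data written with old_schema is readable under new_schema."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # new required field: old records cannot supply it
    return True

old = {"fields": [{"name": "id"}, {"name": "email"}]}
ok = {"fields": [{"name": "id"}, {"name": "email"},
                 {"name": "country", "default": "unknown"}]}
bad = {"fields": [{"name": "id"}, {"name": "email"},
                  {"name": "country"}]}  # no default: breaks old readers

print(is_backward_compatible(old, ok))   # True
print(is_backward_compatible(old, bad))  # False
```

Real registries support several compatibility modes (backward, forward, full, and their transitive variants), but they all reduce to checks of this flavor run at registration time.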

Major platforms offer reasonable governance capabilities, with RBAC access and workspace/tenant isolation. They feature data catalogs and can integrate with external catalogs (mostly Hive or JDBC-compatible ones). You can surely expect the platform to provide audit logging. Ververica, Aiven and Confluent satisfy this baseline.

There is still a gap in tooling that allows for granular access control at the individual field level within events. While data warehouse access controls often provide such granularity, streaming platforms typically do not. This is an active area of development, though: Confluent Platform already supports data quality rules. I like the way these rules are defined; they are automatically shared with the producer or consumer applications, and different sets of rules can be applied during migrations (schema upgrades or downgrades) or when reading and writing data.
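Until platforms close this gap natively, field-level control often ends up implemented in application code. The sketch below shows the general shape of consumer-side masking; the event structure, the PII field list, and the role-to-fields mapping are all illustrative assumptions, not any vendor's API.

```python
# Sketch of consumer-side field-level masking for streaming events, a
# stopgap while platforms lack native per-field access control. The
# event shape and the PII field list are illustrative assumptions.
import copy

PII_FIELDS = {"email", "phone"}

def mask_event(event, allowed_fields):
    """Mask any PII field the caller's role is not entitled to see."""
    masked = copy.deepcopy(event)  # leave the original event untouched
    for field in PII_FIELDS - set(allowed_fields):
        if field in masked:
            masked[field] = "***"
    return masked

event = {"user_id": 42, "email": "a@example.com",
         "phone": "555-0100", "amount": 9.99}
analyst_view = mask_event(event, allowed_fields={"email"})
print(analyst_view)  # phone is masked, email passes through
```

The obvious drawback, and the reason native support matters, is that every consumer has to be trusted to apply this logic, whereas warehouse-style policies are enforced centrally.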

I understand that governance in streaming is significantly more challenging. Data quality checks need to operate in real time without delaying decision-making; things are less strict for real-time analytics, of course. Currently, platforms have not yet fully developed solutions that deliver either capability effectively.

The landscape is evolving rapidly. Rockset, a real-time analytics database recently acquired by OpenAI, already offers governance capabilities. Other products in the streaming space for specific use cases are also highlighting the importance of governance. For example, you can check out this guide from TinyBird, the team behind a database built on top of ClickHouse.

Conclusion

Although I tend to be picky and critical in my assessments, I wouldn’t say the governance landscape is in bad shape. There’s a lot of traction, with both users and executives beginning to understand its importance. The open sourcing of Unity Catalog and Polaris further demonstrates that the industry is now treating governance with the seriousness it deserves.

On the other hand, projects like OpenLineage and Marquez have been around for a while, yet they are not commonly deployed in companies. Governance is likely to gain popularity due to the rise of AI, which will increase privacy concerns. Companies will need to ensure they have the right policies in place for data access. This increased focus on governance is ultimately beneficial for the end consumer.

Does this mean you should worry if your governance practices aren’t top notch? Not necessarily. While governance can’t be ignored completely, it’s natural for companies to place more emphasis on it as they mature. For startups and mid-sized companies, existing governance tools generally meet their needs. It’s primarily a matter of allocating time to manage policies properly.

Regarding streaming, I see more and more inquiries about the governance landscape. There are many excellent articles from major streaming platform providers showcasing their governance capabilities. While Rome wasn’t built in a day, these providers are doing an excellent job bridging the gap. However, be mindful that your journey may not be as straightforward as the marketing materials suggest, and you might need the help of experienced streaming consultants to get started.

Written by Adrian Bednarz

Staff Data Engineer — AI / Big Data / ML