What is Unity Catalog?
Databricks launched Unity Catalog in 2021 to provide a unified, interoperable catalog for data and AI workloads, with centralized access control, auditing, lineage, and data discovery. A shared metastore and a single place to manage access controls streamline data governance. In June 2024, Databricks open-sourced Unity Catalog, positioning it as the first open-source catalog for data and AI governance across multiple clouds, data formats, and platforms.
What Drives the Open-Sourcing of Unity Catalog?
Unity Catalog was open-sourced largely in response to community demand for a more flexible and interoperable data governance solution. The project consists of an open API, defined with the OpenAPI specification, and a server implementation released under the Apache 2.0 license. It is compatible with the Apache Hive metastore API and the Apache Iceberg REST catalog API, which eases integration with existing data governance frameworks and improves interoperability across systems.
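Because the API is open, any HTTP client can talk to the catalog. The snippet below is a minimal sketch of how a client might build a "list tables" request. It assumes an open-source Unity Catalog server running locally on port 8080 and a `/api/2.1/unity-catalog` path prefix; both values, along with the helper name `list_tables_url`, are illustrative assumptions to check against the API documentation for your deployment.

```python
from urllib.parse import urlencode

# Assumptions: a local open-source Unity Catalog server on port 8080 and
# this path prefix. Adjust both to match your own deployment.
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"


def list_tables_url(catalog_name: str, schema_name: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a 'list tables' request (hypothetical helper)."""
    # urlencode escapes the values and keeps the parameters in insertion order.
    query = urlencode({"catalog_name": catalog_name, "schema_name": schema_name})
    return f"{base_url}/tables?{query}"


if __name__ == "__main__":
    # Any HTTP client (curl, requests, a query engine connector) could then
    # issue a GET against this URL.
    print(list_tables_url("unity", "default"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test without a running server.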
Supporting multiple data formats such as Delta Lake, Apache Iceberg via UniForm, Apache Parquet, and CSV, Unity Catalog ensures efficient management of diverse data assets. Its extensible architecture allows data cataloged within Unity Catalog to be accessed by virtually all compute engines, facilitating seamless integration into various data processing workflows through open APIs.
The decision to open-source Unity Catalog advances open standards in data governance, encouraging the use of open APIs and improving interoperability across different data systems and tools. This approach enables organizations to integrate various data sources without vendor lock-in, promoting flexibility and adaptability in managing data assets.
Unity Catalog's architecture emphasizes three core principles: open connectivity, unified governance for data and AI, and open access. Open connectivity facilitates the integration of any data source, while unified governance ensures consistent application of security, compliance, and quality standards across both data and AI assets. Built on open standards, Unity Catalog supports access from any compute engine or client, ensuring broad compatibility with various data processing tools.
Unity Catalog governs tables, files, functions, and AI models, and it is backed by an ecosystem of industry partners that includes Amazon Web Services, Microsoft Azure, Google Cloud, Nvidia, Salesforce, DuckDB, LangChain, dbt Labs, Fivetran, Confluent, and Onehouse. CelerData and StarRocks are also among the official partners in this effort. We are excited to bring StarRocks' query performance to Unity Catalog users and to support teams that want to open up their data lakehouse and choose the best solutions for their business.
What Data Challenges Does Databricks Unity Catalog Solve?
Fragmented Data Governance
Problem: Traditionally, data governance has been managed at the workspace level, resulting in fragmented and inconsistent policies. Each workspace operated in isolation, with separate configurations for user management, access controls, and data policies. This fragmentation made enforcing consistent governance standards across the organization difficult.
Solution: Unity Catalog centralizes data governance, allowing organizations to manage metadata, access controls, and data policies from a unified interface. By consolidating these elements at the account level, Unity Catalog ensures consistent governance policies across all workspaces, simplifying administration and improving compliance.
Complex Identity and Access Management
Problem: Managing user identities and access controls across multiple workspaces is cumbersome and error-prone. When each workspace requires its own user and group configuration, administrative overhead grows, and so does the risk of misconfigurations that lead to security vulnerabilities.
Solution: Unity Catalog introduces identity federation, centralizing user and group management at the account level. This approach streamlines identity management, ensuring that all users and groups are consistently synchronized across all workspaces. It reduces administrative overhead and enhances security by providing a single source of truth for identity and access management.
Inefficient Data Lineage and Quality Tracking
Problem: Understanding data lineage and ensuring data quality are crucial for maintaining data integrity and compliance. However, tracking data lineage and quality across disparate systems and workspaces can be challenging, leading to incomplete or inaccurate information.
Solution: Unity Catalog offers comprehensive data lineage tracking and quality monitoring capabilities. It provides a visual representation of data flow, allowing users to see how data moves and transforms across different tables and columns. This transparency helps organizations ensure data integrity and compliance by providing a clear understanding of data origins, transformations, and usage.
Regulatory Compliance and Security
Problem: Compliance with regulations such as GDPR, CCPA, and industry-specific standards is a significant concern for organizations. Ensuring that data governance policies align with regulatory requirements and that sensitive data is adequately protected can be complex and resource-intensive.
Solution: Unity Catalog simplifies compliance by centralizing governance policies and providing tools for data masking, access control, and auditing. It enables organizations to implement and enforce policies that protect sensitive data and ensure compliance with regulatory requirements. By providing a unified governance framework, Unity Catalog helps organizations maintain a secure and compliant data environment.
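To make the idea of data masking concrete, here is a minimal sketch of the logic a masking rule typically encodes: privileged callers see the raw value, everyone else sees a redacted form. This is plain Python for illustration only, not Unity Catalog's own mechanism; the function name `mask_email` and the group name `pii_readers` are hypothetical.

```python
from __future__ import annotations


def mask_email(value: str, caller_groups: set[str], allowed_group: str = "pii_readers") -> str:
    """Return the raw email for privileged callers, a redacted form otherwise."""
    if allowed_group in caller_groups:
        return value
    local, sep, domain = value.partition("@")
    if not sep:
        # Not a well-formed address: hide everything rather than leak it.
        return "***"
    # Keep the first character and the domain so the value stays recognizable.
    return f"{local[:1]}***@{domain}"


if __name__ == "__main__":
    print(mask_email("alice@example.com", {"pii_readers"}))
    print(mask_email("alice@example.com", {"analysts"}))
```

The point of centralizing such a rule is that it is defined once and applied wherever the column is read, instead of being reimplemented in every workspace or query.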
Scalability and Flexibility
Problem: As organizations grow and their data environments become more complex, scaling data governance practices can be challenging. Traditional governance solutions may struggle to keep up with the increasing volume and variety of data, leading to gaps in governance and potential compliance issues.
Solution: Unity Catalog is designed to scale with organizational growth, supporting extensive data environments across multiple clouds and regions. It provides flexible governance capabilities that can adapt to the evolving needs of the organization, ensuring that data governance practices remain robust and effective as the data landscape expands.
What Are the Key Components of Databricks Unity Catalog?
Metastore
The metastore in Unity Catalog is the central repository for metadata management. It acts as a top-level container for various governance elements, including catalogs, schemas, tables, and views. Unlike traditional Hive metastores, Unity Catalog's metastore is designed to support centralized data governance across multiple Databricks workspaces.
Catalog
A catalog is a logical collection of schemas (or databases). It provides an organizational layer that helps manage data at a higher level, allowing for better separation of data by business unit, environment (such as development, staging, and production), or other logical groupings. Each catalog is contained within a metastore and inherits its governance policies.
Schema (Database)
A schema, also known as a database, is a collection of tables and views. Schemas help organize data within a catalog, providing a structured way to manage and access data assets. They inherit governance policies from their parent catalog.
Tables and Views
Tables are structured collections of data, typically organized into rows and columns. Views are virtual tables created by querying one or more tables. Unity Catalog manages both tables and views, ensuring that governance policies such as access controls and data lineage tracking are consistently applied.
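Taken together, the catalog, schema, and table levels give every table a three-part name of the form catalog.schema.table. The sketch below models that name in Python to show how the hierarchy nests. The class `TableName` and its methods are illustrative, and the parser deliberately ignores quoted identifiers that contain dots.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableName:
    """A fully qualified name: catalog.schema.table (illustrative model)."""

    catalog: str
    schema: str
    table: str

    @classmethod
    def parse(cls, full_name: str) -> "TableName":
        # Simplification: assumes no quoted identifiers containing dots.
        parts = full_name.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"

    def parents(self) -> list[str]:
        """Enclosing containers, from nearest (schema) to farthest (catalog)."""
        return [f"{self.catalog}.{self.schema}", self.catalog]


if __name__ == "__main__":
    name = TableName.parse("prod.sales.orders")
    print(name, name.parents())
```

The `parents()` list is the chain along which governance policies are inherited: a table sits in a schema, which sits in a catalog.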
Identity Federation
Unity Catalog introduces identity federation, which centralizes user and group management at the account level. This ensures that identities are consistently synchronized across all workspaces, simplifying user management and enhancing security. Identity federation reduces administrative overhead by providing a single source of truth for identity and access management.
Access Control Lists (ACLs)
Access Control Lists are used to define fine-grained permissions for users and groups on catalogs, schemas, tables, and views. Unity Catalog's ACLs enable precise control over who can access and manipulate data, ensuring that security and compliance requirements are met.
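The following toy model sketches how permissions granted on a container can flow down to the objects inside it. It is a simplified illustration, not Unity Catalog's implementation: the class `ToyAcl` is hypothetical, group membership is not resolved, and the additional USE CATALOG and USE SCHEMA privileges that Unity Catalog requires on parent containers are left out.

```python
from collections import defaultdict


class ToyAcl:
    """Toy permission store with downward inheritance (illustration only)."""

    def __init__(self) -> None:
        # (principal, securable name) -> set of privilege names
        self._grants = defaultdict(set)

    def grant(self, principal: str, privilege: str, securable: str) -> None:
        self._grants[(principal, securable)].add(privilege)

    def revoke(self, principal: str, privilege: str, securable: str) -> None:
        self._grants[(principal, securable)].discard(privilege)

    def is_allowed(self, principal: str, privilege: str, securable: str) -> bool:
        # Check the object itself, then each parent: a.b.c -> a.b -> a.
        parts = securable.split(".")
        while parts:
            granted = self._grants.get((principal, ".".join(parts)), set())
            if privilege in granted:
                return True
            parts.pop()
        return False


if __name__ == "__main__":
    acl = ToyAcl()
    acl.grant("analysts", "SELECT", "prod.sales")  # grant on a schema
    print(acl.is_allowed("analysts", "SELECT", "prod.sales.orders"))
    print(acl.is_allowed("analysts", "SELECT", "prod.hr.salaries"))
```

Granting at the schema or catalog level keeps the number of grants small, while table-level grants remain available for exceptions.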
Data Lineage and Quality Monitoring
Unity Catalog includes robust data lineage and quality monitoring capabilities. Data lineage provides a visual representation of data flow, showing how data moves and transforms across different tables and columns. Quality monitoring helps ensure data integrity by tracking and enforcing data quality standards.
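Conceptually, table-level lineage is a directed graph in which each table points to the tables it was derived from. The sketch below walks such a graph to find every upstream source of a table. The edge data and the function `all_upstream` are hypothetical examples, not output from Unity Catalog.

```python
from __future__ import annotations

from collections import deque

# Hypothetical lineage edges: each table maps to the tables it reads from.
UPSTREAM = {
    "prod.reports.daily_revenue": ["prod.sales.orders", "prod.sales.refunds"],
    "prod.sales.orders": ["raw.ingest.orders_json"],
    "prod.sales.refunds": ["raw.ingest.refunds_json"],
}


def all_upstream(table: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every direct and transitive source of `table` (breadth-first)."""
    seen: set[str] = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for source in edges.get(current, []):
            if source not in seen:  # also guards against cycles
                seen.add(source)
                queue.append(source)
    return seen


if __name__ == "__main__":
    print(sorted(all_upstream("prod.reports.daily_revenue", UPSTREAM)))
```

The same traversal run over reversed edges answers the impact question: which downstream tables are affected if a source changes.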
Unity Catalog in Action: Achieving Interoperability with Unity Catalog, Delta UniForm, and StarRocks
Unity Catalog facilitates interoperability between data formats and compute engines. Whether the data is stored as Delta Lake, Apache Iceberg, or Apache Hudi tables, Unity Catalog allows it to be accessed and managed efficiently across different platforms. This capability is critical for modern data infrastructures, especially in AI-driven environments where diverse data must be analyzed and integrated across multiple engines. By simplifying system integration, Unity Catalog provides a consistent user experience and reduces development time and effort, enabling faster solution deployment.
To see this in action, the video below demonstrates how StarRocks, a powerful open-source query engine that reads Delta tables through Delta Kernel Java, works alongside Unity Catalog and Delta UniForm to manage different table formats.
Conclusion
Databricks Unity Catalog offers a robust and comprehensive solution for tackling key data governance challenges. By centralizing metadata, access controls, and governance policies, it helps streamline data management and enhances security across multiple workspaces. Its features, including identity federation, data lineage tracking, and quality monitoring, are designed to ensure data integrity and compliance in complex data environments.
With its open-source architecture, Unity Catalog supports seamless integration with various data systems and formats, promoting interoperability and flexibility. This approach enables organizations to effectively manage their data assets without being tied to a specific vendor. As data environments continue to grow and evolve, Unity Catalog provides the necessary tools to maintain consistent and scalable governance practices.