Polaris, now Apache Polaris™ (incubating) at the Apache Software Foundation, is an open-source metadata catalog service designed specifically for Apache Iceberg. It implements the Iceberg REST Catalog API, enabling multiple compute engines (Spark, Flink, Trino, Snowflake, etc.) to consistently discover, read, and write the same tables stored in open formats like Parquet, without copying data or giving up atomic operations.
Iceberg organizes data in Parquet files with structured metadata:
- Data files stored as Parquet are listed in manifest files, which also record column statistics, file paths, and partition information.
- A manifest list points to all manifests for a given snapshot.
- A top-level metadata JSON file points to the current snapshot.
This enables schema evolution, atomic writes, snapshots, and time travel, all on standard Parquet files in cloud storage.
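A quick way to see these layers is to walk them from a client. The sketch below uses PyIceberg against a hypothetical REST catalog endpoint; the URI, credentials, warehouse, and table name are placeholders, and it assumes the Iceberg table already exists.

```python
# Minimal sketch (hypothetical catalog URI, credentials, and table name):
# walking Iceberg's metadata layers for an existing table with PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    type="rest",
    uri="https://my-polaris.example.com/api/catalog",
    credential="<client-id>:<client-secret>",
    warehouse="production",
)

table = catalog.load_table("sales.q1_2025")
print(table.metadata_location)            # top-level metadata JSON
snapshot = table.current_snapshot()       # current snapshot from that JSON
print(snapshot.manifest_list)             # manifest list for the snapshot
for manifest in snapshot.manifests(table.io):
    print(manifest.manifest_path)         # each manifest lists Parquet data files
```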
While Iceberg structures metadata within storage, you still need a catalog service to:
- Map logical table names (e.g., sales.q1_2025) to their metadata pointer,
- Ensure atomic snapshot updates, and
- Help engines discover tables, list namespaces, and enforce access control.
Open implementations like the Hive Metastore or the Hadoop catalog aren't ideal for modern cloud-native, cross-engine environments.
Polaris implements Iceberg’s REST API, so clients like Spark or Trino simply point to:
catalog.type=rest
catalog.uri=https://my-polaris.example.com
and can issue:
CREATE TABLE sales.transactions (...) USING iceberg;
That single statement takes care of manifest creation, metadata JSON updates, and consistent table publication across engines.
The engine writes the Parquet files; Polaris updates the manifest, manifest list, and metadata JSON. For example, Spark writes a Parquet file partitioned by date='2025-07-04', and column statistics are captured in the Iceberg manifest automatically.
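As a rough sketch of that flow, the PySpark configuration below points a Spark session at a Polaris REST catalog and creates the table. The catalog name, URI, credential, and warehouse are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
# Sketch only: Spark talking to an Iceberg REST catalog served by Polaris.
# URI, credential, and warehouse values are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-sketch")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://my-polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.polaris.warehouse", "production")
    .getOrCreate()
)

# Spark writes the Parquet data and manifests; Polaris commits the new
# metadata JSON so every engine sees the same snapshot.
spark.sql("""
    CREATE TABLE IF NOT EXISTS polaris.sales.transactions (
        txn_id   BIGINT,
        amount   DECIMAL(10, 2),
        txn_date DATE
    ) USING iceberg
    PARTITIONED BY (txn_date)
""")
```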
Polaris introduces:
- Principal roles (e.g., “etl-engine”, “analyst-user”),
- Catalog roles (e.g., reader, writer, admin),
- And permissions that attach catalog-level, namespace-level, or table-level privileges.
For example, the etl-engine role gets TABLE_WRITE_DATA on sales.transactions, while monthly-analyst gets only TABLE_READ_DATA.
When a user runs an INSERT on a Parquet-backed table, Polaris issues scoped credentials if they have permission and returns a 403 if they don't.
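A hedged sketch of how such a grant might be issued over Polaris's OAuth-protected management API is shown below. The endpoint paths, JSON payload shapes, and role names are assumptions for illustration, so check the Polaris management API spec before relying on them.

```python
# Illustrative only: granting TABLE_WRITE_DATA on sales.transactions to a
# catalog role and binding it to the etl-engine principal role. Endpoint
# paths and JSON shapes are assumptions, not a verified Polaris contract.
import requests

BASE = "https://my-polaris.example.com/api/management/v1"   # hypothetical host
HEADERS = {"Authorization": "Bearer <admin-oauth-token>"}

# Grant a table-level privilege to the "writer" catalog role in the
# "production" catalog (assumed request shape).
requests.put(
    f"{BASE}/catalogs/production/catalog-roles/writer/grants",
    headers=HEADERS,
    json={"grant": {"type": "table",
                    "namespace": ["sales"],
                    "table-name": "transactions",
                    "privilege": "TABLE_WRITE_DATA"}},
).raise_for_status()

# Attach that catalog role to the etl-engine principal role (assumed shape).
requests.put(
    f"{BASE}/principal-roles/etl-engine/catalog-roles/production",
    headers=HEADERS,
    json={"catalogRole": {"name": "writer"}},
).raise_for_status()
```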
Instead of embedding global S3/GCS credentials in every engine:
- Polaris vends short-lived, scoped tokens for each operation.
- Tokens are scoped to specific directories or tables, redeemable only if RBAC allows.
For example, an analyst queries a Parquet-backed finance.revenue table. Polaris vends an S3 token scoped to s3://lakehouse/finance/revenue/*; an attempt to access s3://lakehouse/hr/salaries/ fails because the token does not cover that prefix.
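The sketch below shows what that looks like from the analyst's side with PyIceberg. The catalog URI, OAuth credential, and warehouse name are placeholders, and the access-delegation header is included as an assumption about how the scoped storage token is requested.

```python
# Sketch: reading finance.revenue through Polaris with vended, scoped
# storage credentials. URI, credential, and warehouse are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://my-polaris.example.com/api/catalog",
        "credential": "<client-id>:<client-secret>",
        "warehouse": "production",
        # Ask the catalog to vend short-lived storage credentials scoped to
        # the table being accessed (assumed header usage).
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    },
)

table = catalog.load_table("finance.revenue")
# The vended token only covers s3://lakehouse/finance/revenue/*, so reads
# under other prefixes (e.g., hr/salaries) are rejected by storage.
df = table.scan(limit=100).to_pandas()
```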
Polaris distinguishes two kinds of catalogs:
- Internal catalogs: fully managed by Polaris (read/write).
- External catalogs: reflect third-party systems (Glue, Hive, Snowflake-managed Iceberg). They are read-only in Polaris, with metadata updates arriving via notifications from the source system.
You store Iceberg tables in AWS Glue. Polaris creates an external catalog “legacy-glue.” When tables change in Glue, a notifications endpoint tells Polaris to update pointers, and Polaris can vend credentials for those tables too.
Polaris supports multiple catalogs in a single deployment (multi-tenancy). An admin API (protected by OAuth) lets you:
- Define principals,
- Assign them roles,
- Create, read, update, or delete catalogs,
- Manage internal vs. external catalogs.
You create two catalogs: production (internal) and partner_exposed (external, to share Parquet data with partners). The admin API manages access across them.
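A sketch of creating those two catalogs through the OAuth-protected admin API might look like the following; the endpoint path, payload fields, and storage locations are assumptions for illustration rather than the exact Polaris management API contract.

```python
# Illustrative only: creating an internal and an external catalog via the
# management API. Paths, field names, and values are assumed placeholders.
import requests

BASE = "https://my-polaris.example.com/api/management/v1"
HEADERS = {"Authorization": "Bearer <admin-oauth-token>"}

for catalog in [
    {"name": "production", "type": "INTERNAL",
     "properties": {"default-base-location": "s3://lakehouse/production/"}},
    {"name": "partner_exposed", "type": "EXTERNAL",
     "properties": {"default-base-location": "s3://lakehouse/shared/"}},
]:
    requests.post(f"{BASE}/catalogs", headers=HEADERS,
                  json={"catalog": catalog}).raise_for_status()
```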
Practical Deployment Scenarios for Polaris Catalog
Challenge: A data platform spans batch ETL, streaming ingestion, machine learning, and interactive analytics. These workflows use different engines—each with its own preferred way of querying or writing data. Without coordination, teams duplicate tables, lose schema consistency, and waste resources syncing data.
Architecture:
- Event ingestion: Kafka streams user clickstream events.
- Processing: Spark jobs run hourly, transforming and writing enriched events to partitioned Iceberg tables stored as Parquet in S3.
- Analytics: Trino connects to the same Iceberg tables via Polaris to power dashboards and ad-hoc queries.
- ML workloads: Python notebooks use PyIceberg to run training data extractions and feature joins.
Polaris Role:
- Acts as the single catalog used by all engines.
- Manages Iceberg metadata: manifest files, snapshots, schema evolution.
- Enforces access policies through RBAC: Spark jobs can write, Trino users can read, ML users can only query approved tables.
- Issues scoped storage credentials so each engine accesses only the Parquet directories it is permitted to.
Example:
A Spark job writes to s3://lakehouse/events/2025/07/06. Polaris records a new snapshot referencing those Parquet files. Moments later, Trino queries the table using Polaris, retrieving consistent schema and metadata. Machine learning pipelines consume the same table without schema mismatches or duplicated datasets.
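To make the ML leg of this scenario concrete, here is a hedged sketch of a notebook pulling training data from a shared events table with PyIceberg; the catalog settings, table name, and column names are assumptions.

```python
# Sketch: a notebook extracts training data from the shared events table.
# Catalog settings, table name, and columns are hypothetical placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog(
    "polaris",
    type="rest",
    uri="https://my-polaris.example.com/api/catalog",
    credential="<client-id>:<client-secret>",
    warehouse="production",
)

events = catalog.load_table("events.clickstream")
# Read only recent partitions and the columns the model actually needs;
# Iceberg's manifest statistics let the scan skip irrelevant Parquet files.
training_df = events.scan(
    row_filter=GreaterThanOrEqual("event_date", "2025-07-01"),
    selected_fields=("user_id", "event_type", "event_date"),
).to_pandas()
```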
Challenge: An organization historically used Delta Lake for structured batch analytics. As it scales up interactive and multi-engine use, it wants to migrate to Apache Iceberg—without interrupting existing pipelines.
Architecture:
- Legacy Delta tables are stored in s3://warehouse/delta/.
- New Iceberg tables are created in s3://warehouse/iceberg/ using Spark jobs.
- Both formats must be queryable in a unified catalog during the migration window.
Polaris Role:
- Supports generic table registration: Delta tables can be registered as external generic tables, while Iceberg tables are managed natively.
- Enables Spark, Trino, and Snowflake to discover both types under one catalog.
- Provides consistent namespaces and role-based access control during the transition.
Example:
A Spark ETL job writes Parquet files to an Iceberg-managed table. Trino can query both the finance.revenue_delta and finance.revenue_iceberg tables from the same Polaris catalog, while dashboards slowly migrate queries to Iceberg. Data engineers can deprecate Delta tables in stages.
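One hedged sketch of a single migration step is shown below. It assumes a Spark session configured with the "polaris" REST catalog as in the earlier sketch, and that the Delta Lake and Iceberg Spark packages are on the classpath; paths and table names are placeholders.

```python
# Sketch only: copy a legacy Delta table into a Polaris-managed Iceberg table.
# Paths, catalog name, and table names are hypothetical; assumes a Spark
# session configured with the "polaris" REST catalog as shown earlier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # catalog config as shown earlier

legacy = spark.read.format("delta").load("s3://warehouse/delta/finance/revenue")

# Publish the data as an Iceberg table under the Polaris catalog; later loads
# can switch to incremental MERGE/INSERT while the Delta copy is deprecated.
legacy.writeTo("polaris.finance.revenue_iceberg").using("iceberg").createOrReplace()
```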
Challenge: A third-party partner manages customer event data in their own AWS account using the AWS Glue catalog. Your data scientists need access to that data for reporting and model training—but the partner wants to retain full control over writes.
Architecture:
- The partner owns s3://partner-data/events/ and manages metadata in AWS Glue.
- Your team uses a self-hosted Polaris catalog that must reflect their tables for read-only access.
- Internal dashboards and analytics must authenticate and securely access partner data without duplicating storage.
Polaris Role:
- Registers the partner's Glue tables as an external catalog (partner_catalog) in Polaris.
- Uses Polaris's notification API to keep Iceberg metadata pointers in sync when partner updates occur upstream.
- Issues time-bound, scoped credentials via credential vending: analysts get access only to s3://partner-data/events/ and cannot write or delete files.
Example:
A data scientist queries the partner_catalog.events_2025 table from Trino. Polaris validates their RBAC permissions, issues a token scoped to the Parquet files under that table, and serves a consistent view, even though the data itself is still controlled and written by the partner.
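As a hedged sketch of the sync path in this scenario, the snippet below shows an automation job notifying Polaris after the partner commits a new snapshot in Glue. The endpoint path, payload fields, and metadata location are assumptions, so consult the Polaris catalog API spec for the actual notification contract.

```python
# Illustrative only: tell Polaris that the external table's metadata moved.
# Endpoint path and payload shape are assumptions, not a verified contract.
import time
import requests

POLARIS = "https://my-polaris.example.com/api/catalog/v1/partner_catalog"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

requests.post(
    f"{POLARIS}/namespaces/events/tables/events_2025/notifications",
    headers=HEADERS,
    json={
        "notification-type": "UPDATE",
        "payload": {
            "table-name": "events_2025",
            "timestamp": int(time.time() * 1000),
            # Location of the new metadata.json committed by the partner's engine.
            "metadata-location": "s3://partner-data/events/metadata/00042-new.metadata.json",
        },
    },
).raise_for_status()
```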
Each of these examples illustrates Polaris’s ability to function as a neutral, secure metadata coordination layer in modern data infrastructure. Whether facilitating multi-engine read/write workflows, enabling gradual table format transitions, or allowing governed third-party collaboration, Polaris brings consistency, security, and interoperability to Iceberg on Parquet at scale.
As open table formats like Apache Iceberg, Delta Lake, and Apache Hudi become foundational to modern lakehouse architectures, the metadata catalog layer plays a critical role. A catalog must reliably store metadata, enforce access controls, ensure ACID table operations, and enable multi-engine interoperability.
Three prominent open-source metadata catalogs in this space are:
- Apache Polaris (incubating): an open-source, REST-based Iceberg catalog that delivers consistent metadata for Spark, Flink, Trino, Snowflake, StarRocks, and more, with built-in access control and secure credential vending.
- Databricks Unity Catalog (OSS since June 2024): originally a proprietary Databricks catalog, now released under the Apache 2.0 license. Unity Catalog offers multimodal data and AI governance, built-in lineage, native format support (Iceberg, Delta, Hudi), and cross-engine interoperability.
- Dremio Nessie: an open-source, Git-style catalog for Iceberg. Nessie focuses on version control and collaboration, enabling branching, commits, and reproducible cross-table transactions, which makes it well suited to development and experimentation workflows.
Below is a detailed, up-to-date comparison across format support, engine compatibility, governance, deployment models, and tenancy to help you choose the right catalog for your use case.
Feature | Polaris | Dremio Nessie | Unity Catalog (OSS since Jun 2024) |
---|---|---|---|
Iceberg | ✅ Native support | ✅ Native | ✅ Native |
Delta Lake | ✅ Generic table integration | ❌ Not supported | ✅ Native |
Apache Hudi | ✅ Generic tables via REST | ❌ Not supported | ✅ Preview support |
Parquet/Generic tables | ✅ Generic via REST | ❌ Not supported | ✅ Supported via REST interfaces |
Polaris and Unity Catalog support more formats, including Delta and Hudi. Nessie remains Iceberg-only.
Feature | Polaris | Dremio Nessie | Unity Catalog |
---|---|---|---|
Interface | Iceberg REST Catalog API | Iceberg-native clients only | Unity REST API & Iceberg REST |
Spark | ✅ Yes | ✅ Yes | ✅ Yes |
Flink | ✅ Yes | ✅ Yes | ⚠️ Partial preview |
Trino/Presto | ✅ Yes | ✅ Yes | ✅ Yes |
StarRocks | ✅ Yes (REST External Catalog) | ❌ Not supported | ✅ Yes (via Iceberg REST) |
PyIceberg | ✅ Yes | ✅ Yes | ❌ Not yet supported natively |
Snowflake | ✅ Open Catalog integration | ❌ No | ⚠️ Partial, not first-class |
Polaris and Unity Catalog both support a wide set of engines. Nessie remains focused on Iceberg-native clients.
Feature | Polaris | Dremio Nessie | Unity Catalog |
---|---|---|---|
Self-hosting | ✅ Docker/Kubernetes | ✅ Yes | ✅ OSS version available |
Managed service | ✅ Snowflake-hosted option | ❌ No | ✅ Databricks-hosted + OSS BYOC upcoming |
Multi-cloud support | ✅ AWS/Azure/GCP | ✅ Yes | ⚠️ Databricks-native for now |
BYOC (host in your cloud) | ✅ Yes | ✅ Yes | ⚠️ Planned in OSS |
Polaris and Nessie accommodate self-hosted needs. Unity Catalog is now open-source, with deployment support emerging.
Feature | Polaris | Dremio Nessie | Unity Catalog |
---|---|---|---|
Multi-tenancy | ✅ Realms, roles via REST | ⚠️ Branch-level isolation | ✅ Workspaces + catalog roles |
Cross-catalog federation | ✅ Yes (REST federated read/write) | ❌ No | ✅ Yes (cross-workspace) |
Namespaces support | ✅ Hierarchical nesting | ✅ Branch-style | ✅ Namespace/schema support |
Admin APIs | ✅ REST + OAuth-managed | ⚠️ CLI tools | ✅ Web UI + API |
Polaris and Unity Catalog offer strong multi-tenant and federated architecture. Nessie has branching but lacks formal tenancy constructs.
Unity Catalog (now fully open-source) excels in governance, lineage, and multi-format support, especially within Databricks environments.
Polaris focuses on REST-based interoperability, credential vending, and flexible deployment across clouds without binding to a single platform.
Nessie is ideal if table version control and Git-like workflows on Iceberg are your primary priorities.
Your best choice depends on your data formats, compute engines, governance needs, and whether you're tied to Databricks, prefer BYOC, or favor Iceberg-first versioning workflows.
Apache Polaris™ (Incubating) is a robust, open-source metadata catalog tailored for Apache Iceberg. By implementing the Iceberg REST Catalog API, Polaris enables consistent discovery, schema evolution, and atomic operations across engines like Spark, Flink, Trino, and StarRocks, without moving or copying Parquet data. It fills the gap left by raw object storage such as S3 and by legacy catalogs such as the Hive Metastore, offering essential catalog capabilities: logical table mapping, snapshot management, and role-based access control. Polaris's innovations (scoped credential vending, internal/external catalog support, and multi-tenant admin APIs) allow it to serve as both a secure gatekeeper and a flexible governance layer across multi-engine environments.
In practice, Polaris enforces secure control over metadata and storage access, promotes format interoperability (Iceberg, Delta, Parquet, Hudi), supports gradual migrations, and simplifies data sharing with partners—all while enabling teams to use their preferred compute engines under one unified catalog. It’s a practical, open-source tool for modern, cloud-native, and cloud-agnostic lakehouses.
Q: Which compute engines work with Polaris?
A: Polaris implements the Iceberg REST API and is officially compatible with engines such as Apache Spark, Flink, Trino, Apache Doris, and StarRocks.
Q: Is Unity Catalog open source?
A: Yes. Unity Catalog was open-sourced under the Apache 2.0 license on June 12, 2024, and donated to the LF AI & Data Foundation.
Q: Can Snowflake read and write Polaris-managed tables?
A: Snowflake supports reading from Polaris via Open Catalog. Full write support is in preview or upcoming integration stages.
Q: How does credential vending work?
A: Polaris issues short-lived, scoped credentials for systems like S3 or GCS when RBAC authorizes an operation, ensuring secure access only to permitted directories.
Q: Can Polaris catalog non-Iceberg tables such as Delta Lake or Hudi?
A: Yes. Polaris supports generic external tables, so you can register Delta and Hudi tables for read-only usage alongside Iceberg tables.
Q: How does Polaris support multi-tenancy?
A: Polaris uses realms, OAuth-secured admin APIs, and multiple internal/external catalogs to maintain strong multi-tenant separation.
Q: Does Unity Catalog OSS interoperate with Iceberg?
A: Yes. Unity Catalog OSS supports the Iceberg REST API and integrates with features like Delta Lake UniForm to enable cross-engine metadata access.
Q: Which catalog should I choose?
A:
- Choose Polaris if you require REST-based interoperability across multiple engines and open formats, with flexible deployment options.
- Choose Unity Catalog if you're heavily invested in Databricks and need built-in governance, lineage, and policy enforcement across formats.
- Use Nessie if your workloads are fully Iceberg-based and depend on branching and table versioning.