Apache Iceberg

What Is Apache Iceberg?

Apache Iceberg is an open-source table format designed for large-scale, complex datasets that span petabytes of data. Originating as a solution to manage massive tables efficiently at Netflix, it was open-sourced under the Apache Incubator in 2018 and graduated in 2020.
Apache Iceberg sits between compute engines (such as Flink and Spark) and file formats (such as ORC, Parquet, and Avro). It acts as a middleware layer that abstracts away the complexity of the storage formats beneath it and presents unified, table-like semantics to the compute frameworks above. This design allows flexible data operations and schema management across different computing environments without binding to any specific storage engine, and it runs on HDFS, S3, OSS, and other storage systems.

 

Key Aspects of Iceberg Table Format

  • Intermediary layer functionality: Iceberg's table format acts as a middle layer, managing files on the storage system below while exposing rich interfaces to the compute layer above. Details such as partitioning, data file formats, compression codecs, and directory layout on systems like HDFS are all maintained in a metastore, which can be thought of as a file organization format.

  • File organization: A well-designed file organization format like Iceberg's makes it more efficient for compute layers to access files on disk, speeding up operations such as listing, renaming, or searching. This efficiency stems from Iceberg's clear separation between a table's physical representation (rows and columns stored in files) and its logical, database-style description: the table's schema, file organization (partitioning method), metadata (statistics and index information), and read/write APIs.

  • Schema and partitioning: Iceberg's schema defines the supported field types, from primitives such as integers and strings to complex nested types. Files within a table are typically organized by a partitioning pattern, such as range or hash partitioning, which is crucial for efficient data access and management.

Metadata Management in Iceberg

 

The architecture of an Apache Iceberg table

 

Iceberg employs a layered approach to data management, distinguishing between the metadata management layer and the data storage layer. The metadata management is further divided into three key components:

  • Metadata File: Stores the current version's metadata, including all snapshot information.
  • Snapshot: Represents the table state after a specific operation; each commit generates a new snapshot that references multiple manifests, which record the locations of the data files produced.
  • Manifest: Lists the data files associated with a snapshot, providing a comprehensive view of the data's organization and facilitating efficient data retrieval and modification.

At its core, Iceberg aims to track all changes to a table over time through snapshots, which represent complete collections of table data files at any given moment. Each update operation generates a new snapshot, ensuring data consistency and facilitating historical data analysis and incremental reads.  
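The metadata hierarchy and snapshot-per-commit behavior described above can be sketched in a few lines of Python. This is a toy model for illustration only: the class and field names are simplified stand-ins, not Iceberg's actual spec (which also includes manifest lists, sequence numbers, and much more).

```python
from dataclasses import dataclass, field

@dataclass
class Manifest:
    data_files: list        # paths of Parquet/ORC/Avro data files

@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list         # manifests reachable from this snapshot

    def all_files(self):
        return [f for m in self.manifests for f in m.data_files]

@dataclass
class TableMetadata:
    current_snapshot_id: int = 0
    snapshots: dict = field(default_factory=dict)

    def commit(self, new_files):
        # Each commit creates a NEW snapshot; earlier snapshots are kept,
        # which is what enables time travel and historical analysis.
        prev = self.snapshots.get(self.current_snapshot_id)
        manifests = list(prev.manifests) if prev else []
        manifests.append(Manifest(data_files=list(new_files)))
        sid = self.current_snapshot_id + 1
        self.snapshots[sid] = Snapshot(sid, manifests)
        self.current_snapshot_id = sid
        return sid

meta = TableMetadata()
meta.commit(["s3://bucket/t/data-0.parquet"])
meta.commit(["s3://bucket/t/data-1.parquet"])
current = meta.snapshots[meta.current_snapshot_id]
# current sees both files; snapshot 1 still sees only the first file
```

Because old snapshots remain addressable, a reader can pin a query to any historical snapshot id, which is the basis of the time-travel and incremental-read features discussed later.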

 

Exploring the Benefits of Apache Iceberg

Comprehensive Compute Engine Support

Iceberg's well-abstracted core ensures that it is not tied to any specific compute engine, providing broad support for popular processing frameworks like Spark, Flink, and Hive. This flexibility allows users to integrate Iceberg seamlessly into their existing data infrastructure, and the native Java API offers direct access to Iceberg tables without being constrained by the choice of computation engine.

Flexible File Organization

Iceberg introduces flexible data organization strategies that support both stream-based incremental and batch full-table computation models. This versatility ensures that batch tasks and streaming tasks can operate on the same storage model, such as HDFS or Ozone (a next-generation storage system from the Hadoop community). By enabling hidden partitioning and partition layout evolution, Iceberg makes it easy to update data partitioning strategies, and it supports a variety of storage formats including Parquet, ORC, and Avro. This approach not only eliminates data silos but also aids in building cost-effective, lightweight data lake storage services.
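Hidden partitioning means the table stores a transform from row values to partition values, so users never create or maintain partition columns by hand. The sketch below illustrates the idea with two transforms whose names mirror Iceberg's (`day`, `bucket`); the implementations are simplified stand-ins, e.g. Iceberg's real bucket transform uses a 32-bit Murmur3 hash rather than Python's `hash`.

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    # Iceberg's day() transform: derive the partition from a timestamp column
    return ts.strftime("%Y-%m-%d")

def bucket_transform(n: int, value) -> int:
    # Illustrative only: Iceberg really uses Murmur3, not Python's hash()
    return hash(value) % n

row = {"id": 42, "ts": datetime(2024, 3, 1, 13, 5)}
partition = (day_transform(row["ts"]), bucket_transform(16, row["id"]))
# A query filtering on ts is automatically pruned to matching day partitions,
# without the user ever referencing a partition column.
```

Because the transform lives in table metadata, it can also evolve (say, from daily to hourly partitioning) without rewriting existing data files.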

Optimized Data Ingestion Workflow

With its ACID transaction capabilities, Iceberg ensures that newly ingested data is immediately visible without disrupting running data processing tasks, which significantly simplifies the ETL process. Its support for row-level upsert and MERGE INTO operations further reduces data ingestion latency, streamlining the overall flow of data into the data lake.
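The semantics of a row-level MERGE INTO can be sketched as follows. This is an illustration of the operation's matched/unmatched behavior, not Iceberg's engine code, and the key and field names are hypothetical:

```python
def merge_into(target: dict, source: list) -> dict:
    """MERGE INTO semantics: match source rows to target rows by key;
    matched rows are updated, unmatched rows are inserted."""
    merged = dict(target)
    for row in source:
        merged[row["id"]] = row   # matched -> update, unmatched -> insert
    return merged

table = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
updates = [{"id": 2, "v": "b2"},   # matches id 2 -> update
           {"id": 3, "v": "c"}]    # no match     -> insert
table = merge_into(table, updates)
```

In Iceberg the equivalent statement (e.g. `MERGE INTO t USING updates ON t.id = updates.id ...`) is executed by the compute engine and committed as a single atomic snapshot.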

Incremental Read Capabilities

One of Iceberg's standout features is its support for reading incremental data in a streaming fashion, enabling a tight integration with mainstream open-source compute engines for both data ingestion and analysis. This feature is complemented by built-in support for Spark Structured Streaming and Flink Table Source, allowing for sophisticated data analysis workflows. Additionally, Iceberg's ability to perform historical version backtracking enhances data reliability and auditability, offering valuable insights into data evolution over time.
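Incremental reads fall out of the snapshot design: the data added between two snapshots is simply the difference between the file sets they reference, so a streaming reader only scans new files. The sketch below shows the idea with plain Python sets; real readers of course work from manifest metadata rather than bare path lists.

```python
def incremental_files(snapshots: dict, from_id: int, to_id: int) -> set:
    """Files appended between snapshot from_id and snapshot to_id."""
    return set(snapshots[to_id]) - set(snapshots[from_id])

# Hypothetical snapshot -> data-file mapping
snapshots = {
    1: ["data-0.parquet"],
    2: ["data-0.parquet", "data-1.parquet", "data-2.parquet"],
}
new_files = incremental_files(snapshots, 1, 2)
# A streaming job resuming from snapshot 1 reads only data-1 and data-2.
```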

 

Where and When to Use Apache Iceberg

As one of the core components of a universal data lake solution, Iceberg is primarily suitable for the following scenarios:

Real-time Data Import and Querying

Data flows in real-time from upstream into the Iceberg data lake, where it can be immediately queried. For example, in logging scenarios, Flink or Spark streaming jobs import log data into Iceberg tables in real-time, and that data can then be queried with Hive, Spark, Flink, or Presto. Moreover, Iceberg's support for ACID transactions isolates data inflow from querying, preventing dirty reads.

Data Deletion or Updating

Most data warehouses struggle to efficiently perform row-level data deletions or updates, typically requiring offline jobs to extract the entire table's raw data, modify it, and then write it back to the original table. Iceberg, however, narrows the scope of changes from the table level to the file level, allowing for localized changes to execute business logic for data modifications or deletions. Within the Iceberg data lake, you can directly execute commands like DELETE FROM test_table WHERE id > 10 to make changes to the data in the table.
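The reason Iceberg can scope a statement like `DELETE FROM test_table WHERE id > 10` down to the file level is that manifests record per-file column statistics (min/max values), so only files whose value range overlaps the predicate need to be rewritten. A minimal sketch of that pruning step, with made-up file names and stats:

```python
# Hypothetical per-file column stats, as a manifest would record them
files = [
    {"path": "data-0.parquet", "id_min": 1,  "id_max": 9},
    {"path": "data-1.parquet", "id_min": 5,  "id_max": 20},
    {"path": "data-2.parquet", "id_min": 30, "id_max": 40},
]

def files_to_rewrite(files, lower_bound):
    """Files that may contain rows matching `id > lower_bound`."""
    return [f["path"] for f in files if f["id_max"] > lower_bound]

touched = files_to_rewrite(files, 10)
# data-0.parquet (ids 1..9) is untouched; only the other two are rewritten.
```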

Data Quality Control

Using the validation provided by the Iceberg schema, malformed records can be rejected during data import, or routed aside for further processing.
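Conceptually, schema validation checks each incoming row against the table's declared field types and splits the stream into valid and invalid records. A minimal sketch, with a hypothetical two-field schema:

```python
# Hypothetical schema: field name -> expected Python type
SCHEMA = {"id": int, "event": str}

def validate(row: dict) -> bool:
    """True if every schema field is present with the expected type."""
    return all(isinstance(row.get(name), typ) for name, typ in SCHEMA.items())

rows = [{"id": 1, "event": "click"},
        {"id": "oops", "event": "view"}]   # id has the wrong type

good = [r for r in rows if validate(r)]
bad  = [r for r in rows if not validate(r)]
# good rows are written to the table; bad rows can be dropped or reprocessed
```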

Data Schema Changes

The schema of the data is not fixed and can change; Iceberg supports making changes to the table structure using Spark SQL's DDL statements. When changing the table structure in Iceberg, it is not necessary to re-export all historical data according to the new schema, which greatly speeds up the schema change process. Additionally, Iceberg's support for ACID transactions effectively isolates schema changes from affecting existing read tasks, allowing you to access consistently accurate data.
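The key to cheap schema changes is that adding a column only touches table metadata: old data files are read as-is, and the new column is filled with nulls on read, so no historical data is rewritten. An illustrative sketch (column names are made up):

```python
# Rows from a data file written under schema v1 (no "email" column)
old_file_rows = [{"id": 1, "name": "a"}]

# Schema v2 adds an "email" column; the old file is NOT rewritten
new_schema = ["id", "name", "email"]

def project(row: dict, schema: list) -> dict:
    # Columns absent from the old file surface as None (SQL NULL)
    return {col: row.get(col) for col in schema}

rows = [project(r, new_schema) for r in old_file_rows]
```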

Real-time Machine Learning

In machine learning scenarios, a significant amount of time is often spent processing data, such as cleaning, transforming, and extracting features, as well as dealing with both historical and real-time data. Iceberg simplifies this workflow, turning the entire data processing pipeline into a complete, reliable real-time stream. Operations like data cleaning, transformation, and feature engineering become node actions on the stream, eliminating the need to handle historical and real-time data separately. Iceberg also provides a native Python SDK, making it very approachable for machine learning developers.


What Problem Does Apache Iceberg Solve?

Apache Iceberg solves several critical problems in big data management, including:
  • Handling large-scale datasets efficiently in a cloud-native environment.
  • Providing strong consistency and atomic operations for data lakes, a challenge in eventually consistent storage systems.
  • Simplifying data engineering tasks with schema evolution, partition evolution, and straightforward data recovery mechanisms.
  • Enhancing query performance with optimized metadata management and support for a wide range of data processing engines.
Apache Iceberg offers a comprehensive solution for managing large-scale data lakes with complex data structures, evolving schemas, and the need for robust data integrity and performance. Its design philosophy and features make it a compelling choice for modern data architectures, balancing flexibility with efficiency and reliability.

 

Comparison: Apache Iceberg, Apache Hudi, and Delta Lake

 


  • Apache Iceberg: Stands out for its broad engine compatibility (including Spark, Flink, Presto, and Trino), making it highly versatile. Iceberg's design is optimized for cloud storage efficiency and performance at petabyte scale, addressing issues like metadata scalability and file listing overheads inherent in cloud object stores.
  • Apache Hudi: Provides similar capabilities for record-level inserts, updates, and deletes. Hudi is designed to offer faster incremental processing and comes with built-in support for change data capture (CDC), stream processing, and data privacy use cases.
  • Delta Lake: Like Iceberg, Delta Lake offers ACID transactions, schema evolution, and time travel. However, it is closely integrated with Spark and Databricks, potentially limiting its flexibility with other processing engines.

 

Apache Iceberg + StarRocks

Explore how industry leaders are transforming their data lakes and lakehouses with Iceberg and StarRocks through our curated case studies. Get direct insights into their design strategies and the benefits they're achieving with these technologies.

  • A leading social media company has shortened its development cycle and improved cost-effectiveness for its trillions of daily records of data by switching to a data lakehouse architecture. Read the case study.

  • A gaming giant is reducing storage costs by 15x while eliminating all pre-aggregations through unifying all workloads on its data lakehouse. Read the case study.

  • An A/B testing SaaS platform is unifying its demanding customer-facing workloads on the data lakehouse. Read the case study.

 

CelerData Cloud Tabular Managed Iceberg

 

 

If you are looking for enterprise-standard features, dedicated support, or just want to try these benefits on the cloud, check out our StarRocks-powered CelerData Cloud. Sign up at cloud.celerdata.com for a free trial and see what it can do for you.