Iceberg employs a layered approach to data management, distinguishing between the metadata management layer and the data storage layer.
Data Layer: This layer contains the actual data files, such as Parquet and ORC files, which physically store the table's data.
Metadata Layer: The metadata layer in Iceberg is multi-tiered and is responsible for storing the table's structure and the indexing information for its data files. It is further divided into three key components: metadata files, manifest lists, and manifest files.
Catalog Layer: This layer stores pointers to the locations of the metadata files, serving as an entry point to the table's metadata. Iceberg provides several catalog implementations, such as the Hive Metastore, Hadoop, JDBC, and REST catalogs.
At its core, Iceberg aims to track all changes to a table over time through snapshots, which represent complete collections of table data files at any given moment. Each update operation generates a new snapshot, ensuring data consistency and facilitating historical data analysis and incremental reads.
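As an illustrative sketch, assuming a Spark SQL session and a hypothetical table catalog.db.MyTable, the snapshot history can be inspected through Iceberg's snapshots metadata table, and an earlier state of the table can be read back with time travel:
SELECT snapshot_id, committed_at, operation FROM catalog.db.MyTable.snapshots;
SELECT * FROM catalog.db.MyTable TIMESTAMP AS OF '2024-01-01 00:00:00';
SELECT * FROM catalog.db.MyTable VERSION AS OF 1234567890; -- a snapshot_id returned by the first query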
Partitioning is declared directly in the table definition; for example, you can use a statement such as
CREATE TABLE catalog.MyTable (..., ts TIMESTAMP) PARTITIONED BY days(ts)
to create partitions based on daily timestamps. New data is automatically placed in the corresponding partitions.
Iceberg's engine-agnostic kernel abstraction ensures that it is not tied to any specific compute engine, providing broad support for popular processing frameworks like Spark, Flink, and Hive. This flexibility allows users to integrate Iceberg seamlessly into their existing data infrastructure, leveraging the native Java API for direct access to Iceberg tables without being constrained by the choice of computation engine.
Iceberg introduces innovative data organization strategies that support both stream-based incremental and batch full-table computation models. This versatility ensures that both batch and streaming tasks can operate on the same storage layer, such as HDFS or Ozone, a next-generation storage engine developed by the Hadoop community. By enabling hidden partitioning and partition layout evolution, Iceberg facilitates easy updates to data partitioning strategies, supporting a variety of storage formats including Parquet, ORC, and Avro. This approach not only eliminates data silos but also aids in building cost-effective, lightweight data lake storage services.
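Partition layout evolution, mentioned above, is a metadata-only operation. As a sketch, assuming Spark SQL with Iceberg's SQL extensions enabled and the illustrative table from the earlier example, the layout could be switched from daily to monthly partitions; existing files keep their old layout, while newly written data follows the new spec:
ALTER TABLE catalog.MyTable ADD PARTITION FIELD month(ts);
ALTER TABLE catalog.MyTable DROP PARTITION FIELD days(ts);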
With its ACID transaction capabilities, Iceberg ensures that newly ingested data is immediately visible, significantly simplifying the ETL process by eliminating the impact on current data processing tasks. Its support for row-level upsert and MERGE INTO operations further reduces the latency involved in data ingestion, streamlining the overall flow of data into the data lake.
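A minimal sketch of such a row-level upsert in Spark SQL, assuming a hypothetical target table and a staging table of newly arrived rows, both keyed by id:
MERGE INTO catalog.db.target t
USING catalog.db.updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;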
One of Iceberg's standout features is its support for reading incremental data in a streaming fashion, enabling a tight integration with mainstream open-source compute engines for both data ingestion and analysis. This feature is complemented by built-in support for Spark Structured Streaming and Flink Table Source, allowing for sophisticated data analysis workflows. Additionally, Iceberg's ability to perform historical version backtracking enhances data reliability and auditability, offering valuable insights into data evolution over time.
As one of the core components of a universal data lake solution, Iceberg is primarily suitable for the following scenarios:
Data flows in real-time from upstream into the Iceberg data lake, where it can be queried immediately. For example, in logging scenarios, Flink or Spark streaming jobs are initiated to import log data into Iceberg tables in real-time. This data can then be queried in real-time using Hive, Spark, Flink, or Presto. Moreover, Iceberg's support for ACID transactions isolates data ingestion from querying, preventing dirty reads.
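A minimal sketch of such an ingestion pipeline in Flink SQL, assuming a Hive-backed Iceberg catalog and an already-defined streaming source table log_source (the catalog properties, paths, and names here are illustrative):
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore:9083',
  'warehouse' = 'hdfs://namenode:8020/warehouse'
);
INSERT INTO iceberg_catalog.logs.log_table SELECT * FROM log_source;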
Most data warehouses struggle to efficiently perform row-level deletions or updates, typically requiring offline jobs to extract the entire table's raw data, modify it, and then write it back to the original table. Iceberg, however, narrows the scope of changes from the table level to the file level, so business logic for modifying or deleting data can be executed as localized changes. Within the Iceberg data lake, you can directly execute a command like DELETE FROM test_table WHERE id > 10
to make changes to the data in the table.
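Row-level updates follow the same pattern; a minimal sketch against the same table (the column name processed is illustrative):
UPDATE test_table SET processed = true WHERE id > 10;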
With Iceberg's schema validation, abnormal data can be excluded during import, or routed for further processing.
The schema of the data is not fixed and can change; Iceberg supports making changes to the table structure using Spark SQL's DDL statements. When changing the table structure in Iceberg, it is not necessary to re-export all historical data according to the new schema, which greatly speeds up the schema change process. Additionally, Iceberg's support for ACID transactions effectively isolates schema changes from affecting existing read tasks, allowing you to access consistently accurate data.
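A sketch of what such in-place schema changes look like in Spark SQL, against a hypothetical table (note that column type changes are limited to safe promotions such as INT to BIGINT):
ALTER TABLE catalog.db.events ADD COLUMN device_type STRING;
ALTER TABLE catalog.db.events RENAME COLUMN category TO event_category;
ALTER TABLE catalog.db.events ALTER COLUMN id TYPE BIGINT;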
In machine learning scenarios, a significant amount of time is often spent processing data, such as cleaning, transforming, and extracting features, as well as dealing with both historical and real-time data. Iceberg simplifies this workflow, turning the entire data processing process into a complete, reliable real-time stream. Operations like data cleaning, transformation, and feature engineering are all node actions on the stream, eliminating the need to handle historical and real-time data separately. Furthermore, Iceberg also supports a native Python SDK, making it very user-friendly for developers of machine learning algorithms.
Despite Apache Hive's significant role as an early data warehouse solution, it has inherent performance and flexibility limitations that pose challenges for data processing and analysis in practical applications.
StarRocks is the ultimate tool for accelerating queries on Iceberg. It efficiently analyzes data stored locally and can also serve as a computing engine to directly analyze data in data lakes. Users can effortlessly query data stored on Iceberg using the External Catalog provided by StarRocks, eliminating the need for data migration. StarRocks supports reading Iceberg v1 and v2 tables (including position delete and equality delete) and also allows writing data into Iceberg tables.
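As a sketch, assuming a Hive metastore-backed Iceberg catalog (the catalog name and metastore URI are illustrative), an External Catalog can be created in StarRocks and then queried with ordinary SQL:
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore:9083"
);
SELECT count(*) FROM iceberg_catalog.logs.log_table;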
A query's efficiency depends on fast metadata retrieval and parsing, sound execution plan generation, and efficient execution. In Iceberg query scenarios, StarRocks leverages its native vectorized engine and cost-based optimizer (CBO) and introduces additional optimizations for querying data on the lake. This ensures that queries are executed swiftly and efficiently, making StarRocks a powerful tool for handling Iceberg data.
Explore how industry leaders are transforming their data lakes and lakehouses with Iceberg and StarRocks through our curated case studies. Get direct insights into their design strategies and the benefits they're achieving with these technologies.
A leading social media company has shortened its development cycle and improved cost-effectiveness for its trillions of daily records of data by switching to a data lakehouse architecture. Read the case study.
A gaming giant is reducing storage costs by 15x while eliminating all pre-aggregations through unifying all workloads on its data lakehouse. Read the case study.
An A/B testing SaaS platform is unifying its demanding customer-facing workloads on the data lakehouse. Read the case study.
If you are looking for enterprise-standard features, dedicated support, or just want these benefits on the cloud, check out our StarRocks-powered CelerData Cloud. Sign up at cloud.celerdata.com for a free trial and see what it can do for you.