What is Apache Iceberg?

Apache Iceberg is an open-source table format designed to manage large-scale datasets in data lakes. It was developed to address some of the limitations of existing table formats like Apache Hive, particularly in handling large amounts of data efficiently and consistently.

Reliable Data Consistency

Iceberg supports ACID transactions, ensuring data integrity and reliability even in complex, concurrent data operations.

Non-Disruptive Schema Evolution

Iceberg allows schema changes to be applied incrementally, meaning you can add, remove, or rename columns without impacting ongoing operations.
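The reason this is non-disruptive is that Iceberg tracks every column by a stable field ID, so a rename or addition only touches schema metadata, never the data files. The toy Python sketch below illustrates the idea only; the dict-based "data file" and schemas are stand-ins, not Iceberg's actual structures:

```python
# Toy sketch: Iceberg-style schema evolution via stable field IDs.
# Data files store values keyed by field ID, so renaming a column
# only updates schema metadata; no data files are rewritten.

# A "data file" written under the original schema: field ID -> value.
data_file = {1: "alice", 2: 34}

# Schema v1 maps column names to field IDs.
schema_v1 = {"name": 1, "age": 2}

# Rename "age" -> "years" and add a column: metadata-only changes.
schema_v2 = {"name": 1, "years": 2, "email": 3}

def read_row(data_file, schema):
    """Project a stored row through the current schema by field ID."""
    return {col: data_file.get(fid) for col, fid in schema.items()}

print(read_row(data_file, schema_v1))  # {'name': 'alice', 'age': 34}
print(read_row(data_file, schema_v2))  # {'name': 'alice', 'years': 34, 'email': None}
```

Because stored values are keyed by ID rather than by name, files written under the old schema remain readable under any later schema version.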

Improved Query Performance

With features like data pruning and partition awareness, Iceberg significantly reduces the amount of data scanned during queries, leading to faster queries, especially on large datasets.

Key Iceberg Features

Popular worldwide, Iceberg boasts a robust set of features that make it the leading choice as a lakehouse table format.

Schema Evolution

Seamlessly supports schema changes without requiring data redistribution or downtime.

Partition Evolution

Dynamically adapts partitions to optimize query performance without impacting existing data.
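This works because each data file records which partition spec it was written under, so changing the spec only affects new writes and existing files stay untouched. A toy Python sketch of that idea (the specs and file records here are illustrative, not Iceberg's real metadata layout):

```python
# Toy sketch: partition evolution. Each data file remembers the
# partition spec it was written under; a spec change applies only to
# new writes, so old files are never rewritten.

old_spec = {"id": 0, "transform": lambda ts: ts[:7]}   # partition by month
new_spec = {"id": 1, "transform": lambda ts: ts[:10]}  # partition by day

files = [
    # Written before the evolution, under the monthly spec.
    {"spec_id": 0, "partition": "2024-01", "rows": 100},
]

def write(files, spec, ts, rows):
    """Append a new data file, partitioned by the current spec."""
    files.append({"spec_id": spec["id"],
                  "partition": spec["transform"](ts),
                  "rows": rows})

write(files, new_spec, "2024-02-15", 50)  # new data uses the new spec

# Old files keep monthly partitions; new files get daily ones.
print([f["partition"] for f in files])  # ['2024-01', '2024-02-15']
```

At query time, the planner prunes each file using the spec it was written with, which is why evolving partitions needs no table rewrite.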

ACID Transactions

Ensures data integrity and consistency through robust support for ACID transactions.
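Concretely, Iceberg commits work by atomically swapping the table's current-metadata pointer; a writer that loses the race retries on fresh state. A minimal Python sketch of that optimistic-concurrency pattern (the `Catalog` class and version counter are illustrative, not Iceberg's API):

```python
# Toy sketch: Iceberg-style optimistic concurrency. A commit atomically
# swaps the table's current-metadata pointer; if another writer got
# there first, the compare-and-swap fails and the commit retries.

class Catalog:
    def __init__(self):
        self.current = 0  # version of the current table metadata

    def compare_and_swap(self, expected, new):
        if self.current != expected:
            return False      # someone else committed first
        self.current = new
        return True

def commit(catalog, retries=3):
    for _ in range(retries):
        base = catalog.current          # read the current state
        new = base + 1                  # prepare new metadata on top of it
        if catalog.compare_and_swap(base, new):
            return new                  # the swap is all-or-nothing
    raise RuntimeError("too many concurrent commit conflicts")

cat = Catalog()
print(commit(cat))  # 1
print(commit(cat))  # 2
```

Readers always see a complete snapshot (the pointer either moved or it didn't), which is what makes concurrent writes safe.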

Time Travel

Enables querying of historical data versions, providing easy access to past states of the dataset.
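Under the hood, every commit adds an immutable snapshot, and an "as of" read simply picks the latest snapshot at or before the requested time. A toy Python sketch of the lookup (the snapshot records here are illustrative, not Iceberg's metadata format):

```python
# Toy sketch: time travel by snapshot. Each commit appends an immutable
# snapshot; reading "as of" a timestamp picks the newest snapshot whose
# commit time is at or before that timestamp.
snapshots = [
    {"id": 1, "ts": 100, "rows": ["a"]},
    {"id": 2, "ts": 200, "rows": ["a", "b"]},
    {"id": 3, "ts": 300, "rows": ["a", "b", "c"]},
]

def as_of(snapshots, ts):
    """Return the table contents as they were at time `ts`."""
    eligible = [s for s in snapshots if s["ts"] <= ts]
    return max(eligible, key=lambda s: s["ts"])["rows"]

print(as_of(snapshots, 250))  # ['a', 'b']  (the table as it was at t=250)
print(as_of(snapshots, 300))  # ['a', 'b', 'c']
```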

Efficient Data Pruning

Minimizes the amount of data scanned by using its tree of metadata, enhancing performance.
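The mechanism: Iceberg's manifests record min/max column statistics per data file, so files whose value range cannot match a query's predicate are skipped without ever being opened. A toy Python sketch of that range check (the manifest entries here are illustrative):

```python
# Toy sketch: metadata-based data pruning. Manifests track min/max
# column stats per data file, so files whose range cannot match the
# predicate are skipped without being read.
manifest = [
    {"path": "f1.parquet", "min_id": 0,   "max_id": 99},
    {"path": "f2.parquet", "min_id": 100, "max_id": 199},
    {"path": "f3.parquet", "min_id": 200, "max_id": 299},
]

def prune(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi]."""
    return [f["path"] for f in manifest
            if f["max_id"] >= lo and f["min_id"] <= hi]

# A query for id BETWEEN 120 AND 180 touches one file out of three.
print(prune(manifest, 120, 180))  # ['f2.parquet']
```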

Iceberg Use Case - Gaming

Modern games produce massive amounts of data. Storage plays a crucial role not only in keeping costs low but also in eliminating data silos, so everyone can take full advantage of the data across teams and even studios.

Learn how Tencent Gaming reduces storage costs by 15x while eliminating all pre-aggregations through unifying all workloads on Apache Iceberg. Read the case study.

Apache Iceberg Alternatives

Iceberg not for you? Check out these other popular options for open lakehouse table formats.
Apache Hudi

Known for its real-time data ingestion and upsert capabilities, Hudi excels in scenarios requiring low-latency data updates and fast data ingestion with support for incremental data processing.

Apache Paimon

Paimon offers strong support for real-time streaming data and dynamic schema evolution, making it a good choice for environments where continuous data updates and schema changes are frequent.

Delta Lake

Delta Lake is widely recognized for its strong ACID transaction support and seamless integration with the Apache Spark ecosystem, making it a powerful choice for those already invested in Spark for large-scale data processing.

Improving Iceberg Performance

Get the best possible experience from your Iceberg deployment with these optimization tips.

Metadata Rewrite

Optimize metadata files to enable faster query planning and more efficient data pruning, reducing the amount of data scanned during queries.

Data Compaction

Routinely compact small files into larger ones to speed up data scanning and improve overall query performance by reducing the overhead of handling numerous small files.
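A common way to plan such a rewrite is simple bin packing: group small files into bins close to a target size, then rewrite each bin as one larger file. The Python sketch below is a toy planner only (the target size and greedy strategy are illustrative, not a specific engine's compaction procedure):

```python
# Toy sketch: file compaction by bin packing. Small files are grouped
# into bins near a target size; each bin would be rewritten as one
# larger file, cutting per-file open and planning overhead.
def plan_compaction(file_sizes_mb, target_mb=128):
    bins, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > target_mb and current:
            bins.append(current)       # close the current bin
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

small_files = [10, 60, 30, 40, 90, 20]
print(plan_compaction(small_files))  # [[90], [60, 40], [30, 20, 10]]
```

Six small files collapse into three rewrite groups, so a scan opens three files instead of six.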

Use Iceberg Features

Utilize Iceberg's features, such as hidden partitioning, to more easily optimize data layout and improve query performance based on your specific use case and workload.
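Hidden partitioning means the table spec derives partition values from a column via a transform (for example, the day of a timestamp), so queries filter on the raw column and still prune, without anyone managing a separate partition column. A toy Python sketch of a day-style transform (the grouping code is illustrative, not Iceberg's implementation):

```python
# Toy sketch: hidden partitioning. The table spec derives the partition
# value from a column via a transform (here, a day() transform on a
# timestamp); writers and readers never reference the partition column
# directly.
from datetime import datetime

def day_transform(ts: str) -> str:
    """Iceberg-style day() transform: timestamp -> date partition value."""
    return datetime.fromisoformat(ts).date().isoformat()

rows = ["2024-03-01T09:30:00", "2024-03-01T17:45:00", "2024-03-02T08:00:00"]
partitions = {}
for ts in rows:
    partitions.setdefault(day_transform(ts), []).append(ts)

print(sorted(partitions))  # ['2024-03-01', '2024-03-02']
```

A filter on the timestamp column can be mapped through the same transform at plan time, so only the matching day's partition is scanned.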

Optimize Like a Data Warehouse

Apply traditional data warehouse optimizations like partitioning, sorting, and clustering within Iceberg to tailor data storage and access patterns for faster query execution.

Choose the Right Query Engine

Select a query engine that is well-integrated with Iceberg and suits your use case. Instead of trying to make low-latency queries work with Spark, use a query engine such as StarRocks that suits the task.

Use Cases: Query Engines for Data Lakes

Adopting Iceberg can be a great first step towards building a world-class lakehouse architecture, but the right query engine can also make or break your lakehouse's chances of success. Here are several examples.
SOCIAL

A leading social media company has shortened its development cycle and improved cost-effectiveness for its trillions of daily records of data by switching to a data lakehouse architecture.

TRAVEL

Trip.com has replaced its data warehouse with a data lakehouse query engine and is now experiencing 10x better query performance.

E-COMMERCE

An environmental protection company 10xed the cost-effectiveness of its analytical system by switching to a modern open-source data lakehouse query engine.

SOFTWARE

Tencent's A/B testing SaaS platform ABetterChoice is unifying its demanding customer-facing workloads on the data lakehouse.

How CelerData Enhances Apache Iceberg

Powered by StarRocks, CelerData delivers the best analytics performance and scalability on the market. Here's how we're able to do it.

Massively Parallel Processing

CelerData's massively parallel processing (MPP) architecture with in-memory data shuffling prevents a single node from bottlenecking the entire system, allowing for near-linear scale, especially for JOINs and complex aggregation queries.

Vectorized Query Execution

Written in C++ and fully SIMD-optimized, CelerData's vectorized query execution delivers the industry's fastest query performance on top of Apache Iceberg.

Caching Framework

CelerData's caching framework implements a metadata cache and data cache based on memory and disk to overcome the expensive overhead of retrieving data from remote storage locations.

Intelligent Materialized View

CelerData's materialized views are engineered to be built on demand, accelerating slow queries without external processing tools. Its query rewrite capability applies MVs automatically, without manually modifying your SQL.

Distributed Iceberg Metadata Retrieval

CelerData distributes the job planning and metadata retrieval tasks across compute nodes. By parallelizing the processing of manifest files, StarRocks speeds up and scales the query planning process.
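The idea behind parallel planning can be sketched in a few lines of Python: read many manifest files concurrently and merge their per-file lists into one scan plan. This is a toy illustration only (the thread pool stands in for compute nodes, and the manifest contents are invented):

```python
# Toy sketch: distributed metadata retrieval. Manifest files are read
# in parallel (a thread pool standing in for compute nodes), and the
# per-manifest file lists are merged into a single scan plan.
from concurrent.futures import ThreadPoolExecutor

manifests = {
    "m1.avro": ["f1.parquet", "f2.parquet"],
    "m2.avro": ["f3.parquet"],
    "m3.avro": ["f4.parquet", "f5.parquet"],
}

def read_manifest(name):
    """Stand-in for fetching and decoding one manifest file."""
    return manifests[name]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order, so the merged plan is deterministic.
    plans = list(pool.map(read_manifest, sorted(manifests)))

scan_plan = [f for files in plans for f in files]
print(scan_plan)  # ['f1.parquet', 'f2.parquet', 'f3.parquet', 'f4.parquet', 'f5.parquet']
```

Since each manifest is independent, planning time scales with the slowest worker rather than the total number of manifests.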