Apache Paimon

What is Apache Paimon?

Apache Paimon, originally launched as Flink Table Store (FTS) in January 2022, was developed by the Apache Flink community to combine Flink’s real-time streaming capabilities with the Lakehouse architecture. The goal was to enable real-time data flow in data lakes, offering users a unified experience for both real-time and batch processing. This vision laid the groundwork for a next-generation platform for seamless stream-batch integration.

On March 12, 2023, the project entered the Apache Software Foundation (ASF) incubator and was renamed Apache Paimon. Over time, it developed into a platform that supports both streaming and batch data workflows with highly efficient, real-time analytics. Paimon’s progress and innovation helped it establish a strong foothold in the realm of real-time data lakes.

During its incubation, Apache Paimon addressed several challenges through technical innovation, delivering the following advancements:

  • Enhanced Real-Time Ingestion: Paimon offers tools for syncing real-time data from databases like MySQL, handling large-scale data ingestion with low latency. Schema changes are automatically synchronized, ensuring efficient, real-time updates in the data lake (a minimal sketch of this path follows this list).
  • Unified Stream-Batch Processing: Paimon integrates with Flink for streaming and Spark for batch processing, allowing consistent handling of both workflows in a unified storage system. This reduces complexity, improves performance, and optimizes resource usage.
  • Ecosystem Integration: Paimon supports major open-source data tools, including Flink, Spark, Hive, Trino, StarRocks, Presto, and Doris. This broad integration enables seamless computation and storage within a single platform.
  • Advanced Lakehouse Storage: Paimon introduces features like Deletion Vectors and indexing to enhance query performance and support varied workloads, including streaming, batch, and OLAP, with low-latency querying.
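
To make the first two points concrete, here is a minimal sketch of syncing a MySQL table into a Paimon table with Flink SQL, driven from the Java Table API. It assumes the Paimon Flink connector and a flink-connector-mysql-cdc jar are on the classpath; the connection details, schema, and warehouse path are hypothetical placeholders, not a definitive setup.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlToPaimonSketch {
    public static void main(String[] args) {
        // Streaming mode, since the CDC source runs continuously.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Paimon catalog backed by a warehouse path (local path for this sketch).
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // Hypothetical MySQL CDC source; connection details are placeholders.
        tEnv.executeSql("CREATE TEMPORARY TABLE orders_cdc ("
                + " order_id BIGINT,"
                + " amount DECIMAL(10, 2),"
                + " PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ("
                + " 'connector' = 'mysql-cdc',"
                + " 'hostname' = 'mysql-host', 'port' = '3306',"
                + " 'username' = 'user', 'password' = 'secret',"
                + " 'database-name' = 'shop', 'table-name' = 'orders')");

        // Primary-key table in the lake; updates and deletes are merged by key.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS orders ("
                + " order_id BIGINT,"
                + " amount DECIMAL(10, 2),"
                + " PRIMARY KEY (order_id) NOT ENFORCED)");

        // Continuously replicate inserts, updates, and deletes into Paimon.
        tEnv.executeSql("INSERT INTO orders SELECT * FROM orders_cdc");
    }
}
```

Because the table lives in open lake storage, the same orders table can then be queried from Spark, Hive, Trino, StarRocks, or Doris through their Paimon integrations.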

Key Capabilities of Apache Paimon

  • Stream and Batch Support: Handles stream updates, batch reads/writes, and change log generation in a unified platform.

  • Data Lake Capabilities: Offers low-cost, highly reliable, and scalable metadata, with all the advantages of data lake storage.

  • Versatile Merge Engines: Provides flexible record update methods, allowing you to choose between retaining the last record, performing partial updates, or aggregating records (see the sketch after this list).

  • Change Log Generation: Generates accurate and complete change logs from any data source, simplifying stream analytics.

  • Rich Table Types: In addition to primary key tables, Paimon supports append-only tables, offering ordered stream reads as an alternative to message queues.

  • Schema Evolution: Fully supports schema evolution, allowing you to rename columns and reorder them as needed.
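
As a sketch of the merge-engine capability above: Paimon exposes this behavior through the `merge-engine` table option. The example below, with hypothetical table names and a local warehouse path, creates one table that applies partial updates and another that aggregates a column; the default engine, `deduplicate`, simply retains the last record per primary key.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MergeEngineSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // 'partial-update': each write fills in the non-null columns for a key
        // instead of replacing the whole row.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS user_profile ("
                + " user_id BIGINT, name STRING, email STRING,"
                + " PRIMARY KEY (user_id) NOT ENFORCED"
                + ") WITH ('merge-engine' = 'partial-update')");

        // 'aggregation': records with the same key are merged with per-column
        // aggregate functions, here summing the view count.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS page_views ("
                + " page STRING, views BIGINT,"
                + " PRIMARY KEY (page) NOT ENFORCED"
                + ") WITH ("
                + " 'merge-engine' = 'aggregation',"
                + " 'fields.views.aggregate-function' = 'sum')");
    }
}
```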


How Apache Paimon Works


Architecture of Apache Paimon

Architecture of Apache Paimon (Source: paimon.apache.org)

Apache Paimon brings a new approach to lake storage by combining LSM (Log-Structured Merge) trees with columnar formats like ORC and Parquet. This combination allows Paimon to support large-scale real-time updates within data lakes. The LSM file organization structure in Paimon is as follows:

LSM file organization in Apache Paimon (Source: paimon.apache.org)

Read/Write Capabilities: Paimon offers flexibility in how data is read and written (a short write-side sketch follows this list):

  • For reading: You can access data from historical snapshots in batch mode, retrieve the latest data in streaming mode, or use a hybrid approach that reads incremental snapshots.
  • For writing: Paimon supports streaming synchronization from database change logs (CDC) and bulk insertions or overwrites from offline data sources.
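
A minimal sketch of the write side, assuming the same hypothetical Paimon catalog as in the earlier CDC example: batch jobs can bulk-load or overwrite data, while a streaming job continuously appends to the same kind of table.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WritePathsSketch {
    public static void main(String[] args) {
        // Batch mode is required for INSERT OVERWRITE.
        TableEnvironment batchEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batchEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        batchEnv.executeSql("USE CATALOG paimon");
        batchEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)");

        // Bulk load / backfill from an offline source; VALUES stands in for
        // a real batch query here, and the overwrite replaces existing data.
        batchEnv.executeSql("INSERT OVERWRITE events VALUES (1, 'a'), (2, 'b')");

        // The streaming path (e.g. the CDC sync sketched earlier) writes to
        // the same kind of table with a plain INSERT INTO from a live source.
    }
}
```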

Internal Structure: At its core, Paimon stores data in columnar files within a file system or object storage and uses an LSM tree structure to manage large-scale updates and ensure high-performance queries.

Unified Storage

In streaming engines like Apache Flink, you typically encounter three types of connectors:

  • Message Queues: Systems like Apache Kafka are used to ensure low-latency data flow at both the source and intermediary stages of a pipeline.
  • OLAP Systems: Tools like ClickHouse receive and process data in streams, making it available for ad-hoc queries.
  • Batch Storage: Systems like Apache Hive support traditional batch processing operations, including INSERT OVERWRITE.

Paimon simplifies this by providing a table abstraction that behaves much like a traditional database table (a sketch of both modes follows this list):

  • In batch mode: Paimon functions like a Hive table, supporting various batch SQL operations and querying the latest snapshot.
  • In streaming mode: It acts like a message queue that you can subscribe to for continuous updates, except that its history is backed by lake storage and never expires.
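
The sketch below runs both modes against the hypothetical `events` table from the write example above: a batch query that returns the latest snapshot and finishes, and a streaming query that keeps running and emits new changes as they are committed. The `'scan.mode' = 'latest'` hint is Paimon's option for starting the read from the current tail, queue-style.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ReadModesSketch {
    public static void main(String[] args) {
        // Batch mode: like querying a Hive table, the query returns the
        // latest snapshot and then finishes.
        TableEnvironment batchEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batchEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        batchEnv.executeSql("USE CATALOG paimon");
        batchEnv.executeSql("SELECT * FROM events").print();

        // Streaming mode: like subscribing to a message queue, the query
        // keeps running and emits changes as new snapshots are committed.
        TableEnvironment streamEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        streamEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        streamEnv.executeSql("USE CATALOG paimon");
        streamEnv.executeSql(
                "SELECT * FROM events /*+ OPTIONS('scan.mode' = 'latest') */")
                .print(); // runs until the job is cancelled
    }
}
```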

Basic Concepts


Snapshots

A snapshot captures the state of a table at a specific moment. You can use the latest snapshot to access the most current data, or you can explore previous states through earlier snapshots using time travel.
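
A minimal time-travel sketch, reusing the hypothetical `events` table: Paimon accepts snapshot selectors such as `scan.snapshot-id` and `scan.timestamp-millis` as SQL hints, so a batch query can be pinned to an earlier state of the table.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TimeTravelSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // Pin the read to an earlier snapshot by id (snapshot 2 must exist)...
        tEnv.executeSql(
                "SELECT * FROM events /*+ OPTIONS('scan.snapshot-id' = '2') */")
                .print();

        // ...or to a point in time, given as epoch milliseconds.
        tEnv.executeSql("SELECT * FROM events "
                + "/*+ OPTIONS('scan.timestamp-millis' = '1700000000000') */")
                .print();
    }
}
```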

Partitions

Paimon uses the same partitioning strategy as Apache Hive, dividing tables into sections based on specific column values (e.g., date, city, department). This partitioning helps in managing and querying data more efficiently. If a table has a primary key, the partition key must be a subset of the primary key.
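
A short sketch of a partitioned primary-key table (hypothetical names): note that the partition column `dt` appears in the primary key, as the rule above requires.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // The partition column dt is included in the primary key, as required
        // for primary-key tables.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS orders_by_day ("
                + " dt STRING,"
                + " order_id BIGINT,"
                + " amount DECIMAL(10, 2),"
                + " PRIMARY KEY (dt, order_id) NOT ENFORCED"
                + ") PARTITIONED BY (dt)");
    }
}
```
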
Buckets

Tables, whether partitioned or not, are further divided into buckets. Buckets are determined by hashing one or more columns, and they help organize the data for faster queries. If you don’t specify a bucket key, Paimon uses the primary key (if defined) or the entire record as the bucket key. The number of buckets controls the level of parallelism in processing, but too many buckets can lead to small files and slower reads. Ideally, each bucket should contain around 1GB of data.
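
A sketch of fixing the bucket layout at table-creation time (hypothetical names); `bucket` and `bucket-key` are the Paimon table options for this.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BucketSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // Four buckets, hashed on user_id. Each bucket is processed by one
        // parallel task, so this choice caps write parallelism at four.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS clicks ("
                + " user_id BIGINT,"
                + " url STRING"
                + ") WITH ("
                + " 'bucket' = '4',"
                + " 'bucket-key' = 'user_id')");
    }
}
```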

Consistency Guarantees

Paimon uses a two-phase commit protocol to ensure that each batch of records is atomically committed to the table, with each commit producing at most two snapshots (one for the new data, and possibly one for a compaction). When multiple writers modify a table at the same time, commits to different buckets are serializable; commits to the same bucket are guaranteed snapshot isolation, so the final table may reflect a mix of both sets of changes, but none are lost.


Core Use Cases of Paimon

Paimon currently focuses on three key use cases:

  • CDC Data Ingestion into Data Lakes

    Paimon is optimized to make this process simpler and more efficient. With one-click ingestion, you can bring entire databases into the data lake, dramatically reducing the complexity of the architecture. It supports real-time updates and fast querying at a low cost. Additionally, it offers flexible update options, allowing specific columns or different types of aggregate updates to be applied.

  • Building Streaming Pipelines

    Paimon can be used to build full streaming pipelines, with key features like the ones below (a minimal sketch follows this list):

    • Generating change logs, which allows streaming reads to access fully updated records—making it easier to build robust streaming pipelines.
    • Paimon is also evolving to act like a message queue with a consumer mechanism. The latest version introduces lifecycle management for change logs, letting you define how long they are retained, similar to Kafka (e.g., logs can be stored for a day or more). This creates a lightweight, low-cost streaming pipeline solution.
  • Ultra-Fast OLAP Queries

    While the first two use cases ensure real-time data flow, Paimon also supports high-speed OLAP queries for analyzing stored data. By combining Z-Order and indexing, Paimon enables fast data analysis. Its ecosystem supports various query engines like Flink, Spark, StarRocks, and Trino, all of which can efficiently query data stored in Paimon.
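
As a sketch of the streaming-pipeline use case (hypothetical table names, local warehouse path): setting `'changelog-producer'` makes a Paimon table emit a complete changelog that downstream streaming jobs can consume, while a retention option bounds how long history is kept, Kafka-style. Exact option names for changelog lifecycle management vary by Paimon release; `snapshot.time-retained` is used here as one documented knob.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangelogPipelineSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + "'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
        tEnv.executeSql("USE CATALOG paimon");

        // 'changelog-producer' makes the table emit a complete changelog
        // (+I/-U/+U/-D rows) for downstream streaming readers; the retention
        // option bounds how long history is kept.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS dwd_orders ("
                + " order_id BIGINT,"
                + " status STRING,"
                + " PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ("
                + " 'changelog-producer' = 'lookup',"
                + " 'snapshot.time-retained' = '1 d')");

        // A downstream job consumes the changelog as an ordinary streaming read.
        tEnv.executeSql("SELECT * FROM dwd_orders").print();
    }
}
```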

Apache Paimon + StarRocks

StarRocks + Apache Paimon supports several key use cases:

StarRocks + Apache Paimon use cases (Source: Using Apache Paimon + StarRocks High-speed Batch and Streaming Lakehouse Analysis)


Core Scenarios and Technical Principles

  • Trino Compatibility: StarRocks achieves 90% Trino SQL compatibility, allowing organizations to migrate existing Trino jobs without changing SQL code. By setting sql_dialect = "trino", users can run Trino SQL natively on StarRocks. This feature has already been applied in production environments with partners and customers.
  • Federated Analytics: StarRocks + Apache Paimon supports complex queries that join multiple data sources and table formats. For instance, users can query Apache Paimon data sources by creating external catalogs and joining them with other tables, including internal StarRocks tables. This enables seamless data integration and federated queries across different systems (see the JDBC sketch after this list).
  • Transparent Acceleration: Through its materialized view feature, StarRocks can automatically accelerate external table queries by building materialized views of the required tables and partitions in its native format. This optimization is applied without user intervention. SQL queries are automatically rewritten to leverage these materialized views, solving issues where BI tools or SQL queries are difficult to tune manually.
  • Data Modeling: StarRocks’ materialized views can be nested to create complex data models across multiple layers (ODS, DWD, DWS, ADS). Data lakes like Apache Paimon are used at the base (ODS layer), with materialized views building higher-level layers. This nested approach simplifies data modeling and accelerates querying across the data pipeline, making StarRocks ideal for large-scale analytics.
  • Hot and Cold Data Fusion: StarRocks materialized views support a TTL (Time-To-Live) feature, allowing users to define how long data remains materialized. For instance, partitions for the last three months (hot data) can be materialized, while older partitions (cold data) are queried from the data lake. StarRocks automatically handles the query process, accessing materialized data for hot queries and external table data for cold queries, without requiring users to differentiate between them in their SQL.
  • JNI Connector: This connector allows StarRocks (written in C++) to interact with Java-based data lake and file formats such as Apache Paimon, Hudi, Avro, and RCFile. By abstracting the data conversion process, the JNI Connector simplifies the integration of new data formats, reducing the complexity of adding support for Java-based systems and improving StarRocks' flexibility and extensibility in working with various data lakes.
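
A sketch of wiring StarRocks to a Paimon warehouse over plain JDBC (StarRocks speaks the MySQL wire protocol). The host names, warehouse path, and table names are hypothetical placeholders; the `CREATE EXTERNAL CATALOG` statement and the `sql_dialect` session variable are the StarRocks SQL surface described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StarRocksPaimonSketch {
    public static void main(String[] args) throws Exception {
        // StarRocks speaks the MySQL wire protocol; 9030 is the default FE
        // query port, so a standard MySQL JDBC driver is enough.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://starrocks-fe:9030", "root", "");
             Statement stmt = conn.createStatement()) {

            // External catalog pointing at a Paimon warehouse on HDFS.
            stmt.execute("CREATE EXTERNAL CATALOG paimon_catalog PROPERTIES ("
                    + "\"type\" = \"paimon\","
                    + "\"paimon.catalog.type\" = \"filesystem\","
                    + "\"paimon.catalog.warehouse\" = \"hdfs://nn:8020/paimon\")");

            // Optional: run existing Trino SQL unchanged in this session.
            stmt.execute("SET sql_dialect = 'trino'");

            // Federated query joining a Paimon table with a native StarRocks table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT o.order_id, d.region"
                    + " FROM paimon_catalog.shop.orders o"
                    + " JOIN default_catalog.analytics.dim_region d"
                    + " ON o.region_id = d.region_id")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}
```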

Performance Testing of StarRocks + Apache Paimon Lakehouse Analysis

TPCH benchmark results for StarRocks + Apache Paimon


The performance of StarRocks + Apache Paimon was evaluated using the TPCH benchmark with 100 GB of data, running queries on a test environment configured with one master node and three core nodes, each with 16 vCPUs and 64 GB of memory. The test used the following configuration:

  • Test environment: EMR-5.15.1 version with 1 master node and 3 core nodes.
  • Software: Trino version 422 and StarRocks' main branch.
  • Memory settings: Both Trino and StarRocks were configured to use 50 GB of memory.

During testing, each TPCH query was run three times and the average execution time recorded, with no pre-analysis or cache warm-up. The data was stored on HDFS.

Results: The combination of StarRocks' optimized kernel and Apache Paimon's data format provided significant performance gains, with StarRocks + Apache Paimon delivering up to 15 times faster query performance than earlier versions of the integration.


Conclusion

The blog covered the essential aspects of Apache Paimon, including its architecture, core concepts, and key capabilities. Apache Paimon stands out in data streaming and storage thanks to its high-throughput writes and low-latency queries, and its compatibility with common compute engines, as well as platforms like Alibaba Cloud E-MapReduce (EMR), further broadens its utility. By unifying streaming and batch processing over lake storage, Apache Paimon simplifies large-scale data operations. Readers are encouraged to explore and experiment with Apache Paimon to unlock its full potential in real-time data processing and analytics.