Parquet File Format
Publish date: May 17, 2024 12:45:47 PM
What is Parquet?
Parquet is a columnar storage format optimized for analytical querying and data processing. Each column's data is compressed using a series of algorithms before being stored, avoiding redundant data storage and allowing queries to involve only the necessary columns. This significantly improves query efficiency.
Apache Parquet, the widely adopted open-source implementation, was initiated by Twitter and Cloudera and graduated from the Apache incubator as a top-level project in May 2015. It supports nested data formats, inspired by Google’s Dremel paper, enabling efficient storage and querying of complex data structures.
How Parquet Works and Why It Is So Fast
To understand how Parquet works, let’s use a simple eCommerce data table as an example:
| order_id | product_name | category    | price  | purchase_date |
|----------|--------------|-------------|--------|---------------|
| 1        | Laptop       | Electronics | 999.99 | 2023-01-01    |
| 2        | Smartphone   | Electronics | 699.99 | 2023-01-02    |
| 3        | Coffee Maker | Kitchen     | 49.99  | 2023-01-03    |
Comparing Row-Oriented Storage
Before delving into Parquet, it's helpful to understand how traditional row-oriented storage formats like CSV or JSON work. In these formats, data is stored row by row. For instance, storing the above table in CSV would look like this:
order_id,product_name,category,price,purchase_date
1,Laptop,Electronics,999.99,2023-01-01
2,Smartphone,Electronics,699.99,2023-01-02
3,Coffee Maker,Kitchen,49.99,2023-01-03
In row-oriented storage, each row's data is stored in a contiguous block, making it easy to read and write entire records. This approach suits transactional workloads where operations typically involve individual record insertions, updates, or deletions.
Parquet's Multi-Level Structure
Parquet is a columnar storage format that organizes data into multiple hierarchical levels: row groups, columns, and pages. This structure enhances performance and efficiency for big data processing tasks.
HDFS Block
An HDFS block is the smallest unit of storage in HDFS. Each block is stored as a local file on the underlying filesystem, and multiple replicas are maintained across different machines. The default block size is 128 MB in recent Hadoop releases, though clusters commonly configure larger blocks such as 256 MB or 512 MB.
HDFS File
An HDFS file consists of data and metadata; its data is split across multiple blocks, which are the fundamental units of storage and replication in HDFS.
Row Groups
A row group is the largest unit of data in a Parquet file. It divides the entire table into blocks, each containing a configurable number of rows. For example, consider an eCommerce table divided into two row groups, each holding different sets of rows. The size of row groups significantly impacts performance:
- Larger Row Groups: Offer better compression and more efficient I/O operations but may increase the amount of data read when querying subsets. Parquet recommends row groups sized between 512 MB and 1 GB to align with HDFS block sizes for optimal performance.
- Smaller Row Groups: Improve query performance for specific rows but may reduce compression efficiency.
Typical row group sizes in Parquet implementations are around 128 MB by default but can be adjusted based on the specific use case. Each row group is loaded entirely into memory during read operations. For schemas with smaller record sizes, more rows can be stored per row group.
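To make this concrete, here is a minimal sketch using pyarrow (the file name, column, and row counts are made up for illustration): the row-group size can be set at write time and verified from the file metadata.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One million rows standing in for a larger table.
table = pa.table({"order_id": list(range(1_000_000))})

# Cap each row group at 250,000 rows; the writer may still cut smaller
# row groups based on its own byte-size limits.
pq.write_table(table, "orders.parquet", row_group_size=250_000)

# The footer records how many row groups the file contains.
print(pq.ParquetFile("orders.parquet").metadata.num_row_groups)  # typically 4
```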
Column Chunk
Within each row group, the data for each column is stored in a separate column chunk. Each column chunk is further divided into pages, the smallest unit in Parquet, and pages are where encoding and compression are applied. Here's how the eCommerce table might be stored:
- Column "order_id": 1, 2, 3
- Column "product_name": Laptop, Smartphone, Coffee Maker
- Column "category": Electronics, Electronics, Kitchen
- Column "price": 999.99, 699.99, 49.99
- Column "purchase_date": 2023-01-01, 2023-01-02, 2023-01-03
Since columns store data of the same type together, Parquet can use various optimizations to store this data efficiently in binary format.
Pages
Each column chunk is further divided into pages, the smallest unit of storage in Parquet. Pages are the basic unit of encoding and compression. Different pages within the same column chunk can use different encoding techniques. Smaller pages allow for finer-grained access, but larger pages reduce storage overhead and parsing costs. An 8KB page size is typically recommended.
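Page size is likewise a writer-side setting. In pyarrow, for instance, data_page_size is a target size in bytes; the sketch below simply mirrors the 8 KB recommendation above (the table contents and path are invented):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"category": ["Electronics", "Kitchen"] * 100_000})

# Ask the writer to cut data pages at roughly 8 KB each (the default is larger).
pq.write_table(table, "categories.parquet", data_page_size=8 * 1024)
```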
Structure of a Parquet File
A Parquet file consists of three main sections: Header, Data Blocks, and Footer. Each file starts and ends with a magic number "PAR1" to identify it as a Parquet file.
Header
The header contains the initial magic number "PAR1".
Data Blocks
The data blocks section is where the actual data is stored. It is composed of multiple row groups, each containing a batch of data. For instance, if a file has 1000 rows, it might be split into two row groups, each containing 500 rows. Within each row group, data is organized by column, with all data for a single column stored together in a column chunk. Each column chunk is further divided into pages, classified as data pages or dictionary pages. This hierarchical design allows:
- Parallel loading of multiple row groups.
- Efficient columnar storage and retrieval through column chunks.
- Fine-grained data access via pages.
Footer
The footer contains critical metadata about the file: the file schema and metadata for each row group and column. The footer also records its own length (Footer Length) and ends with the magic number "PAR1" again for validation.
The file footer structure includes:
- File Metadata: Describes the schema and contains metadata for each row group.
- Footer Length: A 4-byte integer indicating the size of the footer section.
- Magic Number: "PAR1", marking the end of the Parquet file.
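All of this footer information is exposed by Parquet readers without touching the data pages. A minimal pyarrow sketch (the file name is a placeholder for any existing Parquet file):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")   # opening the file reads the footer metadata

print(pf.schema_arrow)                  # file schema
print(pf.metadata.num_row_groups)       # number of row groups in the file

rg = pf.metadata.row_group(0)           # per-row-group metadata
print(rg.num_rows, rg.total_byte_size)

col = rg.column(0)                      # per-column-chunk metadata
if col.statistics is not None:          # min/max, if statistics were written
    print(col.path_in_schema, col.statistics.min, col.statistics.max)
```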
Encoding and Compression Techniques in Parquet
Parquet employs a variety of encoding and compression techniques to optimize storage and enhance performance. These techniques ensure that data is stored efficiently, reducing the space required while maintaining quick access capabilities.
Encoding Techniques
- Dictionary Encoding: Suitable for columns with a limited number of distinct values, this technique creates a dictionary of unique values and replaces the actual data with dictionary indexes. For example, if a "category" column contains values like Electronics and Kitchen, Parquet stores these values once in a dictionary and stores indexes that refer to them, significantly reducing storage size.
- Run-Length Encoding (RLE): Effective for columns with many repeated values, RLE stores the value and the count of its consecutive occurrences. For instance, if the "price" column has repeated prices, RLE stores the price once along with the count of repetitions, minimizing storage for repetitive data.
- Bit Packing: Useful for columns with small integer values, bit packing minimizes storage by using the fewest bits needed to represent the values, rather than the standard 32 or 64 bits.
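Which encodings the writer actually chose can be checked from the column-chunk metadata. A small pyarrow sketch (data and path are invented):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality column, a natural candidate for dictionary encoding.
table = pa.table({"category": ["Electronics", "Electronics", "Kitchen"] * 10_000})
pq.write_table(table, "categories.parquet", use_dictionary=True)

col = pq.ParquetFile("categories.parquet").metadata.row_group(0).column(0)
print(col.encodings)                    # encodings used for this column chunk
print(col.total_uncompressed_size, col.total_compressed_size)
```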
Compression Techniques
Parquet supports a range of compression algorithms to further reduce data size. Common lossless compression algorithms include Snappy, Gzip, Lzo, and Lz4, which typically reduce data size to 40-60% of its original size. These algorithms involve packing compressed bytes into blocks, which can be stored contiguously to save space and provide fast random access capabilities. When reading data, the reader only needs to decompress the necessary blocks rather than the entire file, enhancing read performance.
Combining Encoding and Compression
After applying the encoding techniques, Parquet applies compression algorithms to further reduce data size. The choice of compression algorithm depends on the trade-off between compression ratio and decompression speed:
- Snappy: Balances compression speed and ratio, making it suitable for general-purpose use.
- Gzip: Offers higher compression ratios but slower compression and decompression speeds, ideal for scenarios where space savings are more critical than speed.
- Lzo and Lz4: Known for high decompression speed, suitable for real-time analytics where quick data access is essential.
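The trade-off is easy to measure by writing the same data with several codecs and comparing file sizes. A sketch with pyarrow (which codecs are available depends on how pyarrow was built):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"category": ["Electronics", "Kitchen", "Kitchen"] * 500_000})

# Write the same table with each codec and compare the resulting file sizes.
for codec in ["NONE", "SNAPPY", "GZIP", "LZ4"]:
    path = f"sample_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>6}: {os.path.getsize(path):,} bytes")
```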
Examples of Parquet’s Efficiency
To further illustrate Parquet’s efficiency in various data processing scenarios, let's dive into additional examples that showcase its advantages in real-world applications.
Example 1: Financial Data Analysis
Consider a financial services company that processes millions of transactions daily. Each transaction record includes multiple fields such as transaction ID, timestamp, account number, transaction type, amount, and currency.
- Scenario: Analysts need to generate daily reports summarizing transaction amounts by type and currency.
- Row-Oriented Storage: A row-oriented database would need to scan every row to extract the relevant columns (transaction type, amount, and currency), resulting in high I/O operations and slow query performance.
- Parquet: With Parquet, only the required columns are read, significantly reducing I/O operations. For instance, reading only the 'transaction type', 'amount', and 'currency' columns from a dataset with millions of rows drastically cuts down the data volume processed, speeding up report generation (see the sketch below).
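With pandas, such a column projection is a one-liner; the path and column names below are hypothetical:

```python
import pandas as pd

# Only the three columns needed for the report are read from the file.
daily = pd.read_parquet(
    "transactions.parquet",
    columns=["transaction_type", "amount", "currency"],
)
report = daily.groupby(["transaction_type", "currency"])["amount"].sum()
```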
Example 2: Log Data Storage and Retrieval
A web service logs millions of events daily, with each log entry including a timestamp, user ID, session ID, event type, and event details.
- Scenario: Engineers need to analyze error events over the past month to identify patterns and troubleshoot issues.
- Row-Oriented Storage: Retrieving error events would involve scanning all log entries, including those not relevant to the query.
- Parquet: By storing logs in Parquet format, only the 'event type' and relevant 'event details' columns need to be read (see the sketch below). Additionally, Parquet's compression can reduce the storage size of repetitive event types, making data retrieval faster and more efficient.
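Row-level predicates can also be pushed down to the reader, so row groups whose statistics rule out matches are skipped entirely. A sketch using pyarrow (path and column names are hypothetical):

```python
import pyarrow.parquet as pq

# Reads only the two listed columns, and only the row groups that can
# contain "error" events according to their statistics.
errors = pq.read_table(
    "logs.parquet",
    columns=["event_type", "event_details"],
    filters=[("event_type", "==", "error")],
).to_pandas()
```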
These examples highlight how Parquet's columnar storage format and efficient compression mechanisms significantly enhance data processing performance:
- Reduced I/O Operations: By reading only the required columns, Parquet minimizes disk I/O, which is often the bottleneck in data processing.
- Improved Query Performance: Targeted data access leads to faster query execution, enabling real-time analytics and quicker insights.
- Efficient Storage: Compression algorithms reduce the physical storage size, lowering costs and improving data retrieval speeds.
- Scalability: Parquet's design allows it to handle large-scale datasets efficiently, making it suitable for big data applications across various domains.
Parquet vs. ORC
When it comes to columnar storage formats in big data processing, Parquet and ORC (Optimized Row Columnar) are two of the most widely used options. Both offer significant performance benefits for analytical queries and data storage efficiency. However, there are differences in their design, features, and optimal use cases. This comparative analysis aims to highlight the key aspects of Parquet and ORC to help you choose the right format for your needs.
Optimized Row Columnar (ORC)
ORC is another columnar storage format optimized for Hadoop workloads. Originally developed by Hortonworks as part of the Apache Hive project and now a top-level Apache project in its own right, ORC is designed to provide high-performance reads, writes, and storage efficiency. It includes features like lightweight indexes, support for complex types, and various compression options, making it effective for large-scale data processing in Hive and other Hadoop-based systems.
Key Differences Between Parquet and ORC
Compression and Storage Efficiency
- Parquet: Supports multiple compression algorithms, including Snappy, Gzip, Lzo, and Lz4. It uses efficient encoding techniques like dictionary encoding and run-length encoding (RLE) to minimize storage space.
- ORC: Also supports various compression algorithms, with Zlib and Snappy being commonly used. ORC provides highly optimized compression and has built-in support for data skipping, which helps in reducing the amount of data read during queries.
- Example: For a dataset with many repeated values, both Parquet and ORC can significantly reduce storage size, but ORC's built-in optimizations might offer better compression ratios and faster reads due to its lightweight indexing and data skipping features.
Schema Evolution
- Parquet: Supports schema evolution, allowing changes to the schema without requiring a complete rewrite of existing data. This is useful in dynamic environments where the data schema can change over time.
- ORC: Also supports schema evolution but with more complexity. ORC requires careful management of schema changes to ensure compatibility and efficient data access.
- Example: In a rapidly changing data environment, Parquet's simpler schema evolution might make it easier to manage updates without impacting existing data processing pipelines.
Performance
- Parquet: Optimized for read-heavy operations with its columnar storage format. It performs exceptionally well for analytical queries that require reading specific columns from large datasets.
- ORC: Designed for high-performance reads and writes. ORC's advanced indexing and data skipping capabilities can make it faster for certain types of queries, especially in Hive environments.
- Example: For a query that filters on specific columns and needs to read large amounts of data, both formats perform well. However, in a Hive-centric environment with complex queries, ORC might provide better performance due to its indexing and optimized read paths.
Integration and Ecosystem
- Parquet: Widely supported across many big data tools and frameworks, including Apache Spark, Apache Drill, Apache Impala, and more. Its language-agnostic nature makes it versatile for use in different programming environments.
- ORC: Primarily optimized for use with Apache Hive and other Hadoop-based tools. While it has good support in the Hadoop ecosystem, its usage outside this environment is less prevalent compared to Parquet.
- Example: In a multi-tool environment where data needs to be shared across different frameworks and languages, Parquet's broad compatibility might make it the preferred choice.
Write Costs
- Parquet: High write costs due to its columnar nature, making it less ideal for write-heavy workloads.
- ORC: Similar high write costs, also due to its columnar storage format.
- Example: In a financial trading system that logs thousands of transactions per second, both Parquet and ORC would face significant write costs. However, if the system also requires frequent updates and deletions, ORC might be more suitable due to its support for ACID transactions.
Read Performance
- Parquet: Optimized for read-heavy analytical workloads, suitable for scenarios requiring quick access to large datasets.
- ORC: Also delivers strong read performance, aided by its lightweight indexes, and it additionally suits environments requiring frequent data modifications and ACID transactions, typically through Hive.
- Example: For a marketing analytics platform using Parquet, quick retrieval and analysis of customer behavior data are crucial. On the other hand, a banking application using ORC can efficiently handle frequent updates and queries on transaction data, maintaining data integrity and performance.
Use Cases
- Parquet: Ideal for large datasets, analytics, and data warehousing, where efficient data retrieval and complex query performance are crucial.
- ORC: Best for write-intensive tasks and scenarios requiring ACID transactions, making it a strong choice for data warehouses that demand efficient storage and fast retrieval, particularly within Hive-based frameworks.
Best Practices for Using Parquet
- Optimized I/O: When reading Parquet files, avoid opening the file yourself with Python's `open` and passing the file object to the reader. Instead, pass file paths directly. This approach, along with memory mapping, can significantly improve I/O performance by reducing overhead and allowing more efficient access patterns.
- Efficient Encoding: Parquet uses dictionary encoding and run-length encoding (RLE) to boost performance. However, if the dataset is too large or contains too many unique values, the dictionary can exceed its size limit and fall back to plain encoding. This fallback reduces read/write performance and disk space savings. To mitigate this, increase the `dictionary_page_size` parameter in `write_table` to raise the dictionary size limit, or reduce `row_group_size` to lower the column chunk size, thus decreasing the data volume encoded by each dictionary.
- Leveraging Metadata: Parquet metadata stores min, max, and count statistics for each column chunk. During reads, Parquet uses this metadata to filter row groups, skipping irrelevant ones and thereby reducing the amount of data processed, which speeds up query performance. To take full advantage of this feature, sort the data by the columns most frequently used for filtering before saving. This narrows the value range of each chunk and the amount of data that needs to be parsed during reads.
- Effective Partitioning: Partitioning large datasets helps manage and query data more efficiently. Use `partition_cols` to split the dataset by specific columns, such as date, creating a directory structure like `/data/xxx/date=YYYY-mm-dd`. This structure, known as a Parquet dataset, allows Parquet to automatically filter files during reads, reducing data parsing and scanning (see the sketch after this list). However, avoid creating too many small files, as the overhead of managing numerous small files can degrade performance. Generally, aim for individual file sizes between tens of megabytes and 1 GB. When reading partitioned datasets, the partition columns are converted to Arrow dictionary types (pandas categorical), which may disrupt column order. To maintain consistent row order across a write/read round trip, set the DataFrame's index to a sequential index before saving and use `index=True` in `to_parquet`. After reading, sort the DataFrame by index with `sort_index`.
- Consistent Data Handling: Ensuring consistent row order and handling NaN values appropriately maintains data integrity and compatibility across different processing steps. For DataFrame columns of type `object` containing `np.nan` values, Parquet writes and reads `NaN` values as `None`. This change from `NaN` to `None` can affect compatibility in some scenarios, as `None` is treated as a different type. To preserve `NaN` values, set the `use_nullable_dtypes` parameter to `True` in `read_parquet`, though this is an experimental feature in pandas and may change in future versions. Alternatively, convert `None` back to `NaN` after reading.
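The partitioning and row-order advice above can be combined in a short pandas sketch. Paths, column names, and data are invented; the essential pieces are `partition_cols`, `index=True`, and `sort_index` (the `filters` argument is forwarded to the pyarrow engine):

```python
import pandas as pd

# Hypothetical event log; "date" will become the partition column.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "event_type": ["click", "error", "click"],
    "user_id": [101, 102, 103],
})

# A sequential index plus index=True lets us restore row order after reading.
df = df.reset_index(drop=True)
df.to_parquet(
    "events_dataset",              # a directory with one subfolder per date=... value
    engine="pyarrow",
    partition_cols=["date"],
    index=True,
)

# Read back a single partition, then restore the original row order.
restored = pd.read_parquet(
    "events_dataset",
    filters=[("date", "==", "2023-01-01")],
).sort_index()
```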
Why Parquet is Essential for Modern Data Lakehouses
The data lakehouse architecture is gaining traction for its ability to combine the scalability and flexibility of data lakes with the performance and reliability of data warehouses. At the heart of many successful data lakehouse implementations lies Parquet, a columnar storage format that enhances the efficiency and effectiveness of data storage and processing. Here’s why Parquet is both popular and important in the context of data lakehouses.
Efficient Storage and Compression
Parquet's columnar storage format allows for highly efficient data compression, which is crucial in the data lakehouse environment where vast amounts of data are stored. By storing data column-wise, Parquet can apply compression algorithms that significantly reduce the storage footprint. This is particularly beneficial for repetitive data, such as log files or transactional data, where columns often contain many repeated values. The reduced storage requirements lower costs and improve data retrieval speeds.
Example: In a retail data lakehouse, sales data with millions of transactions can be stored in Parquet format. The repeated product IDs, customer IDs, and timestamps can be compressed effectively, saving storage space and speeding up analytics tasks.
Optimized Query Performance
Parquet excels in read-heavy environments due to its columnar format, which allows for selective data access. When querying large datasets, only the relevant columns are read, minimizing I/O operations and significantly speeding up query execution. This is particularly important in data lakehouses, where real-time analytics and complex queries are common.
Example: A data lakehouse storing IoT sensor data can use Parquet to quickly retrieve temperature readings for anomaly detection. By reading only the temperature column, the system avoids unnecessary I/O operations, resulting in faster query responses.
Support for Complex Data Types
Parquet's ability to handle complex nested data structures makes it ideal for the diverse data stored in a data lakehouse. Whether dealing with JSON logs, hierarchical data, or arrays, Parquet can efficiently store and query these formats, preserving their structure and relationships.
Example: In a healthcare data lakehouse, patient records often include nested data such as medical history, lab results, and treatment plans. Parquet can store this data in its nested format, allowing for efficient querying and analysis without the need to flatten the data.
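Parquet's nested-type support is easy to see with pyarrow, which infers a list-of-struct column directly from Python objects; the patient-record fields below are invented for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Each patient row carries a nested list of lab-result structs.
records = pa.table({
    "patient_id": [1, 2],
    "lab_results": [
        [{"test": "glucose", "value": 5.4}, {"test": "hba1c", "value": 38.0}],
        [{"test": "glucose", "value": 6.1}],
    ],
})
pq.write_table(records, "patients.parquet")

# The nested structure survives the round trip and can be read on its own.
print(pq.read_table("patients.parquet", columns=["lab_results"]).schema)
```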
Interoperability and Ecosystem Support
Parquet's wide support across various big data processing frameworks and tools makes it a versatile choice for data lakehouses. It is natively supported by Apache Spark, Hive, Impala, and many other tools, enabling seamless integration and processing. This interoperability ensures that data stored in Parquet format can be easily accessed and processed by different systems and applications.
Example: A data lakehouse used by a financial institution can leverage Spark for data processing, Hive for SQL-based querying, and Impala for fast analytics, all while using Parquet as the underlying storage format. This ensures consistent and efficient data processing across different tools.
Scalability and Flexibility
Data lakehouses need to scale to handle growing data volumes and evolving data types. Parquet’s design allows it to scale efficiently, handling large datasets with ease. Its support for schema evolution also provides the flexibility to adapt to changes in data structure over time without requiring a complete rewrite of existing data.
Example: An e-commerce data lakehouse can store evolving customer behavior data in Parquet format. As new data points (e.g., new product categories or marketing channels) are added, Parquet’s schema evolution capabilities ensure that the data structure can be updated without impacting existing analytics workflows.
Governance and Metadata Management
Data governance is critical in data lakehouses, and Parquet supports rich metadata, which is essential for tracking data lineage, managing schema changes, and enforcing data governance policies. This metadata helps maintain data quality and compliance with regulatory requirements.
Example: In a data lakehouse used by a government agency, Parquet’s metadata capabilities can be utilized to track data sources, transformations, and usage, ensuring transparency and compliance with data governance standards.
Parquet's popularity and importance in data lakehouses stem from its ability to provide efficient storage, optimized query performance, support for complex data types, interoperability, scalability, and robust data governance. These features make Parquet an ideal choice for managing the diverse and growing data needs of modern data lakehouses. By leveraging Parquet, organizations can achieve significant performance improvements, cost savings, and flexibility in their data processing and analytics workflows, ensuring they can handle the challenges of big data effectively.