Apache Druid

What is Apache Druid?

Apache Druid is a distributed, column-oriented data processing system designed to support real-time OLAP (Online Analytical Processing) analysis with high-speed data ingestion and flexible, real-time multidimensional queries. It is often used as a database to support ingestion, high-performance queries, aggregations, and high-concurrency API calls, providing reliable performance for demanding applications. Druid also excels at pre-aggregating and analyzing data based on timestamps, making it suitable for time-series data analysis.

Druid features a graphical user interface (GUI) and is ideal for scenarios involving event-driven data, real-time data extraction, and high-performance queries. Unlike systems like Apache Kylin, which require explicit pre-calculation (MOLAP approach), Druid follows the ROLAP approach, performing real-time data ingestion and providing immediate query results without a predefined cube-building process.

A closer look at MPP and Lambda

  • MPP (Massively Parallel Processing): A database architecture where each node has its own memory and disk system, enabling tasks to be distributed and processed in parallel across multiple nodes.
  • Lambda Architecture: Designed to balance the benefits of stream processing and batch processing, combining low-latency data with more accurate, processed data for real-time analytics. It consists of three layers: Batch Layer (for pre-processing historical data), Speed Layer (for real-time incremental data processing), and Serving Layer (for merging both datasets for queries).

History and Development

 

Origin of Apache Druid

The creators of Apache Druid include Eric Tschetter, Fangjin Yang, Gian Merlino, and Vadim Ogievetsky. They initiated the project in 2011 to power the analytics product of Metamarkets. The goal was to develop a system that could ingest large quantities of event data and provide low-latency queries.

Evolution and Milestones

Apache Druid has evolved significantly since its inception. The project has seen numerous milestones:
  • 2012: Open-source release of Apache Druid.
  • 2015: Integration with Apache Kafka for real-time data ingestion.
  • 2018: Graduation to a top-level project within the Apache Software Foundation.
  • 2020: Introduction of Imply Polaris, a cloud database tailored for Apache Druid.
These milestones highlight the continuous improvements and growing adoption of Apache Druid in the industry.

Core Features of Apache Druid

  • Low-latency, interactive queries: Druid offers sub-second query responses for real-time data ingestion, thanks to pre-aggregation, columnar storage, and bitmap indexing.
  • High availability: By storing segments in deep storage (like HDFS or S3) and supporting multi-replica ingestion, Druid ensures data availability and fault tolerance.
  • Horizontal scalability: Druid can scale to clusters with hundreds of servers, handling millions of writes per second while delivering sub-second query responses for massive datasets.
  • Advanced querying capabilities: Druid supports time series queries, TopN, GroupBy, and both API and SQL-based querying, making it flexible for a wide range of analytical needs.
  • Columnar storage: Druid's columnar architecture optimizes query speeds for OLAP workloads by enabling efficient scanning and aggregation.
  • Real-time and batch ingestion: Druid supports both real-time data ingestion for instant queries and batch data loading for historical analysis.

 

What Are the Benefits of Using Apache Druid?

Apache Druid is purpose-built for fast data ingestion and real-time querying, designed to support workflows that demand rapid access to data. Druid features a powerful user interface (UI) that allows for operational queries and high concurrency, making it an excellent choice for scenarios that require interactive and consistent user experiences.

Seamless Integration with Existing Data Pipelines

Druid integrates easily with data streams from message queues like Kafka and Amazon Kinesis, and can batch-load files from data lakes such as HDFS and Amazon S3. It supports a wide range of popular structured and semi-structured data formats, providing flexibility for various data ingestion needs.

Fast, Consistent Queries Under High Concurrency

Druid significantly outperforms traditional solutions in terms of data ingestion and query performance, thanks to its innovative storage and indexing structure, along with its ability to handle both exact and approximate queries. This allows it to achieve sub-second query times, even under heavy load.

Wide Applicability Across Use Cases

Druid is designed for rapid ad hoc querying of both real-time and historical data. Its flexibility unlocks new workflows across a variety of industries, including clickstream analysis, application performance monitoring (APM), supply chain management, network telemetry, digital marketing, fraud detection, and more.

Key Use Cases:

  • User Activity and Behavior: Druid is frequently used to analyze clickstream, visit stream, and activity stream data. It can measure user engagement, track A/B test results, and analyze user behavior. Druid excels at calculating user metrics, such as daily active users (DAU), with both exact and approximate methods (with an average accuracy of 98%). It's also great for funnel analysis, helping product teams track user actions like registration steps.

  • Network Traffic: Druid is a go-to solution for collecting and analyzing network traffic data. It can quickly query and aggregate large volumes of network records across multiple attributes, such as IP addresses, ports, geolocation, and device details, making network analysis fast and efficient.

  • Digital Marketing: In the online advertising space, Druid is used to store and query ad performance data. It helps advertisers assess campaign effectiveness, click-through rates, and conversion metrics. Originally built to analyze ad data, Druid has scaled to support petabyte-level datasets across thousands of servers globally.

  • Application Performance Management (APM): Druid can track operational data generated by applications, providing insights into how users interact with the app or reporting internal application metrics. Druid allows for deep dives into performance bottlenecks and faster analysis of thousands of app event attributes. For example, it can measure API latency or evaluate resource utilization based on various criteria such as geography or user profile.

  • IoT and Device Metrics: Druid can serve as a time-series database for storing and analyzing real-time server and device metrics. Unlike traditional time-series databases, Druid is an analytics engine that combines time-based partitioning, columnar storage, and search indexing. It allows for rapid time-based querying and aggregation, supporting millions of unique dimension values for flexible grouping and filtering.

  • OLAP and Business Intelligence: Druid is often used in business intelligence applications to accelerate queries and enhance interactive dashboards. Compared to Hadoop-based SQL engines like Presto or Hive, Druid is optimized for high concurrency and sub-second query performance, making it a better fit for real-time, interactive visual data analysis.

Flexible Deployment

Druid can be deployed on any commercial hardware running a *NIX environment, whether in the cloud or on-premises. Its scalability is seamless, allowing for easy horizontal scaling by adding or removing Druid services based on the workload.

In summary, Apache Druid is an excellent choice for organizations looking to build real-time analytics databases that handle high-frequency data ingestion and support interactive, low-latency querying across a wide range of use cases. Its ability to seamlessly integrate with modern data pipelines, handle high concurrency, and deliver fast, consistent results makes it a powerful tool for analytics-driven applications.


Challenges with Apache Druid

  • Performance Lag: Once a leader, Apache Druid now trails newer OLAP databases, such as StarRocks, which outperforms Druid by 8.9x in benchmark tests. One reason for this performance lag is Druid's Java-based architecture, which doesn’t efficiently use SIMD (Single Instruction, Multiple Data) optimizations. SIMD is essential for maximizing performance in OLAP systems by enabling them to take full advantage of modern CPU architectures, particularly for parallel processing.

    Apache Druid Benchmarks (1)

Figure 3: Apache Druid Benchmark Performance

  • No Efficient JOIN Support: Druid struggles with JOIN operations at scale, requiring costly denormalization for multi-table queries, especially in real-time use cases.


    Discover how Airbnb switched from StarRocks to Druid to enable on-the-fly joins and aggregation.
  • Limited Cloud Scalability: Druid must preload data from cloud storage (e.g., AWS S3) to local storage before querying, limiting dynamic scaling and increasing operational costs.

  • No Real-Time Data Updates: Druid lacks native support for real-time mutable data, forcing complex workarounds like Kafka preprocessing or large table appends, adding complexity and errors to the pipeline.

  • Complex Architecture: Deploying Druid requires six core services and external dependencies, leading to high operational overhead and increased maintenance complexity.

    Druid Architecutre (1)

    Figure 4: A Complex Apache Druid-Based Architecture

    The intricate nature of this architecture can significantly increase the operational burden, introducing greater risk for system failures and making maintenance more challenging. As the system scales, keeping all components running smoothly becomes increasingly difficult, further raising the complexity and cost of operating Apache Druid.

Top Alternatives to Druid: Why StarRocks Stands Out

StarRocks, a project supported by the Linux Foundation, is a real-time MPP OLAP database known for delivering exceptional performance on a broad array of SQL queries. Designed for modern real-time analytics, it provides a robust solution for businesses that prioritize speed and scalability.

Advantages of StarRocks:

  • Exceptional query performance with JOINs
    StarRocks' MPP architecture, enhanced by in-memory data shuffling, a cost-based optimizer, and SIMD-optimized vector execution, offers top-tier performance for large-scale multi-table OLAP queries.
  • Superior benchmark results
    In SSB benchmark comparisons, StarRocks outpaced Apache Druid, with its multi-table queries outperforming even Druid’s single-table queries. This eliminates the need for complex denormalization while delivering more than double the performance boost. Read more about StarRocks Vs Druid.
  • Streamlined storage-compute-separated architecture
    StarRocks employs a storage-compute separation model, making it lightweight and independent of external systems. It dynamically scales to accommodate changing workloads and facilitates direct queries on distributed storage, such as AWS S3.
  • Real-time updates with primary key indexing
    StarRocks supports real-time updates and deletions through its primary key table, without sacrificing query speed. The specialized primary key index ensures efficient handling of upserts and deletes, keeping only the latest version of data available for queries.

StarRocks is an excellent alternative for those seeking a high-performance OLAP database with real-time analytics capabilities, scalability, and cost-efficient operations.