Massively Parallel Processing (MPP)

What is Massively Parallel Processing (MPP)?

If you’ve ever wondered why some engines stay snappy under mixed, join-heavy workloads while others slow to a crawl, the difference often comes down to how they move and combine intermediate results. This is where MPP (massively parallel processing) earns its keep.

What MPP actually does:

  1. Fragment the plan
    The optimizer breaks a query into fragments: scan, filter/project, join, aggregate, exchange. Think of each fragment as a small assembly line stage.

  2. Fan out fragment instances
    Every fragment is instantiated many times and scheduled across the cluster. These instances are the smallest units of parallel work and let the engine match compute to available cores.

  3. Pipeline and vectorize
    Inside each instance, operators run in a pipeline (Scan → Filter/Project → Join/Aggregate → Exchange). Vectorization keeps CPU hot by processing data in columnar batches instead of row-by-row function calls. Pipelines also provide backpressure so fast stages don’t overrun slow ones.

  4. Exchange data in memory
    When two big sides must meet (distributed JOIN, global GROUP BY), rows are hashed on the key and routed directly over the network to peers. Well-behaved runs never touch disk for these exchanges; spilling is a safety valve, not the plan.

  5. Distribute finalization
    Partials are merged in a tree across multiple nodes—partial → merge → final—so “the last step” isn’t a single machine. The planner/coordinator orchestrates but doesn’t become the bottleneck. (A sketch of how a query maps onto these steps follows this list.)
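
To make these steps concrete, here is a sketch of how one join-plus-aggregation query might map onto fragments. The table and column names are hypothetical, and the exact fragment boundaries and join strategy depend on the engine and its optimizer.

    -- Hypothetical schema: an orders fact table joined to a customers dimension.
    SELECT c.region, COUNT(*) AS order_cnt, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY c.region;

    -- One plausible fragment breakdown (steps refer to the list above):
    --   Fragment A: scan orders, apply the date filter, project needed columns (steps 1-3)
    --   Fragment B: scan customers; small enough to broadcast to the join fragments (step 4)
    --   Fragment C: hash join on customer_id, then a partial GROUP BY per instance (steps 3-4)
    --   Exchange:   hash-repartition the partials by region, in memory (step 4)
    --   Fragment D: merge the partials and emit final aggregates in parallel (step 5)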


MPP vs Other Compute Architectures

Modern distributed query engines typically adopt one of three architectural paradigms to execute analytical workloads: scatter-gather, stage-based (MapReduce-style), and in-memory MPP. Each has distinct trade-offs in terms of scalability, latency, and workload suitability. 


1. Scatter-Gather Architecture

How it works
A coordinator breaks a query into sub-tasks and dispatches them to worker shards. Each shard scans and filters its local data and often performs partial aggregations. The coordinator merges those partials, applies any remaining GROUP BY/ORDER BY/LIMIT work, and returns the final result. Most of the heavy lifting happens close to the data; the network carries compact intermediate results instead of raw rows.
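
As a rough, engine-agnostic illustration of this division of labor (the table and column names are hypothetical; systems like ClickHouse express the fan-out through distributed tables rather than hand-written per-shard queries):

    -- What each shard runs locally: scan, filter, and partial aggregation over its slice.
    SELECT user_id, COUNT(*) AS hits
    FROM events_local
    WHERE event_date = '2024-06-01'
    GROUP BY user_id;

    -- What the coordinator does with the returned partials: merge them,
    -- apply any remaining GROUP BY/ORDER BY/LIMIT, and return the result.
    -- Conceptually: SUM the per-shard hit counts per user_id, then ORDER BY hits DESC LIMIT 100.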

Used by
ClickHouse and similar coordinator-merge systems.

Strengths
• Simple mental model—easy to reason about where work runs and to debug slow queries.
• Very fast for single-table scans with filters, projections, and rollups that push down to shards.
• Network-efficient—shards return partial aggregates or compressed blocks, not full row sets.
• Scales reads predictably by adding shards for more parallel scan bandwidth.

Limitations
• Coordinator hot spot—final merges and aggregations centralize some work; at high concurrency this can bottleneck.
• Joins are constrained—common patterns are broadcasting a small table to shards or co-locating large tables on the same shard keys; big-table-to-big-table repartitioned joins are not the default.
• Skew sensitivity—uneven key distribution can create stragglers and long tails.
• Write-time layout matters—performance depends on good partitioning, sorting, and projections.

Best for
Operational analytics on wide, flat fact tables; single-table access with selective filters and simple to moderate aggregations; high-throughput dashboards where joins are limited or dimensions are small enough to broadcast.

2. Stage-Based Execution (MapReduce, DAG)

How it works
Queries are decomposed into multiple stages separated by shuffles. A shuffle redistributes data across executors and creates a stage boundary; upstream tasks must finish before downstream tasks begin. Intermediate data typically lands on local disk between stages. Optimizers can combine, reorder, or adapt plans (e.g., dynamic partition pruning, adaptive query execution), but the fundamental rhythm is “execute → materialize → execute.”
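
For intuition, here is the same kind of join-plus-aggregation query annotated with where stage boundaries typically fall. The table names are hypothetical, and real plans vary with the optimizer and its adaptive features.

    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region;

    -- Stage 1: scan orders and customers; write shuffle output partitioned by customer_id
    --          (intermediate files land on local disk)
    -- Stage 2: read the shuffle files, perform the join, emit rows repartitioned by region
    --          (another disk-backed shuffle)
    -- Stage 3: read the shuffle files, aggregate by region, write the final result
    -- Each boundary is a barrier: downstream tasks start only after upstream tasks finish.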

Used by
Apache Spark, Databricks Photon, Flink (in batch mode), and similar DAG-based systems.

Strengths
• Very flexible—handles SQL, DataFrames, ETL, ML, and both batch and streaming patterns.
• High fault tolerance—failed tasks are retried from materialized boundaries.
• Elastic scaling—great for cloud autoscaling and spot/ephemeral compute pools.
• Rich ecosystem—connectors, libraries, and tooling are mature.

Limitations
• Higher latency—disk I/O and serialized stage transitions add seconds to minutes.
• Fragmented pipelines—more moving parts (shuffle, spill, serialization) to tune.
• Operational complexity—resource sizing, partitioning, and shuffle tuning matter a lot for large joins/aggregations.

Best for
Large-scale ETL, periodic backfills, feature engineering, and offline transformations of structured or semi-structured data where minute-level latency is acceptable.

3. In-Memory MPP Execution

How it works
Nodes execute fragments of a physical plan concurrently and exchange intermediate results directly over the network using in-memory data streams. Operators are pipelined and vectorized for high CPU efficiency. Distributed hash joins and aggregations run across nodes; engines spill to disk only when needed. Techniques like runtime filters, dictionary encoding, and code generation keep hot paths tight.
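
One of those techniques, runtime filters, is easiest to see as a rewrite the engine effectively performs at run time. This is a conceptual sketch with hypothetical table names; the real filter is usually a Bloom or IN-list filter built from the join’s hash table, not literal SQL.

    -- Original query: the selective predicate sits on the dimension side.
    SELECT o.order_id, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.region = 'EU';

    -- While building the hash table for customers, the engine collects the matching keys
    -- and pushes them into the orders scan, roughly as if the query had been written as:
    --   ... WHERE o.customer_id IN (SELECT customer_id FROM customers WHERE region = 'EU')
    -- Non-matching orders rows are dropped at the scan, before they cross the network.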

Used by
StarRocks, Apache Impala, Greenplum, and other modern MPP databases.

Strengths
• Low-latency execution for complex, multi-table queries—JOINs, GROUP BYs, windows—at interactive speeds.
• Fully parallel joins—supports repartition (hash) joins, broadcast joins, and co-located joins.
• Real-time friendly—ingests continuously and makes fresh data queryable quickly for user-facing analytics.
• Concurrency headroom—built for many short queries from dashboards and applications.

Limitations
• Needs solid infra—sufficient memory and high-throughput networking are important.
• Requires discipline—statistics, partitioning/bucketing, and materialized views should be managed to sustain performance as data and concurrency grow.

Best for
Interactive dashboards, customer-facing analytics, anomaly/fraud detection loops, and high-cardinality aggregations with JOIN-heavy query patterns.

Summary Comparison Table

Feature | Scatter-Gather | Stage-Based (MapReduce/DAG) | In-Memory MPP
Data Exchange | Workers → coordinator | Disk-backed between stages | Memory-to-memory streams
Join Capability | Limited; broadcast or co-location favored | Flexible but slower due to shuffles | Fully distributed (hash, broadcast, co-locate)
Fault Tolerance | Basic (retries at query/task level) | High (stage materialization) | Moderate (operator-level retries; spilling varies)
Latency | Moderate (ms–seconds) | Higher (seconds–minutes) | Low (often sub-second to seconds)
Real-Time Suitability | Limited | Poor | Excellent
Resource Elasticity | Limited | High | Moderate to High
Operational Complexity | Low | High | Medium
Skew Sensitivity | Medium–High | Medium (mitigated by partitioning) | Medium (mitigated by repartitioning/runtime filters)
Typical Workloads | Single-table rollups, log exploration | ETL, backfills, batch ML | Interactive BI, app-embedded analytics


Core Features of an MPP Database

To understand why MPP engines deliver high performance and scale for interactive analytics, it helps to unpack the design patterns they use:

  1. Independent node design
    Each node runs its own compute with local memory and manages its slice of the data. This minimizes contention and lets the system scale out by adding nodes, each contributing parallel scan, join, and aggregate capacity.

  2. Data partitioning
    Tables are horizontally partitioned (hash, range, or buckets) so each node processes just its portion. Good partition keys align with common filters and joins to reduce cross-node traffic and hotspots.

  3. Query fragmentation and scheduling
    The optimizer breaks a plan into fragments—scans, joins, aggregations, exchanges—which are scheduled across nodes. Multiple instances of the same fragment run in parallel to saturate CPU and I/O. Pipelines keep operators busy without unnecessary materialization.

  4. Shuffles and join strategies
    When a true distributed join or GROUP BY is required, the engine repartitions data (hash shuffle) so matching keys meet on the same node. Engines also support broadcast joins (ship the small side to all nodes) and co-located joins (plan tables on compatible partition keys to avoid shuffles). Modern systems prefer in-memory exchanges and only spill when necessary. (A DDL sketch illustrating partitioning and co-location follows this list.)

  5. Coordinator role
    A lightweight coordinator parses SQL, builds the physical plan, and orchestrates fragment instances. Unlike coordinator-merge systems, the “last mile” aggregation is typically distributed (often a tree/partial-final pattern), so the coordinator doesn’t become a bottleneck as concurrency grows.
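
To ground items 2 and 4 above, here is a StarRocks-style DDL sketch that buckets a fact table and a dimension table on the same key and places them in one colocation group, so joins on that key need no shuffle. Table names, bucket counts, and properties are illustrative; check your engine’s documentation for exact syntax.

    CREATE TABLE orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        order_date   DATE,
        amount       DECIMAL(18, 2)
    )
    DUPLICATE KEY (order_id)
    PARTITION BY RANGE (order_date) (
        PARTITION p202406 VALUES LESS THAN ("2024-07-01")
    )
    DISTRIBUTED BY HASH (customer_id) BUCKETS 32
    PROPERTIES ("colocate_with" = "cust_group");

    CREATE TABLE customers (
        customer_id  BIGINT,
        region       VARCHAR(32)
    )
    DUPLICATE KEY (customer_id)
    DISTRIBUTED BY HASH (customer_id) BUCKETS 32
    PROPERTIES ("colocate_with" = "cust_group");

    -- Same bucketing key, type, and bucket count, plus a shared colocation group:
    -- joins on customer_id can run locally on each node instead of shuffling rows.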


A Closer Look: MPP in Practice with StarRocks

StarRocks is an MPP engine by design. Its execution core leans into MPP’s strengths while sidestepping common bottlenecks in other architectures.

Fragmentation and Parallel Execution

When a query is submitted, StarRocks decomposes it into logical fragments. Each fragment covers one part of the plan—scan, project/filter, join, aggregate, exchange—and is instantiated into multiple fragment instances, the smallest schedulable units.

Instances run in parallel on backend (BE) nodes. Operators are pipelined (for example, Scan → Project/Filter → Join/Aggregate → Exchange) and vectorized so CPUs process columnar batches efficiently instead of row by row. The engine chooses the degree of parallelism per fragment based on available cores, table partitioning/bucketing, and runtime limits, so the cluster’s compute is saturated without introducing artificial stage barriers.


Figure: SELECT COUNT(*) FROM table GROUP BY id is split into three fragments. Fragment 2 scans, projects, and performs a partial aggregate. The results are hash-repartitioned by id and sent to Fragment 1, which merges the partials. Fragment 0 performs the distributed final aggregation. Each fragment is instantiated many times and scheduled across BE nodes.
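
You can inspect the fragment breakdown yourself before running a query. In StarRocks (as in several MPP engines), EXPLAIN prints the plan as numbered fragments with their scan, aggregation, and exchange operators; the exact output format varies by version, and the table name below is a stand-in for the one in the figure.

    EXPLAIN
    SELECT COUNT(*) FROM events GROUP BY id;
    -- The output lists the plan fragments, showing where partial and final
    -- aggregation run and where the hash exchange on id happens.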

In-Memory Data Shuffling and Merge Operations

For distributed JOINs and GROUP BYs, StarRocks redistributes rows across nodes by hashing on the relevant key (for example, HASH(id) % N). This exchange happens in memory over the network so records with the same key meet on the same node for the next step.

The engine applies the right strategy per query:
• Hash repartition join when both sides are large.
• Broadcast join when a dimension is small enough to ship to all nodes.
• Co-located join when tables share compatible partition/bucket keys and no shuffle is needed.

Aggregation finalization is distributed in a tree—local partials → merged partials → final aggregates or top-N—so there is no single “last hop” node doing all the work. Runtime filters (such as bloom/hash filters) generated by upstream joins are pushed down to scanners to prune non-matching rows early, reducing bytes on the wire and tail latency.
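
The optimizer normally chooses among these strategies on its own, but StarRocks also accepts join hints, which are handy for experimenting with or pinning a strategy. The tables below are hypothetical, and hint syntax can differ across versions and engines.

    -- Shuffle (hash repartition) join: both sides are redistributed by the join key.
    SELECT o.order_id, c.region
    FROM orders o JOIN [SHUFFLE] customers c ON o.customer_id = c.customer_id;

    -- Broadcast join: ship the small dimension table to every node and join locally.
    SELECT o.order_id, c.region
    FROM orders o JOIN [BROADCAST] customers c ON o.customer_id = c.customer_id;

    -- Co-located join: if both tables share compatible bucketing and a colocation group,
    -- the join runs node-locally with no shuffle and no hint is required.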


Figure: For SELECT COUNT(*) FROM table GROUP BY id, each node computes a local partial aggregate, then an in-memory hash exchange (hash(id) % N) routes keys to peer nodes. Multiple fragment instances compute the final aggregates in parallel and stream results to the coordinator.

Why this avoids common bottlenecks
• Versus coordinator–merge (scatter–gather): final aggregation is not centralized; merges are parallelized, so the coordinator doesn’t become a hotspot at high concurrency.
• Versus stage-based (MapReduce/DAG): exchanges do not materialize to disk between stages; the pipeline keeps flowing, which removes seconds-level penalties from repeated write/read cycles.

Scaling Characteristics

Execution runs on independent compute nodes with local CPU and memory. Nodes are stateless from a query-execution perspective, so you can scale horizontally by adding more compute. As capacity grows, the scheduler increases fragment-instance parallelism to consume additional cores. Smooth scale-out depends on even partition/bucket distribution and healthy network bandwidth.

The architecture supports high-concurrency scenarios such as:
• Real-time dashboards and product analytics
• External/customer-facing analytics applications
• Lakehouse workloads that join today’s changes with deep historical context

Operating notes that help on day one

• Align partition/bucket and sort keys with hot filters and join keys to maximize pruning, co-location, and cache locality.
• Keep frequently joined dimensions compact enough to broadcast; it removes unnecessary shuffles.
• Use incremental materialized views to maintain top-N, rollups, and expensive expressions as data lands; this preserves interactive latency under load (a sketch follows this list).
• Watch for key skew; rebalance or salt hot keys so hash exchanges stay balanced and the long tail shrinks.
• Maintain table and column statistics so the optimizer can choose sensible join orders and exchange strategies as data evolves.
• Isolate workloads with separate compute warehouses and admission control (dashboards vs. ELT vs. AI agents) to protect p95 latency without overprovisioning.
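
For the materialized-view note in the list above, a minimal StarRocks-style sketch; the names are hypothetical and refresh options vary by version.

    -- Keep a per-region daily revenue rollup fresh as data lands, so dashboards read a
    -- small precomputed result instead of re-aggregating the fact table on every query.
    CREATE MATERIALIZED VIEW mv_daily_revenue
    REFRESH ASYNC
    AS
    SELECT c.region, o.order_date, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region, o.order_date;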

Putting it together

The MPP pattern here is straightforward: break the plan into small parallel units, push work down, exchange the smallest possible intermediate data in memory, and finish in parallel. That’s why StarRocks remains responsive on join-heavy, high-cardinality queries with many concurrent readers—the exact situations where coordinator-merge or disk-staged systems tend to slow down.


Why MPP Matters

Shared-memory SMP systems—multiple CPUs against one memory and I/O fabric—are excellent for OLTP but hit walls on OLAP. Long scans, big joins, and global aggregations fight over the same sockets, caches, and buses. Vertical scale is expensive, and a single heavy query can drag down everyone else.

MPP changes the shape of the problem: push the work to where the data lives, run many small pieces in parallel, move only what’s necessary, and finish in parallel.

MPP Solves These Problems by Distributing Both Computation and Data


Workload parallelism

In an MPP engine, a query is cut into fragments (scan, project/filter, join, aggregate, exchange). Each fragment spins up many instances across the cluster. Nodes scan their partitions, compute partial results, and exchange only the keys they must share.
Result: better CPU and memory utilization, lower latency, and far higher concurrency.

Concrete pattern

A GROUP BY runs as “local partials → in-memory repartition by key → distributed merge/final.” You avoid funneling every row through a single coordinator.
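
Spelled out as SQL, the pattern is roughly the rewrite below. The engine performs this split internally; the table names are placeholders and you never write the phases yourself.

    -- Logical query:
    SELECT id, COUNT(*) AS cnt FROM events GROUP BY id;

    -- Phase 1, on each node over its local partition:
    --   SELECT id, COUNT(*) AS partial_cnt FROM local_slice GROUP BY id;
    -- Exchange: hash-repartition the partial rows by id, in memory.
    -- Phase 2, on whichever node owns each id after repartitioning:
    --   SELECT id, SUM(partial_cnt) AS cnt FROM received_partials GROUP BY id;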

Horizontal Scalability
MPP scales out, not up. As data volume or query load grows, more compute nodes can be added to the cluster. Each new node contributes additional CPU, memory, and bandwidth.

  • Example: At TRM Labs, their largest customer-facing workload spans 115 TB+ and grows by 2–3% monthly. Migrating to a horizontally scalable MPP engine helped maintain sub-3s latency with high concurrency—something BigQuery struggled to deliver cost-effectively, especially as dataset size and dashboard complexity increased. 

Isolation of Resources
Each MPP node is independent—there’s no shared memory or CPU. This minimizes contention between workloads, even in multi-tenant or mixed query environments.

  • Contrast: In SMP systems, a large aggregation query can monopolize memory bandwidth, blocking smaller lookups.

  • Case Study Insight: At Eightfold.ai, the team found that Redshift’s leader node became a bottleneck under concurrent workloads. StarRocks’ distributed architecture—featuring separate frontends for planning and backends for execution—allowed them to avoid single-node saturation, unlocking high concurrency without degrading latency across their customer-facing analytics platform.

Fault Tolerance

Because tasks are distributed, MPP systems can survive node failures without crashing the entire query. Tasks can be retried or reassigned to healthy nodes.

  • Example: In Kubernetes-native deployments (e.g., at NAVER), MPP engines like StarRocks automatically reassign tasks if a pod dies—keeping the system resilient without manual intervention.

Elasticity for the Cloud

Modern MPP engines decouple compute from storage, allowing compute nodes to scale elastically based on workload. Stateless nodes can spin up or down as needed, reducing idle cost.

  • Implementation Tip: MPP systems with support for object storage (like S3 or HDFS) can cache hot data on local disks (e.g., EBS) to reduce I/O latency without sacrificing durability.

  • Eightfold Insight: Their team found serverless OLAP offerings lacked sufficient EBS caching to maintain high concurrency. They opted for dedicated compute nodes to ensure local data residency and predictable performance—especially critical for supporting low-latency, multi-tenant analytics experiences at scale.

In summary, MPP fundamentally changes the game for analytics systems by distributing both computation and data. This architecture supports real-time responsiveness, large-scale aggregation, and concurrency at levels that centralized systems cannot sustain.

Conclusion

Massively Parallel Processing (MPP) is the practical model for making analytics feel real-time at scale. Instead of stretching a single box, MPP splits a query into fragments, pushes work to where the data lives, exchanges only what’s necessary, and finishes in parallel across independent nodes.

What this unlocks:

  • Scalability: Grow horizontally without introducing a single coordinator or shared-memory choke point.
  • Performance: Keep latency low on complex, multi-stage queries via pipelined, vectorized operators and in-memory exchanges.
  • Resilience: Isolate faults and workloads; retry failed tasks; scale elastically as traffic and data fluctuate.

If you’re building high-concurrency dashboards, lakehouse analytics that blend “today” with deep history, or multi-tenant external analytics, MPP is the right backbone. With cloud-native engines that pair vectorized execution, distributed finalization, and decoupled storage—like StarRocks—this isn’t a niche pattern anymore; it’s the default architecture for modern analytical systems.

Frequently Asked Questions (FAQ)

Q1. Is MPP the same as distributed computing?
Not exactly. MPP is a specific pattern within distributed computing for parallel SQL analytics. It uses multiple independent compute nodes with partitioned data, parallel operators, and memory-to-memory exchanges (with spill as a fallback). Other distributed systems (for example, classic Hadoop MapReduce) also distribute work, but typically materialize intermediates to disk and optimize for throughput over interactive latency.

Q2. When should I use MPP over a traditional SMP database?
Reach for MPP when at least one of these holds:
• Data sizes are large (hundreds of GB to many TB and beyond).
• Queries are complex—multi-table joins, GROUP BY, window functions.
• Concurrency is high (dashboards, APIs, or agents) and you want sub-second to seconds-level latency.
• You need to scale horizontally across nodes or tenants.

Q3. What are the trade-offs of using MPP systems?
• More knobs: partitioning/bucketing, statistics, and join strategies matter.
• Sensitivity to data shape: skewed keys or a poorly chosen broadcast can hurt tails.
• Memory and network pressure: repartitioned joins and shuffles are fast when in memory; engines may spill, but that adds latency.
• Some workloads (very heavy backfills/ETL) may be cheaper in a stage-based/DAG engine.

Q4. How does MPP handle JOIN operations efficiently?
MPP engines combine several strategies:
• Hash (repartition) join: redistribute both sides by join key so matches meet on the same node.
• Broadcast join: copy a small table to all nodes and join locally.
• Co-located join: plan tables on compatible partition/bucket keys to avoid shuffles.
Vectorized, pipelined execution plus “partial → final” aggregation patterns reduce CPU per row and keep data movement minimal.

Q5. How is fault tolerance achieved in MPP systems?
• Task-level retries: failed fragment instances are re-run; lost intermediates are recomputed.
• Failure isolation: a node or pod failure doesn’t take down the whole system; the scheduler resubmits work elsewhere.
• Streaming/ingest uses checkpoints, but interactive queries typically rely on re-execution rather than persisting every stage to disk.

Q6. What’s the role of a coordinator node in MPP?
The coordinator (often called a frontend) parses SQL, optimizes the plan, splits it into fragments, and schedules fragment instances. Finalization of large aggregates is distributed across workers; the coordinator usually just orchestrates and streams results, avoiding a central bottleneck.

Q7. Does MPP require decoupled storage?
No. You can run with local disks or with decoupled storage (object stores, data lake tables). Decoupled storage improves elasticity and simplifies lakehouse access; performance depends on smart caching of hot data on local SSD/EBS and minimizing cold reads.

Q8. How does MPP support real-time data?
• Continuous ingest via streams or micro-batches.
• Table types that allow updates (for example, primary-key tables with upserts/deletes; a sketch follows this answer).
• Incremental materialized views to keep common rollups and top-N fresh as data lands.
Together, these make MPP viable for near-real-time analytics like fraud/risk checks, pacing, or live KPIs.
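
As one concrete example of an updatable table type, here is a StarRocks-style primary-key table sketch; the schema is hypothetical and exact options vary by version.

    -- Rows sharing an order_id are upserted in place, so late updates and corrections
    -- become visible to queries shortly after they are ingested.
    CREATE TABLE orders_rt (
        order_id   BIGINT NOT NULL,
        status     VARCHAR(16),
        amount     DECIMAL(18, 2),
        updated_at DATETIME
    )
    PRIMARY KEY (order_id)
    DISTRIBUTED BY HASH (order_id) BUCKETS 16;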

Q9. Is MPP suitable for customer-facing analytics (CFA)?
Yes—provided the engine supports distributed joins/aggregations, admission control, and workload isolation. Teams commonly isolate interactive workloads (dashboards/APIs/agents) from ELT using separate compute pools or “warehouses” to keep p95 latency predictable under bursty traffic.

Q10. Do I always need an MPP engine?
No. For small datasets, low concurrency, and simple queries, a well-tuned single-node database or a smaller cloud warehouse tier can be sufficient. As data, join complexity, or concurrency grow, MPP avoids the limits of vertical scaling and centralized coordination.