Apache Kafka
 
 

What is Apache Kafka?

Let’s start with a familiar challenge.

Imagine you’re building a system that needs to handle tens of thousands of events per second—clicks, purchases, logins, sensor updates, fraud alerts—and make sense of them as they happen. You need something fast, fault-tolerant, and scalable. Something that won’t fall over when you double your traffic. And you want the data to be accessible not just once, but for days, weeks, maybe forever.

This is the kind of problem Apache Kafka was built to solve.

Kafka is an open-source, distributed event streaming platform. It was originally built at LinkedIn to track user activity at scale, but today it's a critical backbone for systems at companies like Netflix, Goldman Sachs, Uber, and thousands more. At its core, Kafka acts like a high-performance, distributed log—but one that’s incredibly good at scaling out, retaining data, and letting you process it in real time.

Why Think of Kafka as a Log?

This part is important. A lot of people come to Kafka thinking it’s just another message queue—like RabbitMQ, maybe with more horsepower. But Kafka’s core design is very different.

Kafka is fundamentally a distributed append-only log. That means every event is written to disk in the order it arrives, and it stays there—typically for hours, days, or even forever. Unlike traditional queues that delete messages after they're read, Kafka retains data, and consumers control their own progress through it. If something fails downstream? Just rewind and try again.

This model unlocks a lot of powerful capabilities:

  • Replayability: Re-process old data by rewinding the log.

  • Durability: Messages don’t vanish after consumption—they’re persisted and replicated.

  • Separation of concerns: Producers and consumers don’t need to know about each other’s timing or failures.

 

Core Concepts, Explained Like a System Designer

Let’s break down Kafka’s key components, with practical context.

Topics and Partitions: Your Streams, Your Scale

A topic is a named stream of records. You can think of it like a pipe: your producer puts events in one end, and your consumers pull them out the other.

But to scale, Kafka doesn’t just use a single pipe—it splits each topic into partitions. Each partition is an independent log, and Kafka spreads these across multiple brokers (servers).

This is where Kafka gets its performance. Each partition can be read and written independently and in parallel.

Example:
A topic called orders might have 12 partitions. Each new order is assigned to a partition using the user_id as the key, ensuring all actions for a single user stay ordered, but still allowing massive parallelism.
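
To make this concrete, here is a minimal Java producer sketch (the broker address, key format, and record payload are illustrative assumptions). Because the default partitioner hashes the record key, every event carrying the same user_id lands in the same partition and stays in order:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") is hashed to pick one of the topic's 12 partitions,
            // so all of this user's orders stay in a single, ordered partition.
            producer.send(new ProducerRecord<>("orders", "user-42", "{\"item\":\"book\",\"qty\":1}"));
        }
    }
}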

Producers: Writing to Kafka, Fast and Flexible

A producer is your data source. It could be your web server, your backend service, a mobile API gateway—anything that emits events.

Kafka producers support:

  • Key-based partitioning for consistent routing.

  • Asynchronous writes so you don’t block on every send.

  • Configurable durability, depending on how critical the data is.

You choose the trade-offs:

  • acks=0: fire-and-forget. Lowest latency, but messages can be lost without the producer ever knowing.

  • acks=1: the leader acknowledges once it has written the record locally. Low latency, with a small window for loss if the leader fails before followers replicate.

  • acks=all: the write is acknowledged only after all in-sync replicas have it. Choose this when durability is paramount.

Real-world analogy:
Think of a retail POS system publishing every sale into Kafka in real time, keyed by product ID. Inventory systems, analytics tools, and dashboards can all read from this single source of truth.
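
Here is a rough sketch of how those trade-offs show up in producer code, using the POS analogy above (the sales topic, key, and payload are assumptions). acks=all plus idempotence gives the strongest delivery guarantees, and the callback keeps the send asynchronous:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class PosSalePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // avoid duplicates on retry

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Asynchronous send: the callback runs when the broker acknowledges (or fails).
            producer.send(
                new ProducerRecord<>("sales", "product-123", "{\"amount\": 19.99}"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();   // handle or retry in real code
                    } else {
                        System.out.printf("sale written to partition %d, offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}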

Consumers: Reading at Your Own Pace

Consumers are the clients that process the data. They could be real-time fraud detection models, batch loaders, or alerting systems.

Kafka’s consumer model is incredibly flexible:

  • Consumer groups allow load-balanced, parallel consumption.

  • Each group gets its own cursor, independent of others.

  • You can rewind, re-read, and pause as needed.

Example:
A fraud detection service with five replicas forms a consumer group. Kafka evenly divides the topic's partitions among them, giving each replica a subset of the data to process.
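
A minimal sketch of one such replica (the group id, topic name, and the score() placeholder are assumptions for illustration). Start five copies of this process with the same group.id and Kafka splits the topic's partitions among them automatically:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class FraudDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");   // same group => shared work
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transactions"));   // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each replica in the group only sees records from its assigned partitions.
                    score(record.key(), record.value());
                }
            }
        }
    }

    static void score(String key, String value) { /* fraud-scoring logic goes here */ }
}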

Brokers and Clusters: Kafka’s Distributed Backbone

A Kafka broker is a server that stores data and responds to client requests. A Kafka cluster is made up of multiple brokers. Partitions are spread across these brokers for load balancing.

For each partition:

  • One broker is the leader (handles reads and writes).

  • Others are followers (replicate the data).

Kafka uses leader election to ensure availability. If a leader broker crashes, a follower takes over. This is why Kafka stays available even when hardware fails.

In production, you’ll usually use a replication factor of 3. That means each partition is stored on 3 brokers, so its data survives even if two of them fail (write availability also depends on settings like min.insync.replicas).
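
As a sketch, creating a topic with that layout through Kafka's Java AdminClient might look like this (the topic name and partition count are assumed from the earlier orders example):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 12 partitions spread across brokers; each partition kept on 3 brokers.
            NewTopic orders = new NewTopic("orders", 12, (short) 3);
            admin.createTopics(List.of(orders)).all().get();   // wait for the cluster to confirm
        }
    }
}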

ZooKeeper vs. KRaft: Managing Kafka’s Metadata

Until recently, Kafka used ZooKeeper to manage metadata (e.g., topic configs, broker health, leader election). But ZooKeeper added complexity.

Starting with Kafka 2.8, a new mode was introduced: KRaft (Kafka Raft Metadata mode). It removes ZooKeeper entirely and embeds metadata management into Kafka itself.

KRaft became production-ready in Kafka 3.3 and, as of Kafka 3.5, is the default for new clusters; ZooKeeper mode is deprecated and was removed entirely in Kafka 4.0.

  • ZooKeeper mode: Still widely used in older clusters.

  • KRaft mode: Simpler deployments, easier scaling, no external coordination service.

How Kafka Moves Data: A Step-by-Step Example

Let’s walk through what happens when you send a message.

  1. Your producer sends a record to a topic called checkout-events.

  2. Kafka uses a key to decide which partition to use (say, partition 3).

  3. The broker appends the record to that partition’s log.

  4. Followers replicate the data.

  5. One or more consumers read from partition 3, starting at their last offset.

  6. The offset is committed (automatically or manually) depending on your settings.

It’s simple on the surface—but powerful under the hood. And it’s incredibly fast. Kafka can handle millions of messages per second, with latencies measured in milliseconds.
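
To see steps 5 and 6 from the consumer's side, here is a small sketch that attaches directly to partition 3 of checkout-events and rewinds to an older offset (the offset and partition number are illustrative; most services use subscribe() with a consumer group rather than manual assignment):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CheckoutReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to partition 3 of checkout-events and rewind to a known offset.
            TopicPartition partition3 = new TopicPartition("checkout-events", 3);
            consumer.assign(List.of(partition3));
            consumer.seek(partition3, 0L);   // 0 = replay from the beginning of the partition

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset %d: %s%n", r.offset(), r.value()));
        }
    }
}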

Kafka’s Storage Model: Durability with Simplicity

Kafka stores data on disk in log segments, each about 1 GB by default. Segments are written sequentially and served largely from the OS page cache, with memory-mapped index files for fast offset lookups.

You control how long Kafka keeps data through:

  • Time-based retention (e.g., keep messages for 7 days).

  • Size-based retention (e.g., keep up to 500 GB).

  • Log compaction (retain only the latest record for each key).

Compacted topics are great for things like metadata tables, where you only care about the most recent value per ID.
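
Retention and compaction are configured per topic. As a sketch, creating a compacted topic through the AdminClient might look like this (the topic name and settings are assumptions; cleanup.policy, retention.ms, and retention.bytes are the standard topic-level configs behind the options above):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Keep only the latest value per key instead of expiring by time or size.
        // (For time- or size-based retention, set "retention.ms" or "retention.bytes" instead.)
        NewTopic userProfiles = new NewTopic("user-profiles", 6, (short) 3)
                .configs(Map.of("cleanup.policy", "compact"));

        try (Admin admin = Admin.create(props)) {
            admin.createTopics(List.of(userProfiles)).all().get();
        }
    }
}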

 

What Makes Kafka Special?


High Throughput, Low Latency

Kafka is optimized for performance at scale:

  • Sequential writes to disk.

  • Batched I/O.

  • Compression (Snappy, LZ4, Zstd).

  • Zero-copy network transfer.

Durability and Fault Tolerance

  • Data is persisted to disk before acknowledgment.

  • Replication across brokers provides redundancy.

  • Failures are expected—and recoverable.

Scalability by Design

  • Kafka scales horizontally via partitioning.

  • Producers, consumers, and brokers can all scale independently.

  • Clusters can span hundreds of brokers.

Data Replay and Retention

  • Kafka isn’t a queue—it’s a log.

  • Consumers can replay old messages anytime.

  • Useful for reprocessing, backfills, audits, and disaster recovery.

 

Kafka vs. Other Event Streaming and Messaging Systems

Making the Right Architectural Choice

Apache Kafka is often the first platform considered when building event-driven systems—but it's not the only option. In practice, many teams evaluate Kafka alongside other technologies like RabbitMQ, Apache Pulsar, Amazon Kinesis, and Google Pub/Sub.

Each system has strengths and trade-offs. The best choice depends on what you're optimizing for: throughput, latency, ease of use, delivery guarantees, multi-tenancy, cloud-native integration, and more.

Let’s walk through a side-by-side comparison, starting with core capabilities.

High-Level Comparison Table

Feature | Kafka | RabbitMQ | Apache Pulsar | Amazon Kinesis | Google Pub/Sub
Core Model | Distributed log | Traditional message queue | Distributed log + message queue hybrid | Managed sharded log | Managed pub/sub system
Replayability | Yes (offset-based) | No (unless persisted separately) | Yes (cursor-based) | Yes (within the retention window) | Yes (7-day retention default)
Message Ordering | Per partition | Per queue | Per partition (with key) | Per shard | Ordering with keys (manual config)
Delivery Guarantees | At-least-once / exactly-once | At-most-once / at-least-once | At-least-once / exactly-once | At-least-once | At-least-once
Latency (typical) | Low (sub-10 ms) | Very low (sub-ms to a few ms) | Low (similar to Kafka) | Higher (~70–100 ms) | Higher (~100 ms)
Throughput | Very high (>1M msgs/sec per topic) | Moderate (tens of thousands of msgs/sec) | Comparable to Kafka | Moderate (per-shard write limit of 1 MB/s or 1,000 records/sec) | Moderate to high (depends on quotas)
Horizontal Scaling | Partitions across brokers | Manual queue sharding | Segmented architecture | Automatic (via shards) | Fully managed scaling
Multi-Tenancy | Limited (can be done manually) | Poor (not designed for this) | Native support with namespaces | No real isolation | Basic project-level isolation
Persistence Model | Disk-backed logs, local storage | In-memory/disk queue | Tiered (hot + cold storage) | S3-like backend | Managed durable storage
Deployment Flexibility | Self-hosted, Kubernetes, cloud-native | Lightweight, easy to deploy | Self-hosted, Kubernetes-native | Fully managed (AWS only) | Fully managed (GCP only)
Ecosystem Tools | Kafka Streams, Connect, Schema Registry | Plug-ins, Shovel, Federation | Pulsar Functions, IO Connectors | AWS Firehose, Lambda, Glue | GCP Dataflow, Cloud Functions

Kafka vs. RabbitMQ

When to Use Kafka vs. When to Use RabbitMQ

  • Kafka is better when you need:

    • High-throughput streaming pipelines.

    • Event sourcing or reprocessing capabilities.

    • Long-term event retention.

    • Real-time analytics or streaming joins.

  • RabbitMQ is better when you need:

    • Simple, low-latency task queues or RPC (Remote Procedure Call).

    • Complex routing logic (fanout, topic exchanges, header-based).

    • Lightweight deployment and minimal operational overhead.

Example: If you're building a microservices-based app where services exchange short-lived commands (like “create invoice”), RabbitMQ may be easier. If you're tracking millions of user events for analytics, Kafka will scale far better.

Kafka vs. Apache Pulsar

A More Apples-to-Apples Comparison

Apache Pulsar was designed to address some of Kafka’s architectural limitations.

  • Kafka has a monolithic architecture: storage and serving layers are co-located.

  • Pulsar separates these into a serving layer (brokers) and a storage layer (BookKeeper).

This allows Pulsar to:

  • Scale serving and storage independently.

  • Use tiered storage out of the box (e.g., write hot data to disk, cold data to S3).

  • Support multi-tenancy, geo-replication, and topic-level isolation natively.

However, Kafka:

  • Has a more mature ecosystem (Kafka Streams, Connect, Confluent platform).

  • Outperforms Pulsar in most high-throughput, low-latency workloads due to a more optimized write path.

  • Has broader enterprise adoption and support.

Use Pulsar if you need native multi-tenancy, tiered storage, or geo-replication baked in. Use Kafka if you want a battle-tested, high-performance streaming backbone with rich tooling.

Kafka vs. Amazon Kinesis

Fully Managed vs. Open Source Trade-Off

Kinesis is AWS’s fully managed Kafka-like service—but it’s not a drop-in replacement.

  • Kinesis:

    • Retains data for 24 hours by default (extendable up to 365 days for extra cost).

    • Has higher read/write latencies (~100ms).

    • Uses shards instead of partitions (and limits are stricter).

    • Requires custom integrations for schema management and replay.

  • Kafka:

    • Can run on-prem, in any cloud, or with managed services like Confluent Cloud.

    • Offers greater control over retention, partitioning, replication, and tuning.

    • Better for complex pipelines, especially when latency matters.

Use Kinesis when you’re fully on AWS and want low-maintenance event streaming. Use Kafka when you need portability, control, or tighter SLAs.

Kafka vs. Google Pub/Sub

Google Cloud Pub/Sub is Kafka-like in function but designed for serverless and cloud-native simplicity.

  • It supports:

    • Automatic scaling.

    • At-least-once delivery (exactly-once with Dataflow).

    • 7-day message retention.

    • Ordering by key (but must be enabled manually).

Limitations compared to Kafka:

  • Higher and more variable latency.

  • No built-in support for stream processing or log compaction.

  • Less mature integration ecosystem compared to Kafka + Confluent.

Pub/Sub is a good fit for cloud-native GCP apps that need lightweight, decoupled messaging. Kafka offers more depth for complex, stateful, and low-latency pipelines.

Choosing the Right Tool: Guidelines

If you need... | Consider...
High-throughput event pipelines | Kafka, Pulsar
Low-latency task queues | RabbitMQ
Fully managed cloud-native messaging | Kinesis, Pub/Sub
Replay and long-term event retention | Kafka, Pulsar
Strong multi-tenancy and isolation | Pulsar
Rich ecosystem for stream processing | Kafka
Built-in connectors and CDC support | Kafka + Connect
Serverless simplicity | Pub/Sub

 

Beyond Messaging: Kafka for Stream Processing

Kafka doesn’t just move data. It can process it, too.

Kafka Streams

Kafka Streams is a library for writing streaming applications in Java or Scala. It supports:

  • Stateful operations (e.g., aggregations, joins).

  • Time windows (e.g., "group orders by hour").

  • Fault-tolerant local state storage.

Example:
A streaming app joins orders and payments topics in real time and calculates revenue by product category every five minutes.
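
A compact Kafka Streams sketch of that pipeline (topic names, record shapes, and serdes are assumptions made for this example; a real application would typically use Avro or JSON serdes and a proper value class rather than a Map entry):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;
import org.apache.kafka.streams.kstream.TimeWindows;
import java.time.Duration;
import java.util.AbstractMap;
import java.util.Properties;

public class RevenueByCategory {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "revenue-by-category");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Assumed shapes: orders keyed by order ID with the amount as the value,
        // payments keyed by order ID with the product category as the value.
        KStream<String, Double> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.Double()));
        KStream<String, String> payments =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        orders
            // Join each order with its payment if both arrive within 10 minutes.
            .join(payments,
                  (amount, category) -> new AbstractMap.SimpleEntry<String, Double>(category, amount),
                  JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)),
                  StreamJoined.with(Serdes.String(), Serdes.Double(), Serdes.String()))
            // Re-key by category so revenue can be summed per category.
            .map((orderId, entry) -> KeyValue.pair(entry.getKey(), entry.getValue()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
            // Five-minute tumbling windows.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .reduce(Double::sum)
            .toStream((windowedKey, total) -> windowedKey.key() + "@" + windowedKey.window().start())
            .to("revenue-by-category", Produced.with(Serdes.String(), Serdes.Double()));

        new KafkaStreams(builder.build(), props).start();
    }
}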

Kafka Connect

Kafka Connect is a framework for integrating Apache Kafka with external systems—databases, data warehouses, object stores, and more—without writing custom glue code. It simplifies data movement into and out of Kafka topics using a declarative configuration model and pluggable connectors.

Kafka Connect supports two connector types:

  • Source connectors: Ingest data from external systems into Kafka topics (e.g., MySQL, PostgreSQL, MongoDB, JDBC sources, file systems).

  • Sink connectors: Export data from Kafka topics to downstream systems (e.g., Amazon S3, Google BigQuery, Snowflake, Amazon Redshift, StarRocks).

Kafka Connect can run in two modes:

  • Standalone: For development or low-throughput jobs.

  • Distributed: For high availability, scalability, and fault tolerance in production environments.

Kafka Connect also integrates well with Schema Registry, Kafka REST Proxy, and Debezium (for Change Data Capture).

Example:
A company uses Debezium with Kafka Connect to capture real-time MySQL binlog changes, publish them into Kafka topics, and then write those events to StarRocks using the StarRocks sink connector for near-real-time analytics. StarRocks consumes the data via Kafka, maintains table synchronization, and enables fast queries on fresh data—ideal for use cases like dashboarding, user behavior analytics, or operational monitoring.

StarRocks + Kafka Integration:
StarRocks natively supports streaming ingestion from Kafka via its routine load mechanism or Kafka Connect sink connector. This allows users to continuously load data from Kafka topics into StarRocks tables with automatic parsing, transformation (e.g., JSON, CSV), and checkpointing. It is particularly useful for building real-time analytics pipelines, combining Kafka for event streaming and StarRocks for low-latency SQL queries on structured data.


Conclusion

Apache Kafka has fundamentally changed how we think about moving, storing, and processing data at scale. What started as a durable message bus at LinkedIn has evolved into a robust platform for event-driven architecture, stream processing, and real-time analytics.

Its core abstraction—a distributed, append-only log—enables Kafka to solve a wide variety of use cases with durability, replayability, and horizontal scalability at its foundation. Whether you're building a real-time fraud detection system, streaming logs from microservices, replicating database changes, or powering an analytics pipeline, Kafka gives you a unified and resilient way to handle data in motion.

Kafka is not a one-size-fits-all solution. In some scenarios, simpler systems like RabbitMQ, or fully managed services like Kinesis and Pub/Sub, might be better fits. But if you're designing systems that need to operate at scale with high availability, strong guarantees, and architectural flexibility, Kafka remains one of the most trusted and capable tools in the field.

If you’re getting started, begin by setting up a local instance. If you’re scaling, look into Kafka Connect, Kafka Streams, Schema Registry, and observability tooling. And if you’re deploying to production, pay attention to replication, retention, offset management, and client error handling.

In short, Kafka is not just a message system—it’s a core infrastructure layer for real-time data.

 

Frequently Asked Questions (FAQ)

 

1. Is Kafka a message queue or a log?

Kafka is fundamentally a distributed log, not a traditional queue. While it can act like a message queue (e.g., decoupling producers and consumers), it retains messages even after they're read. Consumers track their own progress, and messages are stored on disk for a configurable period.

 

2. What are the main differences between Kafka and RabbitMQ?

Kafka excels at high-throughput, event replay, and durable log-based processing. RabbitMQ is better for low-latency task distribution, synchronous workflows, and complex routing patterns. Kafka is optimized for real-time data pipelines and analytics; RabbitMQ is ideal for operational messaging between microservices.

 

3. Does Kafka guarantee message ordering?

Yes, but only within a single partition. If you require strict ordering for a stream of related events (e.g., all events for a specific user or order), you must use a key that ensures they go to the same partition.


4. What are Kafka’s delivery guarantees?

Kafka supports:

  • At-least-once delivery (default).

  • At-most-once (if you commit offsets before processing).

  • Exactly-once (with idempotent producers and transactions, available since Kafka 0.11; consumers read with isolation.level=read_committed to see only committed results).
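
For reference, here is a minimal sketch of the transactional producer side (topic names and the transactional ID are assumptions); either both records become visible to read_committed consumers or neither does:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class TransactionalWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-writer-1"); // also enables idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("payments", "order-1", "captured"));
                producer.send(new ProducerRecord<>("ledger", "order-1", "debit 19.99"));
                producer.commitTransaction();   // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();    // simplified error handling for the sketch
            }
        }
    }
}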


5. What is the difference between ZooKeeper mode and KRaft mode?

ZooKeeper was historically used for metadata management and leader election. KRaft (Kafka Raft Metadata mode) is a newer architecture (production-ready since Kafka 3.3) that eliminates ZooKeeper and simplifies operations by internalizing metadata handling within Kafka itself.


6. Can Kafka be used as a database?

No, Kafka is not a database. However, it persists data reliably, supports queries through stream processors, and enables event sourcing. Some use Kafka in event-log-driven architectures where it serves as a source of truth, but it's not designed for transactional workloads or random access queries.


7. How does Kafka handle failures and ensure durability?

Kafka replicates each partition to multiple brokers (typically 3). One replica acts as the leader, others are followers. If the leader fails, Kafka automatically promotes a follower. Data is not acknowledged to producers until written to disk and replicated (based on acks setting).


8. What languages are supported for Kafka clients?

Kafka has official client libraries in Java and Scala, and mature community-supported clients in Python (e.g., confluent-kafka-python), Go, C/C++, Node.js, Rust, and others.


9. How does Kafka compare to Amazon Kinesis or Google Pub/Sub?

Kafka offers more control, portability, and flexibility. Kinesis and Pub/Sub are easier to operate (as managed services) but have trade-offs in latency, throughput tuning, and ecosystem richness. Kafka can be run on-prem, on Kubernetes, or used via Confluent Cloud.


10. What are Kafka Streams and Kafka Connect, and when should I use them?

  • Kafka Streams is a client library for writing real-time processing applications directly on top of Kafka topics. Use it for joins, aggregations, windowing, and transformations.

  • Kafka Connect is a data integration framework for syncing Kafka with external systems (e.g., databases, object stores, search engines). Use it when you need out-of-the-box connectors and minimal custom code.


11. Can Kafka handle multi-tenant workloads?

Kafka supports multi-tenancy at the infrastructure level (via ACLs, quotas, and topic isolation), but it doesn’t natively manage namespace-based isolation like Apache Pulsar. If you need strict tenant separation, you may need to architect around it or use tools like Confluent Platform or Kafka-on-Kubernetes setups.
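
As one example of that infrastructure-level approach, here is a sketch that grants a tenant's consumers read access to a topic prefix via the AdminClient (the principal name and prefix are assumptions, and the cluster must have an authorizer enabled):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.List;
import java.util.Properties;

public class GrantTenantRead {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Allow the "analytics" principal to read every topic under the tenant-a. prefix.
        AclBinding readAcl = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "tenant-a.", PatternType.PREFIXED),
                new AccessControlEntry("User:analytics", "*",
                        AclOperation.READ, AclPermissionType.ALLOW));

        try (Admin admin = Admin.create(props)) {
            admin.createAcls(List.of(readAcl)).all().get();
        }
    }
}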


12. What should I monitor in a Kafka cluster?

At minimum, monitor:

  • Broker health and uptime

  • Partition leader distribution

  • Consumer group lag

  • Disk usage and segment cleanup

  • Producer/consumer error rates

  • ZooKeeper (if in use)

Tools like Prometheus, Grafana, Confluent Control Center, and Burrow are commonly used.
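
Consumer group lag is the one most teams watch first. As a sketch, you can compute it yourself with the AdminClient by comparing committed offsets to the latest offsets (the group name is assumed from the earlier example):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Where the group has committed so far, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("fraud-detection")
                         .partitionsToOffsetAndMetadata().get();

            // The latest offset in each of those partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            // Lag = newest offset in the partition minus the group's committed offset.
            committed.forEach((tp, offset) ->
                    System.out.printf("%s lag = %d%n", tp,
                            latest.get(tp).offset() - offset.offset()));
        }
    }
}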


13. What’s the typical learning curve for Kafka?

Kafka’s core concepts are simple, but operating it at scale and using its broader ecosystem (Streams, Connect, Schema Registry) requires a solid understanding of distributed systems. Expect a moderate-to-steep learning curve if you’re new to streaming architectures—but the payoff is worth it.


14. Is Kafka suitable for small teams or startups?

Yes—with caveats. Kafka can be overkill if your data volume or complexity is low. However, managed services (like Confluent Cloud, MSK, or Redpanda Cloud) make it easier for smaller teams to benefit from Kafka without the operational overhead.