A closer look at Redpanda
Is it worth using Kafka anymore?
Founded in 2019, Redpanda is a Kafka-compatible streaming platform that aims to unify historical and real-time data processing. Written in C++, it promises less operational complexity along with better throughput and latency. In a sense, it is to Kafka what ScyllaDB is to Cassandra.
Technically speaking, the major difference is that Redpanda was built from the ground up with Raft as its consensus protocol. Kafka has since moved to its own Raft-based protocol, KRaft, to replace ZooKeeper. This suggests that Redpanda’s choice was the right one.
If you have ever maintained a large-scale Kafka cluster, you know how painful it is. With recent advances in managed cloud services this argument might seem less relevant, but for me it really isn’t. Managed services have their drawbacks, and a good solution architect should weigh them. For instance, you have limited options for customizing the setup, optimizing it for your specific workload, installing additional monitoring tools on the nodes, and so on. These things have to be considered.
Kafka is the de facto standard streaming platform. During my career, I haven’t worked on a streaming project that didn’t consider Kafka at its core. I would say that the criteria behind our choice were as follows:
- people are generally familiar with the Kafka API and it is pleasant to work with,
- Kafka’s model of topics, partitions, offsets and consumer groups is well understood and suits many use cases (also outside the real-time streaming domain),
- it offers low latency and is highly reliable,
- we know that it can scale well.
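To make the second point concrete, here is a toy, in-memory sketch of the topic, partition, offset and consumer group model in Python. It is only an illustration under my own assumptions: the Topic and ConsumerGroup classes are hypothetical, and crc32 stands in for the hash a real Kafka client would use to pick a partition.

```python
from __future__ import annotations

from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy topic: a fixed number of append-only partition logs (hypothetical)."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> tuple[int, int]:
        # Records with the same key always land in the same partition,
        # which is what gives per-key ordering.
        partition = crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        # The offset is simply the record's position in its partition log.
        return partition, len(log) - 1

    def fetch(self, partition: int, offset: int, max_records: int = 10) -> list[str]:
        return self.partitions[partition][offset:offset + max_records]


class ConsumerGroup:
    """Tracks one committed offset per partition for a whole group."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.committed: dict[int, int] = defaultdict(int)

    def poll(self, partition: int) -> list[str]:
        # Read from the last committed offset, then advance it.
        records = self.topic.fetch(partition, self.committed[partition])
        self.committed[partition] += len(records)
        return records
```

Because the only per-group state is a committed offset per partition, a consumer can stop, restart and resume where it left off, or rewind by resetting that number.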
In reality though, I’ve seen companies struggle to keep their Kafka clusters up under heavy load. Retention policies were the go-to strategy for reducing storage pressure on the cluster, which made it more and more difficult to recover from unexpected and severe failures. If your data dumping infrastructure cannot keep up either, you are doomed to lose some data. Not to mention that not all real-time platforms are designed to process historical data from external, cold storage. This contradicts the third and fourth points.
On the passage of time
Kafka was designed a long time ago. The world ten years ago looked very different from how it looks right now. As a professional, it is difficult to keep up with the advancements in your niche, I get it. Unfortunately for us, no tech sector is slowing down. You might remember that Hadoop was designed to let a large number of not-so-powerful machines work together on a large-scale task. Kafka was born from similar principles.
Modern infrastructure, though, is way more flexible. Provisioning a machine with 96 vCPUs and tens of terabytes of SSD storage is just one click away (assuming that your wallet can handle it). We shouldn’t forget that the network is orders of magnitude slower than local alternatives. Disks got 1000x faster and internet connections got 100x faster, so single-machine performance is something that projects such as Redpanda don’t dare to overlook. Modern C++ helps developers squeeze out this performance. Memory locality, the thread-per-core model and async scheduling frameworks like Seastar are at the heart of projects like Redpanda and Scylla.
Advantages of Redpanda
Operational problems with Kafka usually start with the JVM. Large heaps are difficult to work with and cause all sorts of GC-related problems. Memory management in C++ is very different and better suited to applications that handle data at high velocity. Moreover, Redpanda bypasses the kernel page cache with direct I/O, and the team implemented their own memory management toolkit.
With its lock-free, thread-local model, Redpanda scales well both vertically and horizontally. All the code-level optimizations yield predictable and low latencies. The Redpanda team claims up to 10x better latency overall, with about 40x better p99 latency. In certain use cases a distributed message broker may affect the end-user experience, so lowering tail latency is an important factor.
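If p99 is an unfamiliar term: it is the latency that 99% of requests stay at or below. The sketch below computes it with the nearest-rank method in Python. The sample latencies are made up for illustration and are not measurements of Redpanda or Kafka.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical data: 98 fast requests and 2 slow ones (e.g. a GC pause).
latencies_ms = [2.0] * 98 + [250.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With this data the mean is 6.96 ms and the median is 2.0 ms, yet the p99 is 250.0 ms. Averages hide exactly the slow requests that users notice, which is why tail latency is quoted separately.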
Moreover, it supports shadow indexing. Strict retention policies that are often configured for Kafka topics make it impossible to query historical data directly from Kafka. Redpanda solves this with a strategy called shadow indexing. In short, it asynchronously dumps data to cheap, cold storage (such as GCS or S3) and lets users seamlessly read that data back when needed. Conceptually, you can query a Redpanda partition for any historical offset. It will determine which object to pull from object storage and retrieve the data for you.
Remote read replicas are a cool feature that is essentially a byproduct of shadow indexing. Imagine a scenario where your Kafka cluster struggles to handle the number of consumers trying to access topic data. We are familiar with read / write replicas for databases such as PostgreSQL or MongoDB. With tiered storage behind Redpanda, you can consume a topic’s data straight from object storage, effectively creating a read-only replica in another cluster. That’s exactly what remote read replicas are.
It also ships with a built-in schema registry, all in the same binary. And have I mentioned that you can use existing solutions such as Kafka Connect with Redpanda?
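The "determine which object to pull" step can be pictured as a binary search over a sorted list of uploaded segments. Below is a minimal Python sketch of that idea. The Segment fields, the object keys and the find_segment helper are my own assumptions for illustration, not Redpanda’s actual manifest format or API.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base_offset: int  # first offset stored in the segment
    last_offset: int  # last offset stored in the segment (inclusive)
    object_key: str   # hypothetical key of the object in the bucket


def find_segment(manifest: list[Segment], offset: int) -> Segment | None:
    """Return the uploaded segment that contains `offset`, if any.

    Assumes `manifest` is sorted by base_offset and segments do not overlap.
    """
    bases = [segment.base_offset for segment in manifest]
    # Index of the last segment whose base_offset is <= offset.
    index = bisect_right(bases, offset) - 1
    if index < 0:
        return None
    candidate = manifest[index]
    return candidate if offset <= candidate.last_offset else None


# Hypothetical manifest for a single partition.
manifest = [
    Segment(0, 999, "topic/0/0-999.log"),
    Segment(1000, 1999, "topic/0/1000-1999.log"),
    Segment(2000, 2499, "topic/0/2000-2499.log"),
]
```

Here find_segment(manifest, 1500) returns the second segment, while an offset of 2500 returns None, meaning the data has not been uploaded yet and would have to come from local disk.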
More features to come
They also use WASM for stream transformations, so that you don’t need to employ an external streaming engine for simple use cases such as PII data masking. WebAssembly lets developers write code in a wide range of languages and compile it to a portable format that can later be executed at near-native speed on a variety of platforms. Data transformations are an experimental feature at the moment.
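To give a feel for how small such a transformation can be, here is a PII masking function in plain Python. It is not written against Redpanda’s transform SDK. The mask_pii name, the flat JSON record shape and the simplified email pattern are all assumptions made for this sketch.

```python
import json
import re

# Deliberately simple email pattern, good enough for a demo only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(value: bytes) -> bytes:
    """Mask email addresses in every string field of a flat JSON object."""
    record = json.loads(value)
    masked = {
        field: EMAIL_RE.sub("***", content) if isinstance(content, str) else content
        for field, content in record.items()
    }
    return json.dumps(masked).encode()
```

A function like this takes one record value in and returns one record value out, which is the shape of work that fits a broker-side transform rather than a separate stream processing cluster.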
Assessing correctness
Has anybody tried to verify Redpanda’s claims? If you have never heard of Jepsen: over the years they have analyzed many data platforms, exposing data loss, skews, lock conflicts and more. They also analyzed Redpanda (note: the work was funded by the Redpanda team). Most of the problems they found are already fixed, and some turned out to be problems with the Kafka protocol itself. It is a bit ambiguous what Kafka transactions really are. Until I read this report I had never thought about it, but do you know what guarantees Kafka transactions offer and how they differ from what we know from the world of relational databases?
As with any relatively complex system, we usually don’t have full knowledge of its strengths and weaknesses. The most severe problem that is still unsolved is that certain cluster failures may lead to lost messages. The good news is that those messages are still present on the nodes; they just can’t be queried. Nonetheless, Redpanda might not yet handle all the edge cases of the distributed world. I would like to see more tests like this in the future, for both aspiring and well-established technologies.
Conclusion
I tried to show you what Redpanda is and how it compares to Kafka. From an ease-of-use perspective, it is my go-to choice for all personal projects (and my MacBook seems to agree that it is indeed better optimized than Kafka). Redpanda is an interesting project that brings a lot of great ideas to the streaming world. It provides useful improvements such as shadow indexing, remote read replicas and WebAssembly support.
As with any relatively new technology, it needs more attention and support from the community to fix its existing problems and become a serious competitor to Kafka. For me personally, Redpanda is a far more serious competitor to Kafka than Pulsar.