Debezium Blog

Kafka Streams is a library for developing stream processing applications based on Apache Kafka. Quoting its docs, "a Kafka Streams application processes record streams through a topology in real-time, processing data continuously, concurrently, and in a record-by-record manner". The Kafka Streams DSL provides a range of stream processing operations such as map, filter, join, and aggregate.
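
To give a feel for the DSL, here is a minimal sketch of such a topology; the topic names "input" and "output", the String-typed records, and the default serdes are illustrative assumptions:

    import org.apache.kafka.streams.StreamsBuilder;

    StreamsBuilder builder = new StreamsBuilder();

    // Read a topic as a record stream, drop null values, transform each
    // record's value, and write the results to another topic
    builder.<String, String>stream("input")
           .filter((key, value) -> value != null)
           .mapValues(value -> value.toUpperCase())
           .to("output");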

Non-Key Joins in Kafka Streams

Debezium’s CDC source connectors make it easy to capture data changes in databases and push them towards sink systems such as Elasticsearch in near real-time. By default, this results in a 1:1 relationship between tables in the source database, the corresponding Kafka topics, and a representation of the data at the sink side, such as a search index in Elasticsearch.

In case of 1:n relationships, say between a table of customers and a table of addresses, consumers are often interested in a view of the data that is a single, nested data structure, e.g. a single Elasticsearch document representing a customer and all their addresses.

This is where KIP-213 ("Kafka Improvement Proposal") and its foreign key joining capabilities come in: it was introduced in Apache Kafka 2.4 "to close the gap between the semantics of KTables in streams and tables in relational databases". Before KIP-213, in order to join messages from two Debezium change event topics, you’d typically have to manually re-key at least one of the topics, so as to make sure the same key is used on both sides of the join.

Thanks to KIP-213, this isn’t needed any longer: it allows you to join two Kafka topics on fields extracted from the Kafka message value, taking care of the required re-keying automatically, in a fully transparent way. Compared to previous approaches, this drastically reduces the effort of creating aggregated events from Debezium’s CDC events.
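
Here is a minimal sketch of such a foreign key join with the Streams DSL; the topic names and the Customer, Address, and AddressWithCustomer types are illustrative assumptions:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();

    // Change event topics, each keyed by the primary key of its source table
    // (default serdes assumed for brevity)
    KTable<Long, Customer> customers = builder.table("dbserver1.inventory.customers");
    KTable<Long, Address> addresses = builder.table("dbserver1.inventory.addresses");

    // Foreign key join (KIP-213): the second argument extracts the customer id
    // from each address value; Kafka Streams handles the re-keying internally
    KTable<Long, AddressWithCustomer> enriched = addresses.join(
            customers,
            address -> address.getCustomerId(),
            (address, customer) -> new AddressWithCustomer(address, customer));

Note that the result is still keyed by the address id; to obtain a single nested document per customer, the joined table can subsequently be grouped by the customer id and aggregated.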

Non-Key Joins in Kafka Streams

Apache Kafka 2.4 introduced the foreign key join feature (KIP-213), in order "to close the gap between the semantics of KTables in streams and tables in relational databases". Before KIP-213, in order to join messages from two Debezium change event topics, you’d have to manually re-key at least one of the topics, so as to make sure the same key is used on both sides of the join. Thanks to KIP-213, this isn’t needed any longer: it makes relational data captured from databases far easier to use, smoothing the transition to natively built event-driven services, and it resolves the problem of out-of-order processing due to foreign key changes in tables. Here is an interesting post by Hans-Peter Grahsl and Gunnar Morling on creating aggregated events from Debezium’s CDC events.

Debezium’s CDC source connectors make it easy to capture data changes in databases and push them towards Elasticsearch in near real-time. This results in a 1:1 relationship between tables in the source database and a corresponding search index in Elasticsearch. In case of a 1:n relationship, say two tables with different primary keys and a foreign key relationship, Debezium provides the message.key.columns option. By choosing that foreign key column as the message key for change events, the two table streams can be joined without the need for re-keying the topic manually. The message.key.columns option can be used as follows:

message.key.columns=dbserver1.inventory.customerdetails:CustomerId
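
With the key aligned on both sides, a plain join is sufficient. Here is a minimal sketch; the topic names and the Customer, CustomerDetails, and EnrichedCustomer types are illustrative assumptions:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();

    // Both topics are keyed by CustomerId now, thanks to message.key.columns
    KTable<Long, Customer> customers = builder.table("dbserver1.inventory.customers");
    KStream<Long, CustomerDetails> details = builder.stream("dbserver1.inventory.customerdetails");

    // Join the detail records directly against the customers table;
    // no manual re-keying step is required
    KStream<Long, EnrichedCustomer> enriched = details.join(
            customers,
            (detail, customer) -> new EnrichedCustomer(customer, detail));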

One of the typical Debezium use cases is to use change data capture to integrate a legacy system with other systems in the organization. There are multiple ways to achieve this goal:

  • Write data to Kafka using Debezium and follow with a combination of Kafka Streams pipelines and Kafka Connect connectors to deliver the changes to other systems

  • Use the Debezium embedded engine in a standalone Java application and write the integration code using plain Java; that’s often used to send change events to alternative messaging infrastructure such as Amazon Kinesis, Google Pub/Sub etc. (see the sketch after this list)

  • Use an existing integration framework or service bus to express the pipeline logic
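
For the second option, here is a minimal sketch of the embedded engine approach, assuming Debezium’s DebeziumEngine API; the connector class and configuration values are illustrative:

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;

    Properties props = new Properties();
    props.setProperty("name", "engine");
    props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
    props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
    props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
    // database connection settings omitted for brevity

    // The callback receives each change event as JSON and can forward it to
    // Kinesis, Google Pub/Sub, or any other downstream system
    DebeziumEngine<ChangeEvent<String, String>> engine =
            DebeziumEngine.create(Json.class)
                    .using(props)
                    .notifying(record -> System.out.println(record.value()))
                    .build();

    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.execute(engine);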

This article focuses on the third option, a dedicated integration framework.

Release early, release often! After the 1.1 Beta1 and 1.0.1 Final releases earlier this week, I’m happy today to share the news about the release of Debezium 1.1.0.Beta2!

The main addition in Beta2 is support for integration tests of your change data capture (CDC) set-up using Testcontainers. In addition, the Quarkus extension for implementing the outbox pattern as well as the SMT for extracting the after state of change events have been re-worked and offer more configuration flexibility now.
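
Here is a minimal sketch of what such a test setup can look like, assuming Debezium’s Testcontainers integration; image names, versions, and configuration values are illustrative:

    import java.util.stream.Stream;

    import org.testcontainers.containers.KafkaContainer;
    import org.testcontainers.containers.Network;
    import org.testcontainers.containers.PostgreSQLContainer;
    import org.testcontainers.lifecycle.Startables;

    import io.debezium.testing.testcontainers.ConnectorConfiguration;
    import io.debezium.testing.testcontainers.DebeziumContainer;

    Network network = Network.newNetwork();

    KafkaContainer kafka = new KafkaContainer()
            .withNetwork(network);

    PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("debezium/postgres:11")
            .withNetwork(network)
            .withNetworkAliases("postgres");

    DebeziumContainer connect = new DebeziumContainer("debezium/connect:1.1.0.Beta2")
            .withNetwork(network)
            .withKafka(kafka);

    // Start all containers, then register a connector for the Postgres database
    Startables.deepStart(Stream.of(kafka, postgres, connect)).join();

    connect.registerConnector("my-connector",
            ConnectorConfiguration.forJdbcContainer(postgres)
                    .with("database.server.name", "dbserver1"));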

This article is a dive into the realms of Event Sourcing, Command Query Responsibility Segregation (CQRS), Change Data Capture (CDC), and the Outbox Pattern. It aims to bring much-needed clarity to the value of these solutions. Additionally, two differing designs will be explained in detail, along with the pros and cons of each.

So why do all these solutions even matter? They matter because many teams are building microservices and distributing data across multiple data stores. One system of microservices might involve relational databases, object stores, in-memory caches, and even searchable indexes of data. Data can quickly become lost, out of sync, or even corrupted, with potentially disastrous consequences for mission-critical systems.

Solutions that help avoid these serious problems are of paramount importance for many organizations. Unfortunately, many vital solutions are somewhat difficult to understand; Event Sourcing, CQRS, CDC, and Outbox are no exception. Please look at these solutions as an opportunity to learn and understand how they could apply to your specific use cases.

As you will find out at the end of this article, I will propose that three of these four solutions have high value, while the other should be discouraged except in the rarest of circumstances. The advice given in this article should be evaluated against your specific needs, because, in some cases, none of these four solutions would be a good fit.

Outbox as in that folder in my email client? No, not exactly but there are some similarities!

The term outbox describes a pattern that allows independent components or services to achieve "read your own writes" semantics while concurrently providing a reliable, eventually consistent view of those writes across component or service boundaries.

You can read more about the Outbox pattern and how it applies to microservices in our blog post, Reliable Microservices Data Exchange With the Outbox Pattern.

So what exactly is an Outbox Event Router?

In Debezium version 0.9.3.Final, we introduced a ready-to-use Single Message Transform (SMT) that builds on the Outbox pattern to propagate data change events using Debezium and Kafka. Please see the documentation for details on how to use this transformation.
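
As a rough sketch, enabling the transformation on a Debezium connector comes down to configuration along these lines; the transform alias "outbox" is an illustrative choice:

    transforms=outbox
    transforms.outbox.type=io.debezium.transforms.outbox.EventRouter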

Last week’s announcement of Quarkus sparked a great amount of interest in the Java community: crafted from best-of-breed Java libraries and standards, it allows you to build Kubernetes-native applications based on GraalVM & OpenJDK HotSpot. In this blog post we are going to demonstrate how a Quarkus-based microservice can consume Debezium’s data change events via Apache Kafka. For that purpose, we’ll see what it takes to convert the shipment microservice from our recent post about the outbox pattern into a Quarkus-based service.
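
As a preview, consuming the change events in such a service boils down to something like the following sketch, assuming SmallRye Reactive Messaging in Quarkus; the channel name "shipments" and the message handling are illustrative:

    import java.util.concurrent.CompletionStage;

    import javax.enterprise.context.ApplicationScoped;

    import org.eclipse.microprofile.reactive.messaging.Incoming;

    import io.smallrye.reactive.messaging.kafka.KafkaRecord;

    @ApplicationScoped
    public class ShipmentEventConsumer {

        // The "shipments" channel is bound to the Kafka topic with the change
        // events via the application configuration
        @Incoming("shipments")
        public CompletionStage<Void> onMessage(KafkaRecord<String, String> message) {
            // Process the change event, e.g. by updating local shipment state
            System.out.println("Received event: " + message.getPayload());
            return message.ack();
        }
    }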