Debezium Blog
At ScyllaDB, we develop Scylla, a high-performance NoSQL database that is API-compatible with Apache Cassandra, Amazon DynamoDB, and Redis. Earlier this year, we introduced support for Change Data Capture in Scylla 4.3. This new feature seemed like a perfect match for integration with the Apache Kafka ecosystem, so we developed the Scylla CDC Source Connector using the Debezium framework. In this blog post we will cover the basic structure of Scylla’s CDC, the reasons we chose the Debezium framework, and the design decisions we made.
It’s my pleasure to announce the second release of the Debezium 1.7 series, 1.7.0.Beta1!
This release brings NATS Streaming support for Debezium Server along with many other fixes and enhancements. It is also the first release tested with Apache Kafka 2.8.
When you are working with Kafka Connect in distributed mode, you might have realized that once you start Kafka Connect, some internal Kafka Connect related topics are already created for you:
$ kafka-topics.sh --bootstrap-server $HOSTNAME:9092 --list
connect_configs
connect_offsets
connect_statuses
This is done automatically for you by Kafka Connect, with a sane default topic configuration tailored to the needs of these internal topics.
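Should you need different names or settings for these internal topics, they can be overridden in the Connect worker configuration. A minimal sketch of the relevant worker properties, assuming a three-broker cluster (the topic names and counts here are just examples):
config.storage.topic = connect_configs
config.storage.replication.factor = 3
offset.storage.topic = connect_offsets
offset.storage.replication.factor = 3
offset.storage.partitions = 25
status.storage.topic = connect_statuses
status.storage.replication.factor = 3
status.storage.partitions = 5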
When you start a Debezium connector, the topics for the captured events are created by the Kafka broker based on a default, possibly customized, configuration, provided that auto.create.topics.enable = true is set in the broker config:
auto.create.topics.enable = true
default.replication.factor = 1
num.partitions = 1
compression.type = producer
log.cleanup.policy = delete
log.retention.ms = 604800000 ## 7 days
But often, when you use Debezium and Kafka in a production environment, you might choose to disable Kafka’s topic auto creation capability with auto.create.topics.enable = false, or you want the connector topics to be configured differently from the default. In that case, you have to create the topics for Debezium’s captured data sources upfront, as sketched below.
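One way to do that is with the kafka-topics.sh tool shipped with Kafka. The topic name and settings below are purely illustrative; Debezium derives the actual topic names from the logical server name and the captured tables:
$ kafka-topics.sh --bootstrap-server $HOSTNAME:9092 --create \
    --topic dbserver1.inventory.customers \
    --partitions 1 \
    --replication-factor 3 \
    --config cleanup.policy=delete \
    --config retention.ms=604800000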
But there’s good news! Beginning with Kafka Connect version 2.6.0, this can be automated: KIP-158 has been implemented to enable customizable topic creation with Kafka Connect.
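With KIP-158 in place, topic creation rules can be added directly to the connector configuration. A minimal sketch of the default rule group, assuming topic.creation.enable = true on the Connect worker (which is the default); the values themselves are just examples:
topic.creation.default.replication.factor = 3
topic.creation.default.partitions = 10
topic.creation.default.cleanup.policy = delete
topic.creation.default.compression.type = lz4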
Although Debezium makes it easy to capture database changes and record them in Kafka, one of the more important decisions you have to make is how those change events will be serialized in Kafka. Every message in Kafka has a key and a value, and to Kafka these are opaque byte arrays. But when you set up Kafka Connect, you have to say how the Debezium event keys and values should be serialized to a binary form, and your consumers will also have to deserialize them back into a usable form.
Debezium event keys and values are both structured, so JSON is certainly a reasonable option: it’s flexible, ubiquitous, and language agnostic, but on the other hand it’s quite verbose. One alternative is Avro, which is equally flexible and language agnostic, but faster and with smaller binary representations. Using Avro requires a bit more setup effort on your part and some additional software, but the advantages are often worth it.
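If you opt for Avro, the converters are typically configured in the Kafka Connect worker config. A sketch using Confluent’s Avro converter together with a Schema Registry (the registry URL is a placeholder for your own deployment):
key.converter = io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url = http://schema-registry:8081
value.converter = io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url = http://schema-registry:8081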
Update (Oct. 11 2019): An alternative, and much simpler, approach for running Debezium (and Apache Kafka and Kafka Connect in general) on Kubernetes is to use a K8s operator such as Strimzi. You can find instructions for the set-up of Debezium on OpenShift here, and similar steps apply for plain Kubernetes.
Our Debezium Tutorial walks you step by step through using Debezium by installing, starting, and linking together all of the Docker containers running on a single host machine. Of course, you can use things like Docker Compose or your own scripts to make this easier, although that would just automate running all the containers on a single machine. What you really want is to run the containers on a cluster of machines. In this blog, we’ll run Debezium using a container cluster manager from Red Hat and Google called Kubernetes.
Kubernetes is a container (Docker/Rocket/Hyper.sh) cluster management tool. Like many other popular cluster management and compute resource scheduling platforms, Kubernetes' roots are in Google, who is no stranger to running containers at scale. They start, stop, and cluster 2 billion containers per week and they contributed a lot of the Linux kernel underpinnings that make containers possible. One of their famous papers talks about an internal cluster manager named Borg. With Kubernetes, Google got tired of everyone implementing their papers in Java so they decided to implement this one themselves :)
Kubernetes is written in Go and is quickly becoming the de facto API for scheduling, managing, and clustering containers at scale. This blog isn’t intended to be a primer on Kubernetes, so we recommend heading over to the Getting Started docs to learn more about Kubernetes.