By Mark Bittmann, Partner & Lead Data Scientist at B23 LLC, a professional services company that implements Big Data and cloud computing solutions
In a recent blog post, ‘Stream processing, Event sourcing, Reactive, CEP… and making sense of it all’, Martin Kleppmann laid out the reasons for, and advantages of, an immutable, event-based streaming data architecture. A few of our customers are currently migrating to this model. ACID transactions remain important to a modern data platform, and many applications still write to and read from a mutable database as the single source of truth for the enterprise. A change to a database table means “something happened,” and when something happens, other applications might want to know about it. Those consumers include data warehouses, operational analytics platforms, and machine learning models.
We recently needed to pull change events from MySQL binary logs and push them to Apache Kafka. Several tools exist for parsing the MySQL binary log files, and several of them pipe directly to Kafka. These tools manage binlog offsets and table schemas externally, and are much more operationally friendly than tailing the binlog files directly. Our search centered on Maxwell and mypipe; we found that Maxwell was very easy to use, and Zendesk runs it in production. (For PostgreSQL, there is an equivalent tool called Bottled Water.) You just point these tools at your binlog-enabled database, and you have a streaming event log on a Kafka topic.
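To make the "streaming event log" concrete, here is roughly what a single row-change event looks like once Maxwell publishes it to Kafka. The field names (database, table, type, ts, data, old) follow Maxwell's documented JSON output; the table and values below are made up for illustration:

```python
import json

# A hypothetical row-update event in Maxwell's JSON format.
# Field names match Maxwell's output; the values are invented.
raw_event = """
{
  "database": "shop",
  "table": "orders",
  "type": "update",
  "ts": 1460000000,
  "data": {"id": 42, "status": "shipped"},
  "old": {"status": "pending"}
}
"""

event = json.loads(raw_event)

# A downstream consumer can react to the change without ever
# querying MySQL: the event says what happened, where, and when.
print(f"{event['database']}.{event['table']} {event['type']}: "
      f"row {event['data']['id']} changed columns {list(event['old'])}")
```

The `old` object carries only the columns that changed, which is what lets consumers distinguish "something happened" from the full row state.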
Once the data is on Kafka, you have tremendous opportunity for a complex data workflow, such as real-time processing with Storm, Flink, or Spark Streaming, or writing to destination endpoints such as Elasticsearch, HDFS, Cassandra, or S3. There has been major progress recently on data workflow tools, including Apache NiFi, StreamSets, and Kafka Connect (all open source!). We chose StreamSets for our data pipelines.
All of the binlog replicator tools — Bottled Water, mypipe, Maxwell — are daemon utilities. Our initial solution treated Kafka as the first “source” of incoming data in the pipeline, with Maxwell running outside the scope of the pipeline. Since this is one component of a larger data workflow, we wanted a centralized platform for all in-motion event data; the real data “source” was the binlog. Wouldn’t it be great to manage the binlog replicator with a data pipeline tool as well? We thought so too, so we implemented a custom StreamSets “Origin” for Maxwell and extended Maxwell with a custom “Producer” that emits the data as StreamSets records rather than producing to Kafka.
In most cases, we will still send the binlog events directly to a Kafka topic, but using StreamSets for that orchestration brings a lot of flexibility and capability. We now have standardized configuration and monitoring of our Maxwell processes along with the rest of the data pipeline, such as a Kafka consumer that writes to both S3 and Elasticsearch. We also have much more flexibility in choosing destinations. While Maxwell can partition a single Kafka topic based on database or table name, we can now configure a StreamSets pipeline to dynamically emit to a different topic based on the table name in the event log. We can also build more complex pipelines, such as logic that detects a schema change and issues an alert accordingly.
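The per-table routing described above amounts to computing a destination topic name from fields already present in each event. A minimal sketch of that logic in plain Python (the `<prefix>.<database>.<table>` naming convention is our assumption for illustration, not a Maxwell or StreamSets default):

```python
import json

def topic_for(event: dict, prefix: str = "binlog") -> str:
    """Derive a per-table Kafka topic name from a Maxwell event.

    The '<prefix>.<database>.<table>' convention is an assumption
    chosen for this example.
    """
    return f"{prefix}.{event['database']}.{event['table']}"

event = json.loads(
    '{"database": "shop", "table": "orders", "type": "insert", "data": {"id": 7}}'
)
print(topic_for(event))  # binlog.shop.orders
```

In StreamSets itself this kind of routing is expressed declaratively in the pipeline configuration rather than in code, but the computation is the same: the destination is a function of the record, not a fixed setting.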
Lastly, we can standardize on a data format at the source. Maxwell emits JSON to Kafka, and even its creators are still debating whether Avro would be a better choice (mypipe and Bottled Water use Avro). A data pipeline tool lets you decouple the data format at the source from the format at each destination.
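Decoupling source and destination formats works because the pipeline sees structured records rather than raw bytes: the source format is parsed once, and each destination serializes however it likes. A toy sketch, flattening a Maxwell JSON event into a single-level row such as a columnar warehouse might want (the `_database`/`_table`/`_op` column names are our own convention):

```python
import json

def flatten(event: dict) -> dict:
    """Flatten a Maxwell change event into a single-level record,
    e.g. for a columnar destination. Metadata column names are our
    own convention for this example."""
    row = {
        "_database": event["database"],
        "_table": event["table"],
        "_op": event["type"],
    }
    row.update(event.get("data", {}))
    return row

event = json.loads(
    '{"database": "shop", "table": "orders", "type": "insert",'
    ' "data": {"id": 7, "total": 19.5}}'
)
print(flatten(event))
```

The same in-flight record could just as easily be serialized to Avro for one destination and left as JSON for another, which is exactly the flexibility the pipeline tool buys.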
The streaming event log is becoming a critical part of the enterprise data platform. The more we can commoditize the transport of events onto and off of the distributed log, the faster we can innovate on the data.