DATABERG
Kafka Streams: explained
What is Kafka?
Apache Kafka is a horizontally scalable, robust open-source messaging platform that has made great headway in the data processing community over the last few years.
Kafka relies on a producer-consumer model: through its client APIs you read and write the messages stored in topics, Kafka's named message categories. But this basic functionality still leaves a lot of work to the programmer, because it is a relatively low-level interface to the Kafka platform: you soon notice that you are repeating similarly structured supporting code from one application to another.
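To make that concrete, here is a minimal sketch of such low-level supporting code: a plain consumer loop built on the standard Kafka clients API. The broker address, topic name, group id, and the process() helper are all hypothetical placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Configuration, polling, deserialization, and error handling are all
// the programmer's responsibility -- and repeated in every application.
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-app");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("input-topic"));
    while (true) {
        // Fetch whatever has arrived and hand each record to the
        // (hypothetical) application logic.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            process(record.key(), record.value());
        }
    }
}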
This "excess fat" creates a natural desire to raise the abstraction level of the processing applications you build on Kafka, and luckily, there is help to be found:
Introduced in 2016, and fully mature since June 2017, the Kafka Streams client library is a game changer in the world of data. It lets the processing of data inside Kafka happen as part of a standard Java or Scala application, with no need to create a separate cluster for processing.
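As an illustration of that abstraction level, here is a minimal sketch of a complete Kafka Streams application that uppercases every message from one topic into another. The application id and topic names are assumptions; everything else is the standard Kafka Streams API.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");    // assumed id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// Describe the processing topology declaratively; partitioning, offsets,
// and fault tolerance are handled by the library.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("input-topic");     // assumed topic
source.mapValues(value -> value.toUpperCase()).to("output-topic");  // assumed topic

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

Scaling out is equally simple: start more instances of the same application, and the library redistributes the topic partitions across them.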
Benefits of Kafka Streams
Kafka Streams is elastic, highly scalable, and fault-tolerant, offering processing latency on the millisecond level. It works in exactly the same manner whether it runs in a container, a VM, the cloud, or on premises, and all three major platforms (Linux, macOS, Windows) are supported.
And as a technological breakthrough in the computing world, Kafka Streams is also known as the first stream processing library to provide an "exactly once" guarantee: the ability to execute a read-process-write cycle exactly one time, neither missing any input messages nor producing duplicate output messages.
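In configuration terms, turning this guarantee on amounts to a single setting on top of the properties shown in the earlier sketch; the constant below is the one used by newer client versions (3.0+), while older releases used StreamsConfig.EXACTLY_ONCE instead.

// Upgrade the default at-least-once behaviour to exactly-once
// processing (Kafka Streams 3.0+; older clients use EXACTLY_ONCE).
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);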
Kafka Streams is also a non-batching (non-buffering) system, meaning that it processes its streams one record at a time, yet it supports stateless, stateful, and windowed operations on data.
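A hedged sketch of what a stateful, windowed operation looks like: counting records per key over five-minute tumbling windows. The "clicks" topic is an assumption, the snippet plugs into the same configuration as the earlier sketch, and on clients older than 3.0, TimeWindows.of(...) replaces ofSizeWithNoGrace(...).

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("clicks")      // assumed input topic
       .groupByKey()                          // stateful: backed by a local state store
       .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
       .count()                               // one running count per key per window
       .toStream()
       .print(Printed.toSysOut());            // inspect the counts on stdout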
The stream processing code inside Kafka Streams becomes part of your application and takes care of all interactions with the Kafka cluster; the stream processing therefore does not execute on the Kafka brokers. From your point of view, Kafka Streams is just another JAR (the org.apache.kafka:kafka-streams artifact) that you add to your application, and the platform your application runs on directly determines the available processing power.
You can see Kafka as the system for organized management of your data streams, and Kafka Streams as the means to perform computational transformations on that data, relieving you of the worries of the internal interactions with the Kafka cluster. This simplification allows you to adapt Kafka to a wide variety of use cases, especially lighter-weight ones at the low end of the spectrum that would not justify a dedicated processing cluster.
Datumize and Kafka
Datumize has recently extended its Datumize Data Aggregator (DDA) application to run on Kafka Streams, often used in combination with the Datumize Data Collector (DDC) as a capture-compute node on the edge. This combination allows a lightweight setup for real-time data acquisition and processing, offering a flexible platform that can do considerable pre-processing and enrichment of real-time data sources before they are ingested into back-end systems and data lakes.
With the combined power of DDC and DDA, you can extract just the direct or derived data you need, reducing the data volume entering your back-end systems and thus cutting cost and bandwidth requirements.
And if your needs evolve, the data acquisition and processing configuration of DDC and DDA can be easily updated via the graphical user interface of Datumize Zentral.