
Learn how to install Apache Kafka on CentOS 8 with this step-by-step guide. It covers the prerequisites, installation commands, and configuration tips you need to get Kafka up and running.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming applications. Initially developed by LinkedIn and later open-sourced under the Apache Software Foundation, Kafka is designed to handle high-throughput, low-latency data processing.

Key Features of Apache Kafka:

  • High Throughput: Kafka is capable of handling millions of messages per second, making it suitable for large-scale data applications.
  • Scalability: It can be easily scaled horizontally by adding more brokers to distribute the load and ensure high availability.
  • Durability: Kafka persists messages on disk and replicates them across multiple brokers to ensure data durability and fault tolerance.
  • Fault Tolerance: The system is designed to be highly resilient, with the ability to recover from node failures seamlessly.
  • Real-Time Processing: Kafka allows real-time processing of streams of data, which is crucial for applications needing instant insights and actions.
  • Decoupling of Systems: Kafka decouples data producers from consumers, enabling independent scaling and evolution of different parts of a data architecture.
  • Multiple Consumers: A single stream of data can be consumed by multiple applications, allowing for versatile data processing pipelines.

Core Concepts:

  • Producers: Applications that send data to Kafka topics.
  • Consumers: Applications that read data from Kafka topics.
  • Brokers: Kafka servers that store data and serve client requests.
  • Topics: Categories or feeds to which records are published.
  • Partitions: Subdivisions of topics to allow parallel processing and scalability.
  • Replicas: Copies of partitions distributed across brokers to ensure fault tolerance.
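
To make the role of partitions more concrete, the sketch below maps a record key to a partition number by hashing the key and taking the remainder modulo the partition count. It is only an illustration: Kafka's default partitioner hashes keys with murmur2, whereas this sketch substitutes the CRC printed by `cksum` so that it runs on any Linux host, and the function name `partition_for` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: Kafka's default partitioner hashes keys with murmur2.
# cksum (a CRC) is used here as a stand-in so the sketch runs anywhere.
partition_for() {
    local key="$1" partitions="$2" crc
    # cksum prints "<crc> <byte-count>" for stdin; keep only the CRC.
    crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    echo $(( crc % partitions ))
}
```

Calling `partition_for user-42 3` prints a number between 0 and 2. The same key always maps to the same partition, which is why records that share a key keep their relative order.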

Common Use Cases:

  • Log Aggregation: Collecting and processing logs from various sources in a centralized manner.
  • Real-Time Analytics: Processing streams of data in real-time for instant insights and actions.
  • Data Integration: Facilitating the integration of different data systems by streaming data between them.
  • Event Sourcing: Capturing changes in the state of applications as a sequence of events.

Apache Kafka’s ability to handle large-scale, real-time data streams with high reliability makes it a popular choice for various industries, including finance, telecommunications, retail, and more. Its robust architecture and extensive ecosystem of tools and connectors make it a versatile and powerful platform for modern data-driven applications.


Apache Kafka vs Apache Flink

Apache Flink and Apache Kafka are both powerful tools for handling large-scale data, but they serve different purposes and are often used together in a complementary manner. Here’s a comparison of the two:

Apache Kafka

Purpose: Kafka is a distributed event streaming platform designed to handle high-throughput, real-time data feeds. It is primarily used for messaging, log aggregation, and real-time data pipelines.

Key Features:

  1. Message Broker: Kafka acts as a high-throughput, fault-tolerant message broker, allowing systems to publish and subscribe to streams of records.
  2. Durability and Fault Tolerance: Kafka ensures data durability by persisting messages on disk and replicating them across multiple brokers.
  3. Scalability: Kafka is designed to scale horizontally by adding more brokers to handle increased load.
  4. Low Latency: Kafka provides low-latency message delivery, suitable for real-time data processing.
  5. Decoupling Systems: Kafka decouples data producers from consumers, enabling independent scaling and evolution of different parts of a data architecture.
  6. Multiple Consumers: A single stream of data can be consumed by multiple applications for various use cases.

Common Use Cases:

  • Real-time analytics
  • Log aggregation
  • Stream processing pipelines
  • Event sourcing

Apache Flink

Purpose: Flink is a stream processing framework that excels at complex event processing and real-time analytics. It is designed for high-performance, low-latency stream and batch data processing.

Key Features:

  1. Stream Processing: Flink provides robust support for stateful stream processing, allowing for real-time data transformation and analytics.
  2. Low Latency: Flink is optimized for low-latency processing, enabling near-instantaneous analysis and actions on data streams.
  3. Fault Tolerance: Flink’s checkpointing mechanism ensures state consistency and recovery in case of failures.
  4. Scalability: Flink can scale horizontally to handle large volumes of data by adding more nodes to the cluster.
  5. Complex Event Processing: Flink supports complex event processing (CEP) with its powerful event pattern matching capabilities.
  6. Batch Processing: In addition to stream processing, Flink can handle batch data processing, making it versatile for various data workloads.

Common Use Cases:

  • Real-time data analytics
  • Stream and batch data processing
  • Complex event processing
  • Machine learning pipelines
  • Data enrichment

Comparison Summary

  • Functionality: Kafka is primarily a message broker and event streaming platform, while Flink is a stream processing framework designed for complex event processing and real-time analytics.
  • Use Cases: Kafka is used for data ingestion, buffering, and event storage, whereas Flink is used for processing and analyzing data streams in real-time.
  • Integration: Kafka and Flink are often used together, where Kafka handles data ingestion and Flink processes the ingested data in real-time.
  • Scalability and Fault Tolerance: Both systems are highly scalable and fault-tolerant, but they achieve this through different mechanisms tailored to their specific use cases.

In summary, Apache Kafka and Apache Flink serve distinct but complementary roles in a modern data architecture. Kafka is ideal for real-time data streaming and event storage, while Flink excels at processing and analyzing those streams in real-time. Using them together leverages the strengths of both platforms for building robust, scalable, and real-time data-driven applications.

Environment Specification

We are using a minimal CentOS 8 KVM machine with the following specifications.

  • CPU – 3.4 GHz (2 cores)
  • Memory – 2 GB
  • Storage – 20 GB
  • Operating System – CentOS 8.2
  • Hostname – kafka-01.centlinux.com
  • IP Address – 192.168.116.234/24


Update your Linux Operating System

Connect to kafka-01.centlinux.com as the root user with an SSH client.

Update the installed software packages on your Linux operating system. Because this guide uses CentOS Linux, you can use the dnf command for this purpose.

# dnf update -y

Check the Linux operating system and kernel versions that were used in this installation guide.

# uname -r
4.18.0-193.28.1.el8_2.x86_64

# cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)

Install Java on CentOS 8

Apache Kafka is written in Java and Scala and runs on the JVM, so it requires Java 8 or later.

OpenJDK 11 is available in the standard CentOS 8 repositories, so you can install it by executing the following command.

# dnf install -y java-11-openjdk
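
Once the package is installed, it can be useful to confirm that the `java` found on the PATH meets the Java 8 minimum. The following is a minimal sketch: the function names `java_major_version` and `check_java` are hypothetical, and the parsing assumes the usual `version "x.y.z"` layout on the first line of `java -version`.

```shell
#!/usr/bin/env bash
# Extract the major Java version from the first line of `java -version`.
# Handles the legacy "1.8.0_272" scheme as well as the modern "11.0.9" scheme.
java_major_version() {
    local line="$1" ver major
    ver=$(printf '%s\n' "$line" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    [ -n "$ver" ] || return 1
    major=${ver%%.*}
    if [ "$major" = "1" ]; then
        # Legacy scheme: the real major version is the second component.
        ver=${ver#1.}
        major=${ver%%.*}
    fi
    # Drop any non-numeric suffix such as "-ea".
    major=${major%%[!0-9]*}
    [ -n "$major" ] || return 1
    echo "$major"
}

# Succeed only if the java on the PATH reports version 8 or later.
check_java() {
    local first_line major
    # `java -version` writes to stderr, so redirect it into the pipe.
    first_line=$(java -version 2>&1 | head -n 1)
    major=$(java_major_version "$first_line") || return 1
    [ "$major" -ge 8 ]
}
```

For example, `check_java && echo "Java is recent enough for Kafka"` prints the message only when a suitable runtime is found.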

Install Apache Kafka on CentOS 8

Kafka is distributed under the Apache License 2.0, and you can download it from the official Apache Kafka website.


Copy the URL of your required version of Apache Kafka software from this webpage.

Use the copied URL with wget command to download the Apache Kafka software directly from Linux command line.

# cd /tmp
# wget https://downloads.apache.org/kafka/2.6.0/kafka_2.13-2.6.0.tgz
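
Before extracting the archive, it is worth checking the download against the SHA-512 digest published alongside it. The sketch below makes two assumptions: that the digest file sits next to the tarball with a `.sha512` suffix, and that it uses the multi-line `name: AAAA BBBB ...` layout with uppercase hex groups. The function names `published_digest` and `verify_sha512` are hypothetical.

```shell
#!/usr/bin/env bash
# Normalise a published .sha512 file to one bare lowercase hex digest.
# Assumes the "name: AAAA BBBB ..." multi-line layout; a plain
# "digest  filename" file could be checked with `sha512sum -c` instead.
published_digest() {
    sed 's/^[^:]*://' "$1" | tr -d ' \t\r\n' | tr 'A-F' 'a-f'
}

# Succeed only if the tarball's SHA-512 matches the published digest.
verify_sha512() {
    local tarball="$1" digest_file="$2" actual expected
    actual=$(sha512sum "$tarball" | cut -d' ' -f1)
    expected=$(published_digest "$digest_file")
    [ -n "$expected" ] && [ "$actual" = "$expected" ]
}
```

With both files in /tmp, `verify_sha512 kafka_2.13-2.6.0.tgz kafka_2.13-2.6.0.tgz.sha512 && echo "checksum OK"` prints the message only when the digests match.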

Extract the downloaded tarball with the tar command.

# tar xzf kafka_2.13-2.6.0.tgz

Now, move the extracted files to /opt/kafka directory.

# mv kafka_2.13-2.6.0 /opt/kafka

Install ZooKeeper on CentOS 8

Current versions of the Kafka server require the ZooKeeper service for cluster coordination and configuration. However, the Kafka documentation notes:

“Soon, ZooKeeper will no longer be required by Apache Kafka.”

For now, you must set up the Apache ZooKeeper service before the Kafka server.

The ZooKeeper scripts are bundled with the Kafka distribution, and you can use them to run a ZooKeeper server.

Create a systemd service unit for Apache Zookeeper.

# cd /opt/kafka/
# vi /etc/systemd/system/zookeeper.service

Add the following directives to this file.

[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/usr/bin/bash /opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/usr/bin/bash /opt/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

Create Systemd Service for Apache Kafka

Similarly, create a systemd service unit for Kafka server.

# vi /etc/systemd/system/kafka.service

Add the following directives to this file.

[Unit]
Description=Apache Kafka Server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/jre-11-openjdk"
ExecStart=/usr/bin/bash /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/usr/bin/bash /opt/kafka/bin/kafka-server-stop.sh

[Install]
WantedBy=multi-user.target

Enable and start Apache Zookeeper and Kafka services.

# systemctl daemon-reload
# systemctl enable --now zookeeper.service
Created symlink /etc/systemd/system/multi-user.target.wants/zookeeper.service → /etc/systemd/system/zookeeper.service.
# systemctl enable --now kafka.service
Created symlink /etc/systemd/system/multi-user.target.wants/kafka.service → /etc/systemd/system/kafka.service.

Verify the status of Apache Kafka service.

# systemctl status kafka.service
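
Besides the systemd status, you may want to confirm that both services are actually accepting TCP connections. The sketch below relies on bash's built-in `/dev/tcp` redirection, so it needs no extra packages. The function names `port_open` and `wait_for_port` are hypothetical, and port 2181 is assumed from the default `clientPort` in the bundled config/zookeeper.properties.

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host:port can be opened.
# The connection is opened in a subshell and closed when the subshell exits.
port_open() {
    local host="$1" port="$2"
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Poll once per second until the port opens or the retries run out.
wait_for_port() {
    local host="$1" port="$2" retries="${3:-30}"
    while [ "$retries" -gt 0 ]; do
        port_open "$host" "$port" && return 0
        retries=$(( retries - 1 ))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_port localhost 2181 && wait_for_port localhost 9092 && echo "ZooKeeper and Kafka are accepting connections"` waits up to 30 seconds for each service.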

Create a Topic in Apache Kafka Server

Create a topic in your Apache Kafka server.

# /opt/kafka/bin/kafka-topics.sh --create --topic centlinux --bootstrap-server localhost:9092
Created topic centlinux.

To view the details of the topic, run the following script at the Linux command line.

# /opt/kafka/bin/kafka-topics.sh --describe --topic centlinux --bootstrap-server localhost:9092
Topic: centlinux        PartitionCount: 1       ReplicationFactor: 1    Configs: segment.bytes=1073741824
        Topic: centlinux        Partition: 0    Leader: 0       Replicas: 0    Isr: 0
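
The topic above was created with the broker defaults, which is why the output shows one partition and a replication factor of 1. As a hedged sketch, the wrapper below passes both values explicitly through the `--partitions` and `--replication-factor` options of kafka-topics.sh. The function name `create_topic` and the `KAFKA_BIN` and `BOOTSTRAP_SERVER` variables are assumptions introduced for this example.

```shell
#!/usr/bin/env bash
# Location of the Kafka scripts and the broker address used in this guide.
KAFKA_BIN="${KAFKA_BIN:-/opt/kafka/bin}"
BOOTSTRAP_SERVER="${BOOTSTRAP_SERVER:-localhost:9092}"

# Create a topic with an explicit partition count and replication factor.
# On the single-broker setup in this guide the replication factor
# cannot exceed 1, because there is only one broker to hold replicas.
create_topic() {
    local topic="$1" partitions="${2:-1}" replication="${3:-1}"
    "${KAFKA_BIN}/kafka-topics.sh" --create \
        --topic "$topic" \
        --partitions "$partitions" \
        --replication-factor "$replication" \
        --bootstrap-server "$BOOTSTRAP_SERVER"
}
```

For example, `create_topic centlinux-events 3 1` (a hypothetical topic name) asks for three partitions, which you can then confirm with the `--describe` command shown above.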

Add some sample events to your topic.

# /opt/kafka/bin/kafka-console-producer.sh --topic centlinux --bootstrap-server localhost:9092
>This is the First event.
>This is the Second event.
>This is the Third event.
>^C#

To view all the events stored in a topic, run the following script at the Linux command line.

# /opt/kafka/bin/kafka-console-consumer.sh --topic centlinux --from-beginning --bootstrap-server localhost:9092
This is the First event.
This is the Second event.
This is the Third event.
^CProcessed a total of 3 messages
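
The interactive sessions above end with Ctrl+C, which is awkward in scripts. A possible non-interactive variant is sketched below: the producer reads events from standard input, and the consumer exits by itself through the `--max-messages` and `--timeout-ms` options. The function names `produce_lines` and `consume_events` and the `KAFKA_BIN` and `BOOTSTRAP_SERVER` variables are assumptions introduced for this example.

```shell
#!/usr/bin/env bash
# Location of the Kafka scripts and the broker address used in this guide.
KAFKA_BIN="${KAFKA_BIN:-/opt/kafka/bin}"
BOOTSTRAP_SERVER="${BOOTSTRAP_SERVER:-localhost:9092}"

# Publish every line read from stdin as one event on the given topic.
produce_lines() {
    "${KAFKA_BIN}/kafka-console-producer.sh" \
        --topic "$1" --bootstrap-server "$BOOTSTRAP_SERVER"
}

# Read a fixed number of events from the start of the topic, then exit.
# The third argument is how long to wait (in ms) when no event arrives.
consume_events() {
    "${KAFKA_BIN}/kafka-console-consumer.sh" \
        --topic "$1" --from-beginning --max-messages "$2" \
        --timeout-ms "${3:-10000}" --bootstrap-server "$BOOTSTRAP_SERVER"
}
```

For example, `printf 'First\nSecond\n' | produce_lines centlinux` sends two events, and `consume_events centlinux 2` reads two events back and returns without manual interruption.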

Apache Kafka is now installed on CentOS / RHEL 8, and the broker is listening on port 9092.

To improve your skills in this area, we recommend reading Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale, 1st Edition, published by O’Reilly Media.

Final Thoughts

Installing Apache Kafka on CentOS 8 gives you a working foundation for real-time data streaming. By following this guide, you should now have Kafka installed, running as a systemd service, and verified with a test topic.
