AWS Kafka Explained: Understanding AWS MSK, the Managed Kafka Service

Introduction
In today's data-driven world, the ability to process, analyze, and react to data in real time is more critical than ever. This is where Apache Kafka shines. Originally developed by LinkedIn and now maintained by the Apache Software Foundation, Apache Kafka is a powerful open-source platform designed for building real-time data pipelines and streaming applications. It's engineered to handle high-throughput, low-latency data processing, making it a go-to solution for many organizations looking to harness the power of real-time data.
Why Use AWS Kafka? Key Use Cases
AWS Kafka is a versatile tool, and its use cases span across various industries and applications. Here are a few scenarios where Kafka is particularly effective:
- Real-Time Analytics: AWS Kafka allows you to process and analyze streams of data in real time, enabling businesses to gain insights and take actions immediately. Whether it’s monitoring user behavior on a website or analyzing financial transactions, Kafka provides the low-latency data processing needed for real-time analytics.
- Event Sourcing: In event-driven architectures, Kafka is often used to capture and log all changes to an application's state as a sequence of immutable events. This makes it easier to rebuild the state of an application at any point in time, which is crucial for auditability and debugging.
- Log Aggregation: Kafka can be used to collect logs from different systems and make them available in a central location for monitoring and analysis. This is particularly useful in large distributed systems where logs are generated across multiple servers and services.
- Data Integration: Kafka serves as a robust backbone for integrating data across different systems. By streaming data from one system to another in real time, Kafka ensures that data is consistently updated and synchronised across all parts of an organization.
However, while Kafka's capabilities are vast, managing the underlying infrastructure and scaling it to meet business demands can be complex and time-consuming. Enter Amazon Managed Streaming for Apache Kafka (MSK), a fully managed service that takes the burden of managing Kafka off your shoulders. AWS MSK simplifies the deployment, management, and scaling of Kafka clusters, allowing you to focus on building applications that analyze and act on data in real time, without worrying about the underlying infrastructure.
Amazon MSK: A Managed Kafka Service

Amazon MSK simplifies the deployment and operation of Apache Kafka clusters on AWS by automating many of the manual tasks associated with running Kafka, such as hardware provisioning, software patching, and cluster scaling. MSK also removes the complexity of managing cluster metadata and leader election: whether you use older Kafka versions that rely on ZooKeeper or newer versions with embedded KRaft, AWS handles these critical aspects for you. This allows you to focus on building and optimizing your data streams without the burden of managing the underlying infrastructure, which can be particularly challenging with self-managed open-source Kafka.
AWS offers two deployment options for MSK, each with its own pricing model: MSK Provisioned and MSK Serverless. Understanding the differences between them is essential for optimizing your deployment and managing expenses effectively.
Breaking Down the Differences: MSK Provisioned vs MSK Serverless

When choosing between Provisioned and Serverless AWS Kafka, it's important to understand the key differences in how they operate and the benefits they offer:
Control vs. Simplicity:
- MSK Provisioned: Offers more control over your Kafka cluster's infrastructure, allowing you to customize instance types, storage, and networking configurations. This option is ideal for users with specific performance requirements or those who need to optimize resource usage closely.
- MSK Serverless: Prioritises ease of use by abstracting away infrastructure management. AWS automatically handles provisioning, scaling, and managing Kafka clusters, making it perfect for teams that want to focus on application development rather than infrastructure.
Scaling Capabilities:
- MSK Provisioned: Requires manual scaling of broker instances and storage. You need to monitor and adjust resources based on your workload, which gives you fine-tuned control but also demands more management effort.
- MSK Serverless: Automatically scales based on the throughput of your data streams. Storage and compute resources adjust dynamically, providing a seamless experience without the need for manual intervention.
Pricing Overview:
- MSK Provisioned: Pricing is based on the number and type of broker instances and the storage used, plus additional charges such as data transfer. This model can be more cost-effective if you have predictable workloads.
- MSK Serverless: Charges are based on the volume of data processed, the number of partitions, and the amount of storage used. This model is more flexible and can be cost-efficient for variable workloads, as you only pay for what you actually use.
Use Cases:
- MSK Provisioned: Suited for environments where predictability and control are crucial, such as large-scale enterprise deployments with specific compliance or performance needs.
- MSK Serverless: Best for applications that need to quickly scale up or down without the overhead of managing infrastructure, such as startups, development environments, or applications with unpredictable traffic patterns.
By understanding these differences, you can choose the MSK option that best aligns with your workload requirements and operational goals.
How AWS Charges for MSK
MSK Provisioned:
Broker Instance Hourly Charge:
- Based on instance type (e.g., kafka.m5.large at $0.21 per hour).
Storage Costs:
- Standard storage: $0.10 per GB-month.
- Low-cost storage: $0.06 per GB-month.
- Data retrieval from low-cost storage: $0.0015 per GB.
- Additional storage throughput (optional): $0.08 per MB/s-month.
Data Transfer Charges:
- Standard AWS data transfer charges apply for data transferred in and out of MSK clusters.
Multi-VPC Private Connectivity (Optional):
- Hourly rate per cluster and authentication scheme: $0.02250 per hour.
- Data processed through private connectivity: $0.00600 per GB.
MSK Provisioned Pricing Example:
Suppose you run three Kafka brokers in the US East (Ohio) region (us-east-2), each on a kafka.m5.large instance. Here’s how the costs break down:
Broker Instance Costs:
- Each kafka.m5.large instance costs $0.21 per hour.
- For three brokers, this totals $0.63 per hour.
Storage Costs:
Assume each broker uses 1,000 GB of storage.
- Standard storage costs $0.10 per GB-month, so 3,000 GB would cost $300 per month.
Total Monthly Costs: $753.60 per month
- Broker instances: $0.63 per hour × 24 hours × 30 days = $453.60
- Storage: $300 per month
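The arithmetic above can be sketched as a small Python helper, which makes it easy to try other broker counts or storage sizes. The function name and parameters are illustrative, the rates are the ones quoted in this article, and a 30-day month is assumed, as in the example. Data transfer and optional features are left out.

```python
def provisioned_monthly_cost(
    brokers: int,
    broker_hourly_usd: float,
    storage_gb_per_broker: float,
    storage_usd_per_gb_month: float,
    hours_per_month: int = 24 * 30,  # assumption: 30-day month, as in the example
) -> dict:
    """Rough monthly estimate for MSK Provisioned: broker hours plus standard storage."""
    broker_cost = brokers * broker_hourly_usd * hours_per_month
    storage_cost = brokers * storage_gb_per_broker * storage_usd_per_gb_month
    return {
        "brokers": round(broker_cost, 2),
        "storage": round(storage_cost, 2),
        "total": round(broker_cost + storage_cost, 2),
    }


# Inputs taken from the example: 3 x kafka.m5.large, 1,000 GB per broker.
provisioned_estimate = provisioned_monthly_cost(
    brokers=3,
    broker_hourly_usd=0.21,
    storage_gb_per_broker=1_000,
    storage_usd_per_gb_month=0.10,
)
print(provisioned_estimate)
```

With the example inputs, the helper should reproduce the broker, storage, and total figures listed above.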
MSK Serverless:
Cluster and Partition Costs:
- Hourly rate for clusters: $0.75 per cluster-hour.
- Hourly rate for each partition: $0.0015 per partition-hour.
Data Transfer Costs:
- Data In: $0.10 per GB.
- Data Out: $0.05 per GB.
Storage Costs:
- Storage retained: $0.10 per GB-month.
MSK Serverless Pricing Example:
Now, consider a scenario where you use MSK Serverless with the following conditions:
Cluster Costs:
Assume the cluster is active for 720 hours in a month.
- The hourly rate is $0.75 per cluster-hour.
- Total: $0.75 × 720 hours = $540 per month
Partition Costs:
Assume you have 10 partitions.
- The hourly rate per partition is $0.0015.
- Total: $0.0015 × 10 partitions × 720 hours = $10.80 per month
Storage Costs:
Suppose you store 1,000 GB of data.
- Storage costs $0.10 per GB-month, totaling $100 per month.
Data Transfer Costs:
Assume 2,000 GB of data is transferred out of the cluster and 1,000 GB of data is transferred into the cluster.
- The cost for data transfer out is $0.05 per GB, resulting in a total of $0.05 × 2,000 GB = $100 per month.
- The cost for data transfer in is $0.10 per GB, resulting in a total of $0.10 × 1,000 GB = $100 per month.
Total Data Transfer Cost: $100 (out) + $100 (in) = $200 per month.
Total Monthly Costs:
- Cluster: $540 per month
- Partitions: $10.80 per month
- Storage: $100 per month
- Data transfer: $200 per month
Total: $850.80 per month
The pricing examples use rates for the US East (Ohio) region (us-east-2).
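The Serverless example can be expressed the same way. The sketch below is a minimal illustration, not an official calculator: the function and parameter names are made up for this article, and the default rates are the ones quoted above.

```python
def serverless_monthly_cost(
    partitions: int,
    storage_gb: float,
    data_in_gb: float,
    data_out_gb: float,
    hours: int = 720,                       # assumption: cluster active all month
    cluster_hourly_usd: float = 0.75,
    partition_hourly_usd: float = 0.0015,
    storage_usd_per_gb_month: float = 0.10,
    data_in_usd_per_gb: float = 0.10,
    data_out_usd_per_gb: float = 0.05,
) -> dict:
    """Rough monthly estimate for MSK Serverless using the rates quoted in this article."""
    cluster = cluster_hourly_usd * hours
    partition = partition_hourly_usd * partitions * hours
    storage = storage_usd_per_gb_month * storage_gb
    transfer = data_in_usd_per_gb * data_in_gb + data_out_usd_per_gb * data_out_gb
    return {
        "cluster": round(cluster, 2),
        "partitions": round(partition, 2),
        "storage": round(storage, 2),
        "data_transfer": round(transfer, 2),
        "total": round(cluster + partition + storage + transfer, 2),
    }


# Inputs taken from the example: 10 partitions, 1,000 GB stored, 1,000 GB in, 2,000 GB out.
serverless_estimate = serverless_monthly_cost(
    partitions=10, storage_gb=1_000, data_in_gb=1_000, data_out_gb=2_000
)
print(serverless_estimate)
```

Because the cluster-hour charge dominates at this scale, changing the partition count or data volume in the call is a quick way to see which inputs actually move the total.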
Cost Saving Opportunities for MSK
Managing costs effectively is crucial when operating with AWS MSK. Here are some strategies to save on costs:
Utilise Low-Cost Storage:
- Opt for low-cost storage ($0.06 per GB-month) for data that doesn't require frequent access, significantly reducing your storage costs.
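To get a feel for the saving, the following sketch compares keeping everything on standard storage with moving part of it to low-cost storage. The rates are the ones listed earlier in this article. The split between tiers and the retrieved volume are hypothetical inputs chosen purely for illustration.

```python
STANDARD_USD_PER_GB_MONTH = 0.10   # rate quoted in this article
LOW_COST_USD_PER_GB_MONTH = 0.06   # rate quoted in this article
RETRIEVAL_USD_PER_GB = 0.0015      # rate quoted in this article


def tiered_storage_cost(standard_gb: float, low_cost_gb: float, retrieved_gb: float = 0.0) -> float:
    """Monthly storage cost when data is split between standard and low-cost storage."""
    cost = (
        standard_gb * STANDARD_USD_PER_GB_MONTH
        + low_cost_gb * LOW_COST_USD_PER_GB_MONTH
        + retrieved_gb * RETRIEVAL_USD_PER_GB
    )
    return round(cost, 2)


# Hypothetical scenario: 3,000 GB in total, 2,000 GB of it rarely read,
# and 500 GB retrieved from low-cost storage during the month.
all_standard = tiered_storage_cost(standard_gb=3_000, low_cost_gb=0)
tiered = tiered_storage_cost(standard_gb=1_000, low_cost_gb=2_000, retrieved_gb=500)
print(f"all standard: ${all_standard}, tiered: ${tiered}, saving: ${round(all_standard - tiered, 2)}")
```

The retrieval charge is small next to the per-GB saving, so the result depends mostly on how much data can live in the low-cost tier.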
Optimise Data Transfer:
- Be mindful of data transfer in and out of MSK clusters to minimize data transfer charges.
Right-size Your Clusters:
- Regularly monitor and adjust the number of broker instances based on your workload needs. Refer to the AWS documentation to rightsize your cluster according to the partition count. If the number of partitions per broker exceeds the recommended value, your cluster may become overloaded.
- The AWS documentation table lists each broker size alongside its recommended number of partitions (including leader and follower replicas) per broker.
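A quick way to apply the right-sizing advice is to count partition replicas per broker and compare the result with the recommended value for your broker size. The sketch below assumes replicas are spread evenly across brokers. The topic names are invented, and `RECOMMENDED_PER_BROKER` is a placeholder to replace with the figure from the AWS documentation for your instance type.

```python
import math

# Placeholder: replace with the recommended value from the AWS documentation
# for your broker size. The number here is only for illustration.
RECOMMENDED_PER_BROKER = 1000


def replicas_per_broker(partitions_per_topic: dict, replication_factor: int, brokers: int) -> float:
    """Average partition replicas (leaders + followers) per broker, assuming an even spread."""
    total_replicas = sum(partitions_per_topic.values()) * replication_factor
    return total_replicas / brokers


def brokers_needed(partitions_per_topic: dict, replication_factor: int, limit_per_broker: int) -> int:
    """Smallest broker count that keeps the average at or below the per-broker limit."""
    total_replicas = sum(partitions_per_topic.values()) * replication_factor
    return math.ceil(total_replicas / limit_per_broker)


# Hypothetical topics and partition counts.
topics = {"orders": 600, "payments": 300, "clicks": 900}

current_load = replicas_per_broker(topics, replication_factor=3, brokers=3)
minimum_brokers = brokers_needed(topics, replication_factor=3, limit_per_broker=RECOMMENDED_PER_BROKER)

print(f"replicas per broker: {current_load:.0f} (limit {RECOMMENDED_PER_BROKER})")
print(f"brokers needed to stay within the limit: {minimum_brokers}")
```

In this made-up scenario the three-broker cluster carries well over the placeholder limit, which suggests adding brokers, moving to a larger broker size, or reducing partition counts.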

Leverage Multi-VPC Connectivity Wisely:
- Use Multi-VPC private connectivity only when necessary and ensure efficient data processing to avoid unnecessary costs.
MSK Best Practices
Running MSK efficiently involves adhering to best practices to ensure optimal performance and cost management:
Monitoring and Metrics:
- Utilize the four levels of monitoring available in MSK, with the DEFAULT level being free. Key metrics to monitor include BytesInPerSec, BytesOutPerSec, CpuIdle, KafkaAppLogsDiskUsed, and MemoryUsed.
Monitor Storage:
- Use automatic scaling: MSK's storage auto-scaling expands capacity automatically as your data grows, which reduces manual management and helps prevent disruptions caused by full disks.
- Manage storage costs and prevent disruptions by reducing the message retention period or log size in Kafka. This limits the amount of data kept on disk and minimizes the risk of running out of storage.
- Delete unused topics to free up storage space.
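To see how much the retention period affects storage, a rough estimate is ingress rate × retention time × replication factor. The sketch below applies that approximation. It ignores compression, index files, and uneven partition sizes, uses 1 GB = 1,000 MB, and the workload numbers are hypothetical. In Kafka itself, retention is controlled per topic through settings such as `retention.ms` and `retention.bytes`.

```python
def estimated_storage_gb(ingress_mb_per_sec: float, retention_hours: float, replication_factor: int) -> float:
    """Approximate cluster-wide disk usage for data retained by Kafka.

    Rough model: every ingested byte is kept for the full retention period
    and stored once per replica. Uses 1 GB = 1,000 MB.
    """
    retained_mb = ingress_mb_per_sec * 3600 * retention_hours * replication_factor
    return retained_mb / 1000


# Hypothetical workload: 5 MB/s of ingress with a replication factor of 3.
seven_days = estimated_storage_gb(5, retention_hours=7 * 24, replication_factor=3)
one_day = estimated_storage_gb(5, retention_hours=24, replication_factor=3)

print(f"7-day retention: ~{seven_days:,.0f} GB")
print(f"1-day retention: ~{one_day:,.0f} GB")
```

Under this model storage scales linearly with retention, so cutting retention from seven days to one reduces the estimate to one seventh.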
Broker Management:
- AWS Kafka supports adding and removing brokers from Kafka clusters. For optimal performance, remove one broker per Availability Zone in a single broker removal operation. This feature is supported for Kafka versions 2.8.1 and above.
Heap Memory Management:
- Monitor the HeapMemoryAfterGC metric and set up CloudWatch alarms to take action when it exceeds 60%.
- Heap memory is where the Kafka broker stores in-memory data, such as messages, metadata, and other temporary data structures. If heap usage exceeds its limits, brokers can crash and latency can increase; in a misconfigured cluster, data loss is also possible.
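As an illustration of the alarm described above, the sketch below builds the parameters for a CloudWatch alarm on HeapMemoryAfterGC at the 60% threshold. Several details are assumptions to verify against your own account: the `AWS/Kafka` namespace, the `Cluster Name` and `Broker ID` dimension names, the alarm naming scheme, and the evaluation window. The helper only builds a dictionary; the actual call through boto3 is shown as a comment because it needs AWS credentials.

```python
def heap_alarm_params(cluster_name: str, broker_id: int, threshold_percent: float = 60.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm on HeapMemoryAfterGC.

    Assumptions (check in the CloudWatch console): namespace "AWS/Kafka" and
    dimensions "Cluster Name" / "Broker ID". The alarm name is a made-up convention.
    """
    return {
        "AlarmName": f"{cluster_name}-broker-{broker_id}-heap-after-gc",
        "Namespace": "AWS/Kafka",
        "MetricName": "HeapMemoryAfterGC",
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "Statistic": "Maximum",
        "Period": 300,                 # evaluate in 5-minute windows (illustrative choice)
        "EvaluationPeriods": 3,        # require 3 consecutive breaches (illustrative choice)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarm_params = heap_alarm_params("demo-cluster", broker_id=1)
print(alarm_params["AlarmName"], alarm_params["Threshold"])

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Creating one alarm per broker keeps the signal specific, and an alarm action (for example an SNS topic) can be added to the parameters to notify the team.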
Client Connections:
- Keep track of the ClientConnectionCount to effectively manage the number of active authenticated client connections.
AWS MSK (Kafka) vs SQS vs EventBridge vs Kinesis: Choosing the Right Tool

At first glance, AWS MSK (Kafka), Amazon SQS, Amazon EventBridge, and Amazon Kinesis may seem similar: they all deal with messaging, events, and data streaming. However, each service is designed for a different purpose, and understanding these distinctions is critical when deciding which one to use. Let’s look at the strengths of each service, using a specific use case for each to illustrate the differences.
AWS MSK (Kafka)
Use Case: Real-time fraud detection in financial services.
AWS MSK shines in scenarios requiring real-time data streaming with high throughput and low latency, such as monitoring financial transactions to detect fraud as it occurs. Its ability to handle complex event processing and analytics makes it indispensable in environments where immediate data processing is critical.
Amazon SQS
Use Case: Decoupling microservices in an e-commerce platform.
Amazon SQS is perfect for managing the communication between microservices in a distributed e-commerce system. By queuing orders and processing them asynchronously, SQS ensures that no orders are lost even if downstream services are temporarily unavailable, making it a reliable choice for message queuing.
Amazon EventBridge
Use Case: Automating operational tasks based on system events.
Amazon EventBridge is ideal for triggering automated actions when specific events occur, such as creating a support ticket when an application error is logged. Its ability to seamlessly integrate with other AWS services and third-party applications makes it a powerful tool for building event-driven architectures.
Amazon Kinesis
Use Case: Real-time data analytics for IoT sensor data.
Amazon Kinesis and AWS MSK (Kafka) are both designed to handle real-time data streaming and analytics, making them powerful tools for scenarios requiring immediate insights. However, while they share similar goals, they differ significantly in their approach and capabilities.
Similarities:
Both Kinesis and MSK are built to ingest and process large volumes of real-time data. They excel in scenarios where low latency and high throughput are critical, such as monitoring and analyzing streams of data from IoT sensors in a smart factory. Both services enable businesses to act on data in real-time, ensuring timely and informed decision-making.
Differences:
The key difference lies in their management and integration. Amazon Kinesis offers a serverless, fully managed experience, with tight integration into the AWS ecosystem, making it easier to set up, scale, and manage. It abstracts away the complexities of infrastructure management, automatically scaling to handle fluctuations in data volume, which makes it a seamless choice for those who prefer a hands-off approach.
On the other hand, AWS MSK (Kafka) provides more control and flexibility, particularly for users who need to fine-tune performance or integrate with existing Kafka applications. MSK is well-suited for complex event processing and environments where precise control over data flow and infrastructure is required, but it also involves more hands-on management compared to Kinesis.
Here is the table for you to better understand the differences between them:

Conclusion
AWS MSK provides a robust and managed environment for deploying and running Apache Kafka, empowering organizations to build and operate real-time data processing applications with ease. By gaining a clear understanding of the cost structure, implementing best practices, and leveraging the right pricing model, you can optimize your MSK deployments to meet the demands of both development and production environments.
