CASE STUDY

Migration from on-prem Kafka Clusters to AWS MSK

Siteimprove’s application transformation journey is split into various phases. The very first phase is to migrate their core platform applications from the data centers located in the US and Europe to corresponding AWS regions. Their team, in partnership with Cornerstone Consulting Group, devised a migration strategy centered around re-platforming and rehosting their applications to AWS.

The core platform has a loosely coupled, highly distributed architecture. They have achieved this by implementing Apache Kafka as the messaging platform, which processes more than 5000 messages/sec between its core services. One of the key workstreams was to migrate the Apache Kafka clusters from on-prem to the Amazon MSK service in the US and Europe regions.

  • All clusters were moved to AWS with zero downtime.
  • All producer and consumer applications were moved to AWS with zero downtime.
  • All topics were moved to AWS with zero downtime.

With a sharp eye on the business requirements and support from AWS, we successfully migrated all on-prem and self-managed AWS Apache Kafka clusters to Amazon MSK. This migration was a critical deliverable in the overall migration of their data center assets to AWS.

Post migration, the MM2 replication process and all Apache Kafka clusters (within the colos and self-managed in AWS) were decommissioned.

The Danish SaaS company is now positioned to leverage the benefits of running its applications in AWS, such as auto scaling, increased development agility, and a lower total cost of running the infrastructure.

The Challenge

Zero Downtime
Zero downtime for the producer and consumer applications.
No Data Loss
Ensure data durability and no data loss during the migration.
Minimal Code Mutation
Application code should undergo minimal to no changes.
Topic Names and Replication Factor
Topic names should remain the same between the source and target clusters. Replication factor per topic should be preserved.
Same or Higher Performance
Clusters on cloud should have same or higher performance than on-prem Kafka clusters.
With the requirements set, the technical team needed the capability to replicate topics in real time between the on-prem and AWS clusters.

Amazon MSK was selected as the target for this solution. The primary reasons were:

  • Eliminate the operational overhead of managing clusters, thereby reducing TCO.
  • Seamless application migration with no code changes.
  • Highly available and secure cluster provisioning within minutes with automatic cluster scaling.

The team also focused on the following points:

Producer/Consumer Dependency Mapping:

A dependency mapping chart was put together to identify the producers and consumers of each topic. This was a key exercise: it ensured that the producers of a topic were migrated only after all of that topic's consumers had been migrated to AWS, which prevented data loss.
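The ordering rule behind the dependency map can be sketched as a topological sort. This is a minimal illustration, not the team's actual tooling: the service and topic names are hypothetical, and the sketch assumes each service moves as a single unit, so a service that produces a topic waits for every other service that consumes it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def migration_order(topic_map):
    """Return services in an order where consumers move before producers.

    topic_map: {topic: {"producers": [...], "consumers": [...]}}
    Raises graphlib.CycleError if two services consume from each other,
    in which case no simple one-way ordering exists.
    """
    sorter = TopologicalSorter()
    for roles in topic_map.values():
        consumers = roles["consumers"]
        for consumer in consumers:
            sorter.add(consumer)
        for producer in roles["producers"]:
            # A producer may only move once every other consumer of the topic has moved.
            sorter.add(producer, *[c for c in consumers if c != producer])
    return list(sorter.static_order())


# Hypothetical dependency map for illustration only.
example = {
    "page-crawled": {"producers": ["crawler"], "consumers": ["indexer", "reporter"]},
    "index-updated": {"producers": ["indexer"], "consumers": ["reporter"]},
}
print(migration_order(example))
```

In this example the pure consumer moves first and the pure producer moves last; a cycle in the map would surface as an error and would need to be resolved by hand.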

AWS MSK Capacity Planning:

The workload on the on-prem Apache Kafka clusters was analyzed and the target state MSK architecture was developed taking the service limits into consideration.

Message Replication:

The order of migration was of critical importance to ensure that the customer experience was not impacted. This meant that some consumers were migrated to Amazon MSK, but producers remained on their existing Kafka cluster. Replication was key as the messages produced on the existing Kafka clusters still had to be replicated to Amazon MSK, so that the migrated consumers could consume those messages.

The following design principles were implemented to avoid any performance impact on the AWS Direct Connect link between the data center and AWS, or on the MSK cluster itself:

  • Topics were migrated in a controlled order so that the Kafka migration process did not consume all of the available Direct Connect bandwidth.
  • All topics on Amazon MSK were created with the same configuration as on the source clusters, including replication factor and number of partitions.
  • Consumer offsets were replicated so that consumers could resume processing from the next message when their services were started in AWS.

MirrorMaker 2 (MM2) was implemented for unidirectional message replication from the on-prem and self-managed AWS Apache Kafka clusters to Amazon MSK. The MM2 service was installed on the on-prem Kubernetes cluster, and all topics and messages were replicated to Amazon MSK.
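A one-way MM2 setup of this kind is usually described in a properties file. The sketch below renders such a file from Python so the settings can be inspected; it is an assumption-laden illustration, not the configuration used in this project. The cluster aliases `onprem` and `msk` and the bootstrap addresses are hypothetical. Because MM2 prefixes replicated topic names with the source alias by default, the sketch assumes `IdentityReplicationPolicy` (shipped with Apache Kafka 3.0 and later) to keep topic names unchanged, and `sync.group.offsets.enabled` (Kafka 2.7 and later) to carry consumer offsets across.

```python
def mm2_properties(source_bootstrap, target_bootstrap, replication_factor=3):
    """Render a one-way (onprem -> msk) MirrorMaker 2 properties file as text."""
    settings = {
        # Hypothetical cluster aliases.
        "clusters": "onprem, msk",
        "onprem.bootstrap.servers": source_bootstrap,
        "msk.bootstrap.servers": target_bootstrap,
        # Replicate in one direction only.
        "onprem->msk.enabled": "true",
        "msk->onprem.enabled": "false",
        # Replicate every topic and consumer group.
        "onprem->msk.topics": ".*",
        "onprem->msk.groups": ".*",
        # Assumption: keep topic names identical on the target (Kafka 3.0+).
        "replication.policy.class":
            "org.apache.kafka.connect.mirror.IdentityReplicationPolicy",
        # Replication factor for topics MM2 creates on the target.
        "replication.factor": str(replication_factor),
        # Translate and sync committed consumer offsets (Kafka 2.7+).
        "emit.checkpoints.enabled": "true",
        "sync.group.offsets.enabled": "true",
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


# Placeholder addresses for illustration only.
print(mm2_properties("kafka.onprem.example:9092", "broker.msk.example:9092"))
```

Throttling and topic-by-topic ordering are not covered here; in practice the `topics` pattern would be widened gradually rather than set to `.*` from the start.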

The Danish SaaS company leverages Datadog as the core monitoring and alerting solution across its infrastructure and application estate. There are two options to integrate Datadog with the AWS MSK service:

CloudWatch Crawler

Datadog can pull all CloudWatch metrics from the AWS account.

Datadog Agent

The Datadog Agent is software that runs on an EC2 instance. It collects events and metrics from a source, which in this case is the Amazon MSK cluster, and forwards them to Datadog. The Agent was used to collect monitoring metrics from the Amazon MSK JMX port (using Open Monitoring) to provide near real-time monitoring.

For the MSK implementation, an Auto Scaling group with pre-configured Datadog Agents was deployed across two AZs. The Agents collected metrics from the Amazon MSK Open Monitoring port and forwarded them to Datadog.

The on-prem cluster had 200 topics with 24 partitions per topic and a replication factor of three, resulting in 14,400 partition replicas (200*24*3). To accommodate these partitions in AWS without compromising the resiliency and durability of the cluster, a highly available solution was developed consisting of four kafka.m5.4xlarge broker nodes across two AZs. Because of the hard limit of 4,000 partitions per broker node on kafka.m5.4xlarge, four broker nodes was the minimum configuration (14,400/4,000 = 3.6, rounded up to four).
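The sizing arithmetic can be reproduced with a short calculation. This is a sketch under stated assumptions: the function name and parameters are illustrative, the per-broker limit is the 4,000 figure quoted above, and the final rounding step reflects the Amazon MSK requirement that the broker count be a multiple of the number of AZs.

```python
import math


def min_brokers(topics, partitions_per_topic, replication_factor,
                max_partitions_per_broker, az_count):
    """Return (total partition replicas, minimum broker count)."""
    total = topics * partitions_per_topic * replication_factor
    # Enough brokers to stay under the per-broker partition limit.
    brokers = math.ceil(total / max_partitions_per_broker)
    # Round up so brokers spread evenly across the AZs.
    brokers = math.ceil(brokers / az_count) * az_count
    return total, brokers


total, brokers = min_brokers(200, 24, 3, 4000, az_count=2)
print(total, brokers)  # 14400 4
```

The result is a floor set by the partition limit alone; throughput and storage needs from the capacity planning exercise could still call for more brokers.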
