How Bazaarvoice modernized their Apache Kafka infrastructure with Amazon MSK

January 20, 2026

This is a guest post by Oleh Khoruzhenko, Senior Staff DevOps Engineer at Bazaarvoice, in partnership with AWS.

Bazaarvoice is an Austin-based company powering a world-leading reviews and ratings platform. Our system processes billions of consumer interactions through ratings, reviews, images, and videos, helping brands and retailers build shopper confidence and drive sales by using authentic user-generated content (UGC) across the customer journey. The Bazaarvoice Trust Mark is the gold standard in authenticity.

Apache Kafka is one of the core components of our infrastructure, enabling real-time data streaming for the global review platform. Although Kafka’s distributed architecture met our needs for high-throughput, fault-tolerant streaming, self-managing this complex system diverted critical engineering resources away from our core product development. Each component of our Kafka infrastructure required specialized expertise, ranging from configuring low-level parameters to maintaining the complex distributed systems that our customers rely on. The dynamic nature of our environment demanded continuous care and investment in automation. We found ourselves constantly managing upgrades, applying security patches, implementing fixes, and addressing scaling needs as our data volumes grew.

In this post, we show you the steps we took to migrate our workloads from self-hosted Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK). We walk you through our migration process and highlight the improvements we achieved after this transition. We show how we minimized operational overhead, enhanced our security and compliance posture, automated key processes, and built a more resilient platform while maintaining the high performance our global customer base expects.

The need for modernization

As our platform grew to process billions of daily consumer interactions, we needed to find a way to scale our Kafka clusters efficiently while maintaining a small team to manage the infrastructure. The limitations of self-managed Kafka clusters manifested in several key areas:

  • Scaling operations – Although scaling our self-hosted Kafka clusters wasn’t inherently complex, it required careful planning and execution. Each time we needed to add new brokers to handle increased workload, our team faced a multi-step process involving capacity planning, infrastructure provisioning, and configuration updates.
  • Configuration complexity – Kafka offers hundreds of configuration parameters. Although we didn’t actively manage all of these, understanding their impact was important. Key settings like I/O threads, memory buffers, and retention policies needed ongoing attention as we scaled (the sample broker configuration after this list illustrates the kinds of parameters involved). Even minor adjustments could have significant downstream effects, requiring our team to maintain deep expertise in these parameters and their interactions to ensure optimal performance and stability.
  • Infrastructure management and capacity planning – Self-hosting Kafka required us to manage multiple scaling dimensions, including compute, memory, network throughput, storage throughput, and storage volume. We needed to carefully plan capacity for all these components, often making complex trade-offs. Beyond capacity planning, we were responsible for real-time management of our Kafka infrastructure. This included promptly detecting and addressing component failures and performance issues. Our team needed to be highly responsive to alerts, often requiring immediate action to maintain system stability.
  • Specialized expertise requirements – Operating Kafka at scale demanded deep technical expertise across multiple domains. The team needed to:
    • Monitor and analyze hundreds of performance metrics
    • Conduct complex root cause analysis for performance issues
    • Manage ZooKeeper ensemble coordination
    • Execute rolling updates for zero-downtime upgrades and security patches
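
The following excerpt shows the kinds of broker-level settings that needed this ongoing attention. The values are illustrative placeholders rather than our production tuning:

```
# server.properties (illustrative excerpt)

# Threads handling network requests and disk I/O
num.network.threads=8
num.io.threads=16

# Socket buffer sizes
socket.receive.buffer.bytes=102400
socket.send.buffer.bytes=102400

# Default retention and segment size, overridden per topic where needed
log.retention.hours=168
log.segment.bytes=1073741824

# Replication throughput vs. broker load trade-off
num.replica.fetchers=4
```

Each of these settings interacts with instance type, storage throughput, and topic-level overrides, which is why even small changes needed careful review.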

These challenges were compounded during peak business periods, such as Black Friday and Cyber Monday, when maintaining optimal performance was essential for Bazaarvoice’s retail customers.

Choosing Amazon MSK

After evaluating various options, we selected Amazon MSK as our modernization solution. The decision was driven by the service’s ability to minimize operational overhead, provide high availability out of the box with its three Availability Zone architecture, and offer seamless integration with our existing AWS infrastructure.

The following key capabilities made Amazon MSK the clear choice:

  • AWS integration – We already used AWS services for data processing and analytics. Amazon MSK connected directly with these services, alleviating the need to build and maintain custom integrations. This meant our existing data pipelines would continue working with minimal changes.
  • Automated operations management – Amazon MSK automated our most time-consuming tasks. We no longer need to manually monitor instances and storage for failures or respond to these issues ourselves.
  • Enterprise-grade reliability – The platform’s architecture matched our reliability requirements out of the box. Multi-AZ distribution and built-in replication gave us the same fault tolerance we’d carefully built into our self-hosted system, now backed by AWS’s service guarantees.
  • Simplified upgrade process – Before Amazon MSK, version upgrades for our Kafka clusters required careful planning and execution. The process was complex, involving multiple steps and risks. Amazon MSK simplified our upgrade operations. We now use automated upgrades for dev and test workloads and maintain control over production environments. This shift reduced the need for extensive planning sessions and multiple engineers. As a result, we stay current with the latest Kafka versions and security patches, improving our system reliability and performance.
  • Enhanced security controls – Our platform required ISO 27001 compliance, which typically involved months of documentation and security controls implementation. Amazon MSK came with this certification built-in, alleviating the need for separate compliance work. Amazon MSK encrypted our data, controlled network access, and integrated with our existing security tools.

With Amazon MSK selected as our target platform, we began planning the complex task of migrating our critical streaming infrastructure without disrupting the billions of consumer interactions flowing through our system.

Bazaarvoice’s migration journey

Moving our complex Kafka infrastructure to Amazon MSK required careful planning and precise execution. Our platform processes data through two main components: an Apache Kafka Streams pipeline that handles data processing and augmentation, and client applications that move this enriched data to downstream systems. With 40 TB of state across 250 internal topics, this migration demanded a methodical approach.

Planning phase

Working with AWS Solutions Architects proved critical for validating our migration strategy. Our platform’s unique characteristics required special consideration:

  • Multi-Region deployment across the US and EU
  • Complex stateful applications with strict data consistency needs
  • Vital business services requiring zero downtime
  • Diverse consumer ecosystem with different migration requirements

Migration challenges

The biggest hurdle was migrating our stateful Kafka Streams applications. Our data processing runs as a directed acyclic graph (DAG) of applications across Regions, using static group membership to prevent disruptive rebalancing. Kafka Streams keeps its state in internal Kafka topics, so for applications to recover properly, that state must be replicated accurately; the sketch after the following list shows how both of these characteristics appear in a typical application. This added significant complexity to our migration. We initially considered MirrorMaker2, the standard tool for Kafka migrations, but two fundamental limitations made it challenging for our stateful applications:

  • Risk of losing state or incorrectly replicating state across our applications.
  • Inability to run two instances of our applications simultaneously, which meant we needed to shut down the main application and wait for it to recover from the state in the MSK cluster. Given the size of our state, this recovery process exceeded our 30-minute SLA for downtime.
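
To make this concrete, the following is a minimal Kafka Streams sketch in the spirit of one stage of our pipeline. The application ID, topics, and endpoint are hypothetical placeholders; the two details that mattered for the migration are the static group membership setting and the fact that the keyed aggregation is backed by an internal changelog topic on the cluster:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class EnrichmentStage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "review-enrichment");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Static group membership: each instance keeps a fixed ID, so planned
        // restarts don't trigger a disruptive rebalance of the stateful tasks.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
                "enrichment-instance-1");                                       // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> reviews = builder.stream("raw-reviews");        // placeholder topic
        // The keyed aggregation is backed by a local state store plus an internal
        // changelog topic on the Kafka cluster. That changelog is the state that
        // has to be rebuilt (rehydrated) when the application is pointed at a
        // new cluster.
        reviews.groupByKey()
                .count()
                .toStream()
                .to("review-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the changelog topics live on the source cluster, mirroring only the user-facing topics doesn’t move an application like this; its state has to be rebuilt on the target cluster, which is what drove the approach described next.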

Our solution

We decided to deploy a parallel stack of Kafka Streams applications reading from and writing to Amazon MSK. This approach gave us sufficient time for testing and verification, and enabled the applications to hydrate their state before we delivered the output to our data warehouse for analytics. We still used MirrorMaker2 to replicate the input topics (a sample configuration follows this list), and the parallel-stack approach offered several advantages, along with one trade-off:

  • Simplified monitoring of the replication process
  • Avoided consistency issues between state stores and internal topics
  • Allowed for gradual, controlled migration of consumers
  • Enabled thorough validation before cutover
  • Required a coordinated transition plan for all consumers (the one trade-off), because we couldn’t transfer consumer offsets across clusters
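
A MirrorMaker2 setup for this kind of input topic replication looks roughly like the following sketch. The cluster aliases, endpoints, and topic names are placeholders, and the identity replication policy shown (which requires Apache Kafka 3.0 or later) is one way to keep topic names unchanged on the target:

```
# mm2.properties (illustrative)
clusters = legacy, msk
legacy.bootstrap.servers = legacy-broker-1:9092,legacy-broker-2:9092
msk.bootstrap.servers = b-1.example.kafka.us-east-1.amazonaws.com:9098

# Replicate only the input topics; the parallel Kafka Streams stack rebuilds
# its own internal and changelog topics on the MSK side
legacy->msk.enabled = true
legacy->msk.topics = raw-reviews, raw-ratings
msk->legacy.enabled = false

# Keep topic names identical on the target (instead of the default
# "legacy." prefix) so applications can cut over without renaming topics
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
```

Running a configuration like this through the connect-mirror-maker.sh script keeps the raw input topics flowing into Amazon MSK while the parallel application stack rebuilds its state there.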

Consumer migration strategy

Each consumer type required a carefully tailored approach:

  • Standard consumers – For applications using the standard Kafka consumer group protocol, we implemented a four-step migration (a sample cutover configuration appears at the end of this section). This approach risked some duplicate processing, but our applications were designed to handle that scenario. The steps were as follows:
    • Configure consumers with auto.offset.reset: latest.
    • Stop all DAG producers.
    • Wait for existing consumers to process remaining messages.
    • Cut over consumer applications to Amazon MSK.
  • Apache Kafka Connect Sinks – Our sink connectors served two critical databases:
    • A distributed search and analytics engine – Document versioning depended on Kafka record offsets, making direct migration impossible. To address this, we built new search engine clusters from scratch.
    • A document-oriented NoSQL database – This supported direct migration without requiring new database instances, simplifying the process significantly.
  • Apache Spark and Flink applications – These presented unique challenges due to their internal checkpointing mechanisms:
    • Offsets managed outside Kafka’s consumer groups
    • Checkpoints incompatible between source and target clusters
    • Required complete data reprocessing from the beginning

We scheduled these migrations during off-peak hours to minimize impact.
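
As a sketch of steps 1 and 4 for a standard consumer, the relevant client settings look like the following; the endpoint and group name are placeholders:

```
# Consumer overrides for the cutover (illustrative)
bootstrap.servers=b-1.example.kafka.us-east-1.amazonaws.com:9098
group.id=downstream-sync
# Start from the latest offsets on the new cluster; earlier records were
# already processed against the legacy cluster before the DAG producers
# were stopped
auto.offset.reset=latest
```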

Technical benefits and improvements

Moving to Amazon MSK fundamentally changed how we manage our Kafka infrastructure. The transformation is best illustrated by comparing key operational tasks before and after the migration, summarized in the following table.

| Activity | Before: Self-Hosted Kafka | After: Amazon MSK |
| --- | --- | --- |
| Security patching | Required dedicated team time for Kafka and OS updates | Fully automated |
| Broker recovery | Needed manual monitoring and intervention | Fully automated |
| Client authentication | Complex password rotation procedures | AWS Identity and Access Management (IAM) |
| Version upgrades | Complex procedure requiring extensive planning | Fully automated |

The details of the tasks are as follows:

  • Security patching – Previously, our team spent 8 hours monthly applying Kafka and operating system (OS) security patches across our broker fleet. Amazon MSK now handles these updates automatically, maintaining our security posture without engineering intervention.
  • Broker recovery – Although our self-hosted Kafka had automatic recovery capabilities, each incident required careful monitoring and occasional manual intervention. With Amazon MSK, node failures and storage degradation issues such as Amazon Elastic Block Store (Amazon EBS) slowdowns are handled entirely by AWS and resolved within minutes without our involvement.
  • Authentication management – Our self-hosted implementation required password rotations for SASL/SCRAM authentication, a process that took two engineers several days to coordinate. The direct integration between Amazon MSK and AWS Identity and Access Management (IAM) minimized this overhead while strengthening our security controls (the client configuration sketch after this list shows the extent of the change).
  • Version upgrades – Kafka version upgrades in our self-hosted environment required weeks of planning and testing as well as weekend maintenance windows. Amazon MSK manages these upgrades automatically during off-peak hours, maintaining our SLAs without disruption.
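
For a typical Java client, switching from SASL/SCRAM to IAM authentication is largely a matter of a few client properties plus the aws-msk-iam-auth library on the classpath; credentials and rotation are then handled through IAM roles rather than stored passwords:

```
# Client properties for IAM authentication with Amazon MSK
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

Access control then moves to IAM policies on the cluster, topics, and consumer groups instead of per-cluster credential rotation.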

These improvements proved especially valuable during high-traffic periods like Black Friday, when our team previously needed extensive operational readiness plans. Now, the built-in resiliency of Amazon MSK provides us with reliable Kafka clusters that serve as mission-critical infrastructure for our business. The migration made it possible to break our monolithic clusters into smaller, dedicated MSK clusters. This improved our data isolation, provided better resource allocation, and enhanced performance predictability for high-priority workloads.

Lessons learned

Our migration to Amazon MSK revealed several key insights that can help other organizations modernize their Kafka infrastructure:

  • Expert validation – Working with AWS Solutions Architects to validate our migration strategy caught several critical issues early. Although our team knew our applications well, external Kafka experts identified potential problems with state management and consumer offset handling that we hadn’t considered. This validation prevented costly missteps during the migration.
  • Data verification – Comparing data across Kafka clusters proved challenging. We built tools to capture topic snapshots in Parquet format on Amazon Simple Storage Service (Amazon S3), enabling quick comparisons using Amazon Athena queries (a sketch of this kind of comparison follows this list). This approach gave us confidence that data remained consistent throughout the migration.
  • Start small – Beginning with our smallest data universe in QA helped us refine our process. Each subsequent migration went smoother as we applied lessons from previous iterations. This gradual approach helped us maintain system stability while building team confidence.
  • Detailed planning – We created specific migration plans with each team, considering their unique requirements and constraints. For example, our machine learning pipeline needed special handling due to strict offset management requirements. This granular planning prevented downstream disruptions.
  • Performance optimization – We found that utilizing Amazon MSK provisioned throughput offered clear cost advantages when storage throughput became a bottleneck. This feature made it possible to improve cluster performance without scaling instance sizes or adding brokers, providing a more efficient solution to our throughput challenges.
  • Documentation – Maintaining detailed migration runbooks proved invaluable. When we encountered similar issues across different migrations, having documented solutions saved significant troubleshooting time.
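
To illustrate the verification step, the following is a minimal sketch of kicking off one such comparison with the AWS SDK for Java, assuming the Parquet snapshots on Amazon S3 have been registered as Athena tables. The database, table, column, and bucket names are hypothetical, and our actual comparison queries were more involved:

```java
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionResponse;

public class SnapshotComparison {
    public static void main(String[] args) {
        // Records present in the legacy snapshot but missing from the MSK snapshot
        String query =
                "SELECT record_key, record_hash FROM legacy_topic_snapshot "
                + "EXCEPT "
                + "SELECT record_key, record_hash FROM msk_topic_snapshot";

        try (AthenaClient athena = AthenaClient.create()) {
            StartQueryExecutionResponse response = athena.startQueryExecution(
                    StartQueryExecutionRequest.builder()
                            .queryString(query)
                            .queryExecutionContext(QueryExecutionContext.builder()
                                    .database("kafka_migration_snapshots")                 // hypothetical database
                                    .build())
                            .resultConfiguration(ResultConfiguration.builder()
                                    .outputLocation("s3://example-bucket/athena-results/") // hypothetical bucket
                                    .build())
                            .build());
            System.out.println("Started Athena query: " + response.queryExecutionId());
        }
    }
}
```

An empty result in both directions is a quick signal that the two snapshots match.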

Conclusion

In this post, we showed you how we modernized our Kafka infrastructure by migrating to Amazon MSK. We walked through our decision-making process, challenges faced, and strategies employed. Our journey transformed Kafka operations from a resource-intensive, self-managed infrastructure to a streamlined, managed service, improving operational efficiency, platform reliability, and team productivity. For enterprises managing self-hosted Kafka infrastructure, our experience demonstrates that successful transformation is achievable with proper planning and execution. As data streaming needs grow, modernizing infrastructure becomes a strategic imperative for maintaining competitive advantage.

For more information, visit the Amazon MSK product page, and explore the comprehensive Developer Guide to learn about the features available to help you build scalable and reliable streaming data applications on AWS.

About the authors