
Karafka Ecosystem Status

About This Page

This page tracks external incidents and ecosystem-level disruptions that may affect Karafka, WaterDrop, and Web UI users. This is not a Karafka uptime status page, but a curated log of upstream issues, dependency problems, and third-party service disruptions relevant to the Karafka ecosystem.

Current Status - Active Advisory

One active advisory. See details below.


Incident History

[IN PROGRESS] March 2026 - Kafka Coordinator Recovery API Under Development

  • Status: In Progress
  • Impact: Medium
  • Affected: Karafka users experiencing Kafka group coordinator FAILED state (not_coordinator errors)

This is not a Karafka or librdkafka issue - it originates from bugs in Apache Kafka itself. Kafka group coordinator failures caused by broker-side conditions (log compaction races, OOM kills, unclean restarts, network partitions, and rolling maintenance windows) can leave consumer groups permanently stuck in an initializing state. When the coordinator enters a FAILED state, all group operations (JoinGroup, SyncGroup, OffsetCommit, DeleteGroup) return not_coordinator, and because the failure lives broker-side, restarting consumer pods does not resolve the issue.

Symptoms:

  • Consumers stay in initializing state and never receive partition assignments
  • Karafka Web UI shows the consumer group as empty with no members
  • rdkafka logs report repeating not_coordinator errors
  • Other consumer groups on the same cluster work normally
  • Restarting consumer pods does not resolve the issue
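To confirm the coordinator is at fault rather than the application, the group can be inspected with the stock Kafka CLI tools. A sketch (the bootstrap address and group name are placeholders):

```shell
# Describe the group: a stuck coordinator typically shows no members,
# while other groups on the same cluster list members normally.
kafka-consumer-groups --bootstrap-server broker:9092 \
  --describe --group my-consumer-group

# Include the group state and coordinator broker in the output.
# A healthy group reports Stable; a stuck one stays Empty/Dead
# while not_coordinator errors repeat in the rdkafka logs.
kafka-consumer-groups --bootstrap-server broker:9092 \
  --describe --group my-consumer-group --state
```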

Current Workarounds:

  • Switch to Direct Assignments to bypass the coordinator entirely
  • Use Kafka CLI tools (kafka-reassign-partitions, kafka-leader-election, kafka-consumer-groups) to force a coordinator reload on a healthy broker
  • File a support ticket with your managed Kafka provider (MSK, Confluent Cloud) to apply upstream Kafka patches
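For the CLI workaround it helps to know which __consumer_offsets partition hosts a given group, since that partition's leader is the group coordinator and is the broker to target with kafka-leader-election. A minimal Ruby sketch of Kafka's group-to-partition mapping (a port of Java's String.hashCode; 50 is Kafka's default offsets.topic.num.partitions, adjust if your cluster overrides it):

```ruby
# Returns the __consumer_offsets partition whose leader acts as the
# coordinator for group_id. Mirrors Kafka's
# Utils.abs(groupId.hashCode) % offsets.topic.num.partitions.
def coordinator_partition(group_id, num_partitions = 50)
  # Java String#hashCode: h = 31 * h + char, with 32-bit overflow
  hash = group_id.each_char.reduce(0) { |h, c| (h * 31 + c.ord) & 0xFFFFFFFF }
  # Kafka's Utils.abs drops the sign bit instead of negating
  (hash & 0x7FFFFFFF) % num_partitions
end

coordinator_partition('my-group') # => 36
```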

Resolution: A new Karafka::Admin::Recovery API is under active development. This API will allow users to read committed offsets directly from the __consumer_offsets log (bypassing the broken coordinator), assess the blast radius across consumer groups, and migrate offsets to a new healthy consumer group - all without broker-level intervention. This issue will be fully resolved once the new version of Karafka containing the Recovery API is released.

[RESOLVED] February 11, 2026 - Pro License Server Hardware Failure

  • Status: Resolved
  • Impact: Medium
  • Affected: Karafka Pro users attempting to access the license server during the incident window

A hardware failure of a network switch in Hetzner's infrastructure caused a temporary outage of the Karafka Pro license server. The incident began at approximately 2:47 PM CET and was fully resolved by 3:12 PM CET, for roughly 25 minutes of downtime.

Incident Timeline (CET):

  • 2:47 PM - Initial detection of Hetzner infrastructure issues
  • 2:51 PM - Migration process initiated
  • 3:09 PM - Backup server deployment started
  • 3:12 PM - Service fully restored

Response: The automated disaster recovery system performed the server rotation successfully, but its 15-minute activation threshold meant users experienced the outage until failover kicked in. Users requiring immediate access during the incident were offered the offline (embedded) license setup as a workaround.
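The offline workaround embeds the license token in the application configuration so that boot does not depend on the license server being reachable. A sketch assuming a standard Karafka Pro setup (the environment variable name is a placeholder of this example):

```ruby
# karafka.rb - embedded (offline) Pro license setup
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = { 'bootstrap.servers': '127.0.0.1:9092' }
    # Token supplied locally instead of fetched from the license server
    config.license.token = ENV.fetch('KARAFKA_PRO_LICENSE_TOKEN')
  end
end
```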

Improvements Implemented:

  • Reduced failover activation threshold from 15 minutes to approximately 60 seconds
  • Relocated the standby server to a fully independent provider and datacenter, ensuring infrastructure-level isolation from the primary
  • Tightened health checks from occasional spot checks to 60-second interval monitoring with automatic traffic rerouting on failure
  • Improved data synchronization frequency between primary and standby servers

Resolution: With these changes, a similar infrastructure failure would now be absorbed by automatic failover in under 1 minute, transparently to end users, compared to the ~25 minutes of downtime experienced during this incident.

[RESOLVED] August 4, 2025 - OpenSSL 3.0.17 Segmentation Faults

  • Status: Resolved
  • Impact: High
  • Affected: All Karafka and rdkafka-ruby applications with OpenSSL 3.0.17

OpenSSL 3.0.17 introduced a critical regression in X509_LOOKUP methods that caused widespread segmentation faults in applications using native SSL connections, including rdkafka-ruby and all Karafka applications.

Symptoms: Segmentation faults during SSL/TLS operations, application crashes on startup or during Kafka connections, intermittent crashes in containerized environments, and producer/consumer initialization failures with SSL.

Workaround: Pin OpenSSL to version 3.0.16, or to a 3.0.17 package revision that reverts the change (Debian package revision 2 or higher)
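On Debian-based images, the pin can be applied with apt version selection plus a hold; the exact version string below is an example and should be checked against `apt list -a openssl` for your distribution:

```shell
# Downgrade to the last known-good build (version string is an
# example for Debian 12) and hold the packages so routine image
# rebuilds do not pull the broken 3.0.17 build back in.
apt-get install -y --allow-downgrades \
  openssl=3.0.16-1~deb12u1 libssl3=3.0.16-1~deb12u1
apt-mark hold openssl libssl3
```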

Resolution: Fixed in OpenSSL 3.0.17-1~deb12u2 (released August 10, 2025), which reverted the problematic X509_LOOKUP changes

References:

  • OpenSSL Issue #28171
  • Debian Bug #1110254
  • rdkafka-ruby Issue #667


Last modified: 2026-03-22 21:47:20