Karafka Ecosystem Status¶
About This Page
This page tracks external incidents and ecosystem-level disruptions that may affect Karafka, WaterDrop, and Web UI users. This is not a Karafka uptime status page, but a curated log of upstream issues, dependency problems, and third-party service disruptions relevant to the Karafka ecosystem.
Current Status - Active Advisory
One active advisory. See details below.
Incident History¶
[IN PROGRESS] March 2026 - Kafka Coordinator Recovery API Under Development
- Status: In Progress
- Impact: Medium
- Affected: Karafka users experiencing Kafka group coordinator
FAILEDstate (not_coordinatorerrors)
This is not a Karafka or librdkafka issue - it originates from bugs in Apache Kafka itself. Kafka group coordinator failures caused by broker-side conditions (log compaction races, OOM, unclean restarts, network partitions, and rolling maintenance windows) can leave consumer groups permanently stuck in initializing state. When the coordinator enters a FAILED state, all group operations (JoinGroup, SyncGroup, OffsetCommit, DeleteGroup) return not_coordinator, and restarting consumer pods does not resolve the issue.
Symptoms:
- Consumers stay in
initializingstate and never receive partition assignments - Karafka Web UI shows the consumer group as empty with no members
rdkafkalogs report repeatingnot_coordinatorerrors- Other consumer groups on the same cluster work normally
- Restarting consumer pods does not resolve the issue
Current Workarounds:
- Switch to Direct Assignments to bypass the coordinator entirely
- Use Kafka CLI tools (
kafka-reassign-partitions,kafka-leader-election,kafka-consumer-groups) to force a coordinator reload on a healthy broker - File a support ticket with your managed Kafka provider (MSK, Confluent Cloud) to apply upstream Kafka patches
Resolution: A new Karafka::Admin::Recovery API is under active development. This API will allow users to read committed offsets directly from the __consumer_offsets log (bypassing the broken coordinator), assess the blast radius across consumer groups, and migrate offsets to a new healthy consumer group - all without broker-level intervention. This issue will be fully resolved once the new version of Karafka containing the Recovery API is released.
[RESOLVED] February 11, 2026 - Pro License Server Hardware Failure
- Status: Fully Resolved
- Impact: Moderate
- Affected: Karafka Pro users attempting to access the license server during the incident window
A hardware failure on a network switch at Hetzner's infrastructure caused a temporary outage of the Karafka Pro license server. The incident began at approximately 2:47 PM CET and was fully resolved by 3:12 PM CET, resulting in approximately 25 minutes of downtime.
Incident Timeline (CET):
- 2:47 PM - Initial detection of Hetzner infrastructure issues
- 2:51 PM - Migration process initiated
- 3:09 PM - Backup server deployment started
- 3:12 PM - Service fully restored
Response: The automated disaster recovery system successfully performed server rotation, though the 15-minute threshold before activation meant some users experienced service interruption. Users requiring immediate access during the incident were offered offline (embedded) license setup as a workaround.
Improvements Implemented:
- Reduced failover activation threshold from 15 minutes to approximately 60 seconds
- Relocated the standby server to a fully independent provider and datacenter, ensuring infrastructure-level isolation from the primary
- Upgraded health checks from periodic to continuous 60-second interval monitoring with automatic traffic rerouting
- Improved data synchronization frequency between primary and standby servers
Resolution: These changes mean that a similar infrastructure failure would now result in under 1 minute of automatic failover, transparent to end users, compared to the ~25 minutes experienced during this incident.
[RESOLVED] August 4, 2025 - OpenSSL 3.0.17 Segmentation Faults
- Status: Resolved
- Impact: High
- Affected: All Karafka and rdkafka-ruby applications with OpenSSL 3.0.17
OpenSSL 3.0.17 introduced a critical regression in X509_LOOKUP methods that caused widespread segmentation faults in applications using native SSL connections, including rdkafka-ruby and all Karafka applications.
Symptoms: Segmentation faults during SSL/TLS operations, application crashes on startup or during Kafka connections, intermittent crashes in containerized environments, and producer/consumer initialization failures with SSL.
Workaround: Pin OpenSSL to version 3.0.16 or 3.0.17 revision 2 or higher
Resolution: Fixed in OpenSSL 3.0.17-1~deb12u2 (released August 10, 2025) which reverted the problematic X509_LOOKUP changes
References: - OpenSSL Issue #28171 - Debian Bug #1110254 - rdkafka-ruby Issue #667
Last modified: 2026-03-22 21:47:20