Recovering a JetStream Cluster After Quorum Loss
Losing quorum in a JetStream cluster is one of those situations that feels worse than it is. Meta operations stop, health checks fail, and logs fill with warnings. But with a clear understanding of what’s happening, recovery is straightforward.
This expands on a post I wrote for the NATS blog in 2024.
Recognizing Quorum Loss
A JetStream cluster requires a majority of nodes to agree before making changes.1 In a 3-node cluster, you need 2; in a 5-node cluster, you need 3. When that majority can’t be reached, the cluster can’t elect a meta leader, which means no streams or consumers can be created, deleted, or modified. Existing streams may keep serving traffic if their own replica groups still have quorum.
You’ll see messages like these in the logs:
[WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[INF] JetStream cluster no metadata leader
In Kubernetes, you may see pods failing readiness probes:
Readiness probe failed: HTTP probe failed with statuscode: 503
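If you want to see those failures from the Kubernetes side, something like this works (the pod name and label selector are assumptions about how your NATS install is labeled):
# Pods stuck at 0/1 Ready are failing their readiness probes
kubectl get pods -l app.kubernetes.io/name=nats
# The probe failures show up in the pod's events
kubectl describe pod nats-0 | grep -i readiness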
The NATS CLI will confirm the problem:
$ nats server report jetstream
WARNING: No cluster meta leader found. The cluster expects 6 nodes but only 3 responded.
JetStream operations require at least 4 up nodes.
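That “at least 4 up nodes” figure is just the quorum formula applied to the inflated peer count:
# Quorum for an N-node group is floor(N/2) + 1
N=6
echo $(( N / 2 + 1 ))   # 4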
Common Causes
The most common cause I’ve seen: renaming cluster nodes in bulk rather than one at a time.
When you rename nodes simultaneously, the cluster ends up expecting both the old and new names. If you had 3 nodes and renamed all of them, the cluster now expects 6 nodes but only 3 are responding. You’ve lost quorum not because nodes are down, but because the cluster is waiting for nodes that no longer exist.
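The identity being tracked here is the server name in the NATS configuration. A minimal sketch of the relevant config, with illustrative names and paths:
# Changing server_name is what "renaming" a node means to the meta group
cat > nats-0.conf <<'EOF'
server_name: nats-0          # the identity recorded in the Raft meta group
jetstream {
  store_dir: /data/jetstream
}
cluster {
  name: demo
  listen: 0.0.0.0:6222
}
EOF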
Other causes include:
- Network partitions isolating a minority of nodes
- Storage failures preventing nodes from participating
- Misconfigured cluster membership
Recovery Steps
1. Regain Quorum First
You can’t remove stale peers without a leader, and you can’t elect a leader without quorum.2 The first step is getting enough nodes online to reach majority.
If the cluster expects N nodes but only M are responding, and M is below the quorum of floor(N/2) + 1, you need to add nodes until a majority can respond.
Examples below use Kubernetes; adapt the scaling commands for your deployment:
# Example: cluster expects 6 nodes, 3 responding - need 4 for quorum
kubectl scale --replicas=4 statefulset/nats
Wait for the new node to join and a leader to be elected.
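If you'd rather script the wait than eyeball it, a small loop against the same report works (this assumes your CLI context already points at the cluster):
# Poll until the "no meta leader" warning disappears
while nats server report jetstream 2>&1 | grep -q "No cluster meta leader"; do
  echo "still no meta leader, waiting..."
  sleep 5
done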
2. Remove Stale Peers
Once a leader exists, you can remove the old peer entries.3 Peer IDs are shown in the “ID” column of nats server report jetstream. The peer-remove command accepts either server name or peer ID.
Peer removal signals to JetStream that a node will never return:
# Using the CLI (preferred)
nats server cluster peer-remove -f <peer_id>
# Or using the JetStream API directly
nats request '$JS.API.SERVER.REMOVE' '{"peer":"","peer_id":"<peer_id>"}'
You’ll get confirmation:
{
  "type": "io.nats.jetstream.api.v1.meta_server_remove_response",
  "success": true
}
3. Clean Up
After removing stale peers, scale back to your desired replica count and remove the temporary node:
kubectl scale --replicas=3 statefulset/nats
nats server cluster peer-remove -f <temporary_peer_id>
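Finally, confirm the cluster looks healthy again: a meta leader, the expected peer count, and streams back at their full replica counts.
# Meta leader elected, peer count back to the desired 3
nats server report jetstream
# Streams should have rebalanced onto the remaining peers automatically
nats stream report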
Prevention
The key insight: Raft-based systems track membership by node identity. Renaming every node at once looks like a mass failure followed by a wave of unfamiliar nodes joining, so the expected peer set doubles while only the renamed half of it responds.
When renaming nodes or making identity changes:
- Do it one node at a time
- Wait for the cluster to stabilize between changes
- Verify nats server report jetstream shows a healthy state before proceeding
This patience prevents the quorum loss scenario entirely.
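For the Kubernetes setup above, a one-node-at-a-time rename looks roughly like this. The actual rename step depends on how server_name is templated in your deployment, so treat it as a sketch:
for i in 0 1 2; do
  echo "renaming nats-$i"
  # 1. Apply the new server_name for nats-$i (deployment-specific)
  # 2. Wait for the pod to restart and rejoin the cluster
  kubectl rollout status statefulset/nats
  # 3. Verify a meta leader exists and all peers are healthy before moving on
  nats server report jetstream
done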
1. JetStream Clustering - A quorum is half the cluster size plus one, the minimum number of nodes needed to ensure data consistency after failure.
2. Disaster Recovery - NATS will create replacement stream replicas automatically once quorum is restored and stale nodes are removed.
3. Recovering NATS JetStream Quorum - Official NATS blog post with detailed recovery steps. The peer-remove command removes a node from the cluster's Raft meta group and triggers automatic replica rebalancing.