John Weldon

Recovering a JetStream Cluster After Quorum Loss

Losing quorum in a JetStream cluster is one of those situations that feels worse than it is. Meta operations stop, health checks fail, and logs fill with warnings. But with a clear understanding of what’s happening, recovery is straightforward.

This expands on a post I wrote for the NATS blog in 2024.

Recognizing Quorum Loss

A JetStream cluster requires a majority of nodes to agree before making changes.1 In a 3-node cluster, you need 2. In a 5-node cluster, you need 3. When that majority can’t be reached, the cluster can’t elect a meta leader - meaning no stream or consumer creation, deletion, or modification. Existing streams may continue operating if they have independent quorum.
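
To check whether individual streams still have their own quorum, the stream report is a quick way to look (a minimal check, assuming the nats CLI can reach the cluster):

# Streams whose replicas retain a majority keep serving
# even while the meta layer has no leader
nats stream report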

You’ll see messages like these in the logs:

[WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[INF] JetStream cluster no metadata leader

In Kubernetes, you may see pods failing readiness probes:

Readiness probe failed: HTTP probe failed with statuscode: 503
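
Something like this shows which pods are affected (the app=nats label is an assumption; match it to your deployment's labels):

# List NATS pods and their readiness state
kubectl get pods -l app=nats -o wide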

The NATS CLI will confirm the problem:

$ nats server report jetstream
WARNING: No cluster meta leader found. The cluster expects 6 nodes but only 3 responded.
JetStream operations require at least 4 up nodes.

Common Causes

The most common cause I’ve seen: renaming cluster nodes in bulk rather than one at a time.

When you rename nodes simultaneously, the cluster ends up expecting both the old and new names. If you had 3 nodes and renamed all of them, the cluster now expects 6 nodes but only 3 are responding. You’ve lost quorum not because nodes are down, but because the cluster is waiting for nodes that no longer exist.
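
Concretely, the identity in question is the server_name in each node's configuration. A bulk rename looks like this (file and names are illustrative):

$ grep server_name nats-0.conf
server_name: "nats-0"

# After the rename - same node, but a brand new identity to the cluster
$ grep server_name nats-0.conf
server_name: "nats-east-0"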

Other causes include:

- Multiple nodes failing at once (crashes, bad rolling restarts)
- A network partition that leaves no majority of nodes able to reach each other
- Scaling a deployment down without first peer-removing the departing nodes

Recovery Steps

1. Regain Quorum First

You can’t remove stale peers without a leader, and you can’t elect a leader without quorum.2 The first step is getting enough nodes online to reach majority.

If the cluster expects N nodes but only M are responding (where M is below the quorum threshold of floor(N/2) + 1), you need to add nodes until a majority can respond.

Examples below use Kubernetes; adapt the scaling commands for your deployment:

# Example: cluster expects 6 nodes, 3 responding - need 4 for quorum
kubectl scale --replicas=4 statefulset/nats

Wait for the new node to join and a leader to be elected.
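
One way to confirm both, continuing the Kubernetes example:

# Wait for the new pod to become ready
kubectl rollout status statefulset/nats

# Look for a meta leader in the report output
nats server report jetstream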

2. Remove Stale Peers

Once a leader exists, you can remove the old peer entries.3 Peer IDs are shown in the “ID” column of nats server report jetstream. The peer-remove command accepts either server name or peer ID.

Peer removal signals to JetStream that a node will never return:

# Using the CLI (preferred)
nats server cluster peer-remove -f <peer_id>

# Or using the JetStream API directly (request, not publish, so the reply is shown)
nats request '$JS.API.SERVER.REMOVE' '{"peer":"","peer_id":"<peer_id>"}'

You’ll get confirmation:

{
  "type": "io.nats.jetstream.api.v1.meta_server_remove_response",
  "success": true
}

3. Clean Up

After removing stale peers, scale back to your desired replica count and remove the temporary node:

kubectl scale --replicas=3 statefulset/nats
nats server cluster peer-remove -f <temporary_peer_id>
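
With quorum restored and stale peers removed, NATS creates replacement stream replicas automatically (see footnote 2), so a final look at the streams confirms the rebalancing:

# Replica counts should be back to their configured level
nats stream report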

Prevention

The key insight: Raft-based systems track membership by node identity. Changing identities in bulk looks like a mass failure followed by new nodes joining, which confuses the membership tracking.

When renaming nodes or making identity changes:

- Rename one node at a time, never all at once
- Wait for the renamed node to rejoin and for the cluster to report healthy
- Peer-remove the old name before moving on to the next node

This patience prevents the quorum loss scenario entirely.
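
For renames specifically, the loop looks something like this (a sketch assuming a 3-node cluster; node names are illustrative):

# For each node, one at a time:
# 1. Stop the node, change server_name in its config
#    (e.g. nats-0 -> nats-east-0), and restart it
# 2. The renamed node joins as a new peer; remove the old name
nats server cluster peer-remove -f nats-0

# 3. Confirm a meta leader exists and the report is clean
#    before touching the next node
nats server report jetstream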


  1. JetStream Clustering - A quorum is half the cluster size plus one, the minimum number of nodes needed to ensure data consistency after failure.

  2. Disaster Recovery - NATS will create replacement stream replicas automatically once quorum is restored and stale nodes are removed.

  3. Recovering NATS JetStream Quorum - Official NATS blog post with detailed recovery steps. The peer-remove command removes a node from the cluster's RAFT meta group and triggers automatic replica rebalancing.