Your Redis cluster lost a master and you expect a replica to take over within 15 seconds. Five minutes later the cluster is still in the fail state: CLUSTER NODES shows the master as fail and the replicas still as slave, no promotion has happened, and your app returns errors on any key in that slot range. This is a stuck failover. The cause is almost always quorum loss, a network partition between the voters, a replica priority misconfiguration, or replicas that fell too far behind to be eligible. Note that Redis Cluster and Redis Sentinel are separate failover mechanisms: in a cluster-enabled deployment the masters vote on promotions, while Sentinels do so for a non-clustered master. Apply the checks that match your deployment. Fix it by checking quorum, restoring connectivity between voters, and, if necessary, forcing a manual takeover.
Common causes
Ordered from most to least common.
1. Sentinel quorum not reached
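You have 3 Sentinels but only 1 considers the master down. The configured quorum (typically 2 of 3) is the number of Sentinels that must agree the master is unreachable, and a majority of all Sentinels must then authorize the failover, so no failover starts.
How to spot it: SENTINEL MASTERS on each Sentinel. If only one Sentinel reports the master as down, quorum is missing. SENTINEL CKQUORUM <name> reports whether quorum and majority are currently reachable.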
2. Network partition between Sentinels and replicas
Sentinels can see each other but not the replicas. Sentinels vote to start failover but cannot promote.
How to spot it: From each Sentinel host, redis-cli -h <replica> PING. Failures indicate partition.
3. Replica priority set to 0
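Setting replica-priority 0 marks a replica as ineligible for promotion by Sentinel. If all replicas have priority 0, no failover is possible. Redis Cluster does not use this setting; its equivalent is cluster-replica-no-failover yes, which stops a replica from starting an automatic failover.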
How to spot it: CONFIG GET replica-priority on each replica. 0 = ineligible.
4. Replicas too far behind master
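The default cluster-replica-validity-factor is 10. A replica is ineligible once its last contact with the master is older than (cluster-node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period. With the defaults (15-second node timeout, 10-second ping period) that limit is 160 seconds. A long disconnect means no eligible replica.
How to spot it: INFO replication on each replica. master_link_down_since_seconds above that limit (160 with defaults) = replica too stale to be promoted.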
5. Cluster in cluster-require-full-coverage mode
With this set to yes, losing coverage of any slot range makes the whole cluster stop accepting queries, not only those for the missing slots. The default is yes, for safety.
How to spot it: CONFIG GET cluster-require-full-coverage. yes plus a failed slot range = the cluster rejects queries.
6. Manual failover lock
Someone ran CLUSTER FAILOVER TAKEOVER and left the cluster in an inconsistent state. Subsequent automatic failover refuses to act.
How to spot it: Recent operational logs show manual cluster commands. Check CLUSTER NODES for unusual flags.
Before you start
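- Confirm the cluster is actually stuck and not just slow. A default failover takes 15 to 30 seconds: failure detection is governed by cluster-node-timeout (default 15000 ms) in Redis Cluster and by down-after-milliseconds (default 30000 ms) in Sentinel.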
- Identify the affected slot ranges and which masters are down.
- Check application impact: which key prefixes are unreachable.
- Document the current state of every node before changing anything.
- Have a rollback plan: take a snapshot before touching the cluster.
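To identify the down masters and their slot ranges, the CLUSTER NODES output can be filtered mechanically. The following is a minimal sketch, assuming the standard CLUSTER NODES line layout (node ID, address, flags, master ID, ping sent, pong received, config epoch, link state, then slots). failed_masters is a hypothetical helper name, not a Redis command.

```shell
# failed_masters: reads CLUSTER NODES output on stdin and prints
# "<address> <slots>" for every node whose flags include both
# "master" and "fail". Nodes flagged "fail?" (suspected, not yet
# confirmed by a majority) are deliberately ignored.
failed_masters() {
  awk '{
    n = split($3, flags, ",")
    is_master = 0; is_fail = 0
    for (i = 1; i <= n; i++) {
      if (flags[i] == "master") is_master = 1
      if (flags[i] == "fail")   is_fail = 1
    }
    if (is_master && is_fail) {
      # Fields 9 and onward hold the slot ranges owned by the node.
      slots = ""
      for (i = 9; i <= NF; i++) slots = slots (slots == "" ? "" : " ") $i
      print $2, (slots == "" ? "-" : slots)
    }
  }'
}
```

Usage: redis-cli -h <any-healthy-node> CLUSTER NODES | failed_masters. Running it against several nodes and comparing the results shows whether they agree on which masters are down.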
Information to collect
- CLUSTER NODES output from at least three different nodes.
- INFO replication from every replica.
- SENTINEL MASTERS and SENTINEL SLAVES <name> from every Sentinel.
- Network connectivity check between every Sentinel and every node.
- Recent operational logs and any manual commands run.
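Capturing this state to files before changing anything gives you a baseline to compare against later. The sketch below is one possible approach, not a standard tool: collect_state and the REDIS_CLI override variable are hypothetical names, and nodes are assumed to be passed as host:port pairs.

```shell
# collect_state OUT_DIR HOST:PORT [HOST:PORT ...]
# Saves CLUSTER NODES and INFO replication from each node into OUT_DIR.
# REDIS_CLI may be set to an alternative client binary (assumption:
# it accepts the same -h/-p arguments as redis-cli).
collect_state() {
  out_dir=$1
  shift
  mkdir -p "$out_dir" || return 1
  for node in "$@"; do
    host=${node%:*}
    port=${node##*:}
    "${REDIS_CLI:-redis-cli}" -h "$host" -p "$port" CLUSTER NODES \
      > "$out_dir/$host-$port.cluster-nodes.txt" 2>&1
    "${REDIS_CLI:-redis-cli}" -h "$host" -p "$port" INFO replication \
      > "$out_dir/$host-$port.info-replication.txt" 2>&1
  done
}
```

Usage: collect_state ./redis-state-before 10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379. Errors from unreachable nodes land in the same files, which is itself useful evidence of a partition. Sentinel output can be captured the same way on port 26379.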
Step-by-step fix
Step 1: Verify quorum
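# On each Sentinel
redis-cli -p 26379 SENTINEL MASTERS
# Ask the Sentinel whether quorum and majority are currently reachable
redis-cli -p 26379 SENTINEL CKQUORUM <name>
# If only 1 of 3 Sentinels flags the master as s_down,
# the other 2 still consider it reachable -> quorum not reached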
If quorum is lost, restore network connectivity first. Do not adjust quorum threshold downward; doing so introduces split-brain risk.
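Sentinel applies two separate thresholds: the configured quorum decides whether the master is declared objectively down, and a majority of all known Sentinels must then authorize the failover. The sketch below models that arithmetic in a simplified way. majority and failover_possible are hypothetical helper names, and "agreeing" is assumed to mean Sentinels that both see the master as down and can reach each other.

```shell
# majority TOTAL: smallest count that is more than half of TOTAL.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# failover_possible AGREEING QUORUM TOTAL
# Succeeds (exit 0) only if the agreeing Sentinels satisfy both the
# configured quorum and the majority needed to authorize a failover.
failover_possible() {
  agreeing=$1
  quorum=$2
  total=$3
  [ "$agreeing" -ge "$quorum" ] && [ "$agreeing" -ge "$(majority "$total")" ]
}
```

For example, failover_possible 1 1 3 fails even though a quorum of 1 is met, because 1 of 3 Sentinels is not a majority. This is why lowering the quorum does not unblock a failover when most Sentinels are unreachable.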
Step 2: Restore network between Sentinels and replicas
# Test connectivity from each Sentinel host
for replica in replica1 replica2 replica3; do
  redis-cli -h "$replica" -p 6379 PING
done
# Check firewall and security groups (-n prints numeric ports so grep can match)
iptables -L -n | grep 6379
# Redis Cluster nodes also need the cluster bus port (data port + 10000, here 16379) open between them
# AWS: check security group allows Sentinel subnets
# Re-establish if blocked
ufw allow from <sentinel-subnet> to any port 6379
Step 3: Check and fix replica priority
# On each replica
redis-cli -h replica1 CONFIG GET replica-priority
# Restore eligibility
redis-cli -h replica1 CONFIG SET replica-priority 100
# Persist to config file
redis-cli -h replica1 CONFIG REWRITE
A lower non-zero value is preferred for promotion; 0 means never promote. The default is 100.
Step 4: Check replica freshness
# On each replica
redis-cli -h replica1 INFO replication
# Look for
# master_link_status: up | down
# master_link_down_since_seconds
# master_last_io_seconds_ago
If a replica has been disconnected for longer than (cluster-node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period, which is 160 seconds with the defaults, it is ineligible. Either raise the validity factor or wait for the replica to resync.
redis-cli CONFIG SET cluster-replica-validity-factor 20
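To see what a given validity factor means in seconds, the staleness limit can be worked out from the three settings involved. This is a small sketch of the formula documented in redis.conf: (cluster-node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period. max_replica_age_s is a hypothetical helper name.

```shell
# max_replica_age_s NODE_TIMEOUT_MS VALIDITY_FACTOR PING_PERIOD_S
# Prints the longest time (in seconds) a replica may have been out of
# contact with its master and still attempt a failover.
# A validity factor of 0 disables the check entirely.
max_replica_age_s() {
  node_timeout_ms=$1
  validity_factor=$2
  ping_period_s=$3
  if [ "$validity_factor" -eq 0 ]; then
    echo "unlimited"
  else
    echo $(( node_timeout_ms * validity_factor / 1000 + ping_period_s ))
  fi
}
```

With the defaults, max_replica_age_s 15000 10 10 prints 160. Raising the factor to 20 as in the command above gives 310 seconds, so compare that figure with master_link_down_since_seconds before deciding how far to raise it.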
Step 5: Force a manual takeover (last resort)
If automatic failover refuses and the cluster is in a stuck state:
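# On the chosen replica (if the master is unreachable, plain CLUSTER FAILOVER will not complete; use FORCE)
redis-cli -h replica1 CLUSTER FAILOVER
# Options:
# CLUSTER FAILOVER -- normal, coordinated with the master (master must be reachable)
# CLUSTER FAILOVER FORCE -- skips the master handshake; still needs a majority of masters to authorize
# CLUSTER FAILOVER TAKEOVER -- no cluster-wide authorization, last resort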
TAKEOVER skips consensus checks. Use only when:
- The old master is confirmed dead
- A majority of masters is unreachable, so FORCE cannot be authorized, and other replicas are confirmed unreachable
- You accept potential data loss of writes that did not replicate
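The choice between the three variants comes down to two questions: can the replica still reach its master, and can it still reach a majority of masters for authorization? The sketch below encodes that decision as a simplified model. choose_failover_mode is a hypothetical helper, and the yes/no inputs are assumed to come from your own connectivity checks.

```shell
# choose_failover_mode MASTER_REACHABLE MAJORITY_OF_MASTERS_REACHABLE
# Both arguments are "yes" or "no". Prints the least aggressive
# CLUSTER FAILOVER variant that can succeed in that situation.
choose_failover_mode() {
  master_reachable=$1
  majority_reachable=$2
  if [ "$majority_reachable" != yes ]; then
    # No majority to authorize the promotion: only TAKEOVER can proceed.
    echo "CLUSTER FAILOVER TAKEOVER"
  elif [ "$master_reachable" = yes ]; then
    echo "CLUSTER FAILOVER"
  else
    echo "CLUSTER FAILOVER FORCE"
  fi
}
```

For a dead master with the other masters healthy, choose_failover_mode no yes prints CLUSTER FAILOVER FORCE. The helper only suggests a command; the checklist above still applies before running TAKEOVER.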
Step 6: After promotion, verify cluster state
redis-cli CLUSTER INFO
# Expected:
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_known_nodes:<expected>
# Then confirm the promoted node is listed as master and owns the slot range
redis-cli CLUSTER NODES
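The same check can be scripted so it is easy to repeat against every node. The following is a minimal sketch that reads CLUSTER INFO output on stdin. cluster_healthy is a hypothetical helper name, and it checks only the two fields listed above.

```shell
# cluster_healthy: reads CLUSTER INFO output on stdin.
# Exits 0 when cluster_state is ok and all 16384 slots are ok,
# non-zero otherwise. Carriage returns are stripped because
# CLUSTER INFO lines are terminated with CRLF.
cluster_healthy() {
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"    { state = $2 }
    $1 == "cluster_slots_ok" { slots_ok = $2 }
    END { exit !(state == "ok" && slots_ok == 16384) }
  '
}
```

Usage: redis-cli -h <node> CLUSTER INFO | cluster_healthy && echo healthy. Run it against each node, since nodes can briefly disagree about cluster state right after a promotion.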
Step 7: Adjust cluster-require-full-coverage for graceful degradation
# Allow cluster to serve covered slot ranges even when one is down
redis-cli CONFIG SET cluster-require-full-coverage no
redis-cli CONFIG REWRITE
Trade-off: requests for uncovered slots will still fail, but the rest of the cluster keeps serving. Apply the setting on every node so they behave consistently.
Verify
- CLUSTER INFO shows cluster_state:ok from every node.
- SENTINEL MASTERS shows a healthy master with no s_down or o_down flag set.
- Application can read and write keys in the previously affected slot range.
- Replica lag (master_repl_offset difference) returns to under 100 KB.
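Replica lag in bytes can be read from the new master's INFO replication output, which lists each attached replica with its acknowledged offset. The sketch below assumes the usual slaveN:ip=...,port=...,state=...,offset=...,lag=... line format. replica_lag_bytes is a hypothetical helper name.

```shell
# replica_lag_bytes: reads a master's INFO replication output on stdin
# and prints "<ip>:<port> <bytes behind master>" for each replica,
# sorted by address.
replica_lag_bytes() {
  tr -d '\r' | awk -F: '
    $1 == "master_repl_offset" { master = $2 }
    $1 ~ /^slave[0-9]+$/ {
      n = split($2, kv, ",")
      for (i = 1; i <= n; i++) {
        split(kv[i], pair, "=")
        if (pair[1] == "ip")     ip = pair[2]
        if (pair[1] == "port")   port = pair[2]
        if (pair[1] == "offset") offsets[ip ":" port] = pair[2]
      }
    }
    # master_repl_offset appears after the replica lines, so the
    # subtraction happens once all input has been read.
    END { for (r in offsets) print r, master - offsets[r] }
  ' | sort
}
```

Usage: redis-cli -h <new-master> INFO replication | replica_lag_bytes. Values under 102400 correspond to the 100 KB target above.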
Long-term prevention
- Use an odd number of Sentinels (3 or 5) across at least 3 availability zones.
- Set replica-priority explicitly for every replica; never leave at 0 unless intentional.
- Monitor replica lag; alert when over 10 seconds for more than 1 minute.
- Run failover drills monthly; verify that automatic promotion completes in under 30 seconds.
- Document your cluster-require-full-coverage choice: no favors availability of the slots that are still covered, yes (the default) favors consistency.
Common pitfalls
- Lowering the Sentinel quorum below a majority to "fix" failover: this invites split-brain.
- Running CLUSTER FAILOVER TAKEOVER without confirming the old master is dead: data loss.
- Forgetting to CONFIG REWRITE after changing priority: settings revert on restart.
- Putting all Sentinels in one AZ: a single zone failure takes out every voter, so no failover can happen.
FAQ
Why does my failover take 30 seconds even when healthy? Default down-after-milliseconds is 30000. Lower to 5000 to 10000 for faster detection, accepting more false positives.
Should I raise cluster-replica-validity-factor? Only if you accept promoting a replica that may be well behind the old master. For caches: fine. For session stores: prefer waiting.
Can I prevent split-brain entirely? No, but you can minimize it: odd Sentinel count, multi-AZ deployment, conservative quorum, no manual TAKEOVER unless old master is confirmed dead.
Tags: #Backend #Troubleshooting #redis