How Data Shard Failures Affect Request Success Rate - Tair

In a Redis cluster instance with N data shards, a single shard failure triggers a master-replica failover. During failover — which typically lasts from a few seconds to tens of seconds — all requests routed to that shard fail. If traffic is evenly distributed across shards, the theoretical failure rate is 1/N.

In practice, the actual failure rate often exceeds this theoretical value. The following sections explain the three causes and how to address each one. The examples use a cluster instance that contains two data shards and runs in proxy mode.

Note

In direct connection mode, only multi-key commands cause the actual failure rate to exceed 1/N.

Why the actual failure rate exceeds 1/N

The three causes fall into two layers:

Protocol layer: Multi-key commands that span the failing shard
Client layer: Single-connection async clients and connection pool exhaustion

Multi-key commands touch the failing shard

When a command references keys distributed across multiple shards, a single shard failure can cause the entire command to fail — not just the portion targeting the unavailable shard.

Proxy mode: The proxy splits multi-key commands into subcommands and routes each one to the corresponding shard via the routing table.
Direct connection mode: The client sends each subcommand directly to the corresponding shard.

If any subcommand targets a failing shard, the full request fails.
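The routing step can be sketched in a few lines. This is a minimal illustration, not the proxy's actual implementation: it assumes the instance follows the open-source Redis Cluster hash slot rule (CRC16 of the key modulo 16384, with hash tag support) and that each shard owns an equal, contiguous slot range. The names key_slot, shard_for, and mget_outcome are hypothetical.

```python
import binascii
from collections import defaultdict

TOTAL_SLOTS = 16384  # slot count used by open-source Redis Cluster


def key_slot(key: str) -> int:
    """Hash slot of a key: CRC16 (XMODEM variant) modulo 16384."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # non-empty hash tag: hash only the tag content
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % TOTAL_SLOTS


def shard_for(key: str, shard_count: int) -> int:
    # Assumption: shard i owns an equal, contiguous slice of the slot space.
    return key_slot(key) * shard_count // TOTAL_SLOTS


def mget_outcome(keys, shard_count, failing_shard):
    """Group keys by shard and report whether the whole command can succeed."""
    groups = defaultdict(list)
    for key in keys:
        groups[shard_for(key, shard_count)].append(key)
    # One subcommand on the failing shard is enough to fail the full request.
    succeeded = failing_shard not in groups
    return dict(groups), succeeded
```

Under these assumptions, with two shards and shard 1 failing, mget_outcome(["foo", "hello"], 2, 1) reports failure even though hello is stored on the healthy shard.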

Fix: Minimize the number of keys per command to reduce the blast radius of a shard failure.
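A rough estimate shows why fewer keys per command help. Assuming each key lands on a shard independently and uniformly at random (an idealization; real key distributions and hash tags will differ), a command with k keys touches the single failing shard with probability 1 - (1 - 1/N)^k. The function name below is hypothetical.

```python
def command_failure_rate(shard_count: int, keys_per_command: int) -> float:
    """Chance that at least one of the command's keys is on the failing shard.

    Assumes exactly one shard is failing and that keys are spread
    independently and uniformly across all shards.
    """
    healthy_fraction = 1 - 1 / shard_count
    return 1 - healthy_fraction ** keys_per_command


if __name__ == "__main__":
    # Two-shard instance, as in the examples in this topic.
    for keys in (1, 2, 4):
        print(keys, command_failure_rate(2, keys))
```

For the two-shard example, the estimate rises from 0.5 for a single-key command to 0.75 for two keys and 0.9375 for four keys, which is why the single-key rate of 1/N is only a lower bound for multi-key workloads.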

Single-connection clients propagate one failure to subsequent requests

Some clients, such as Lettuce, multiplex requests asynchronously over a single connection. Redis Serialization Protocol (RESP) requires responses to be returned in the same order as requests were sent. If an intermediate request fails — for example, GET key2 fails because its target shard is unavailable — the client cannot receive the response to any request sent after it, even if those requests succeeded.

Fix: Switch to a client that supports connection pooling, such as Jedis.
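The difference between the two client styles can be illustrated with a simplified model. This sketch does not use Lettuce or Jedis; it only assumes that a shared connection delivers replies strictly in request order, and that a pooled client gives each request its own connection. The shard assignments and function names are made up for illustration.

```python
def single_connection(requests, failing_shard):
    """Outcomes when every request shares one connection with in-order replies."""
    outcomes = []
    stalled = False
    for command, shard in requests:
        # Once one reply is missing, later replies cannot be delivered past it.
        stalled = stalled or shard == failing_shard
        outcomes.append((command, "failed" if stalled else "ok"))
    return outcomes


def pooled_connections(requests, failing_shard):
    """Outcomes when each request runs on its own connection."""
    return [
        (command, "failed" if shard == failing_shard else "ok")
        for command, shard in requests
    ]


# Hypothetical placement: key2 lives on shard 1, the others on shard 0.
requests = [("GET key1", 0), ("GET key2", 1), ("GET key3", 0), ("GET key4", 0)]

if __name__ == "__main__":
    print(single_connection(requests, failing_shard=1))
    print(pooled_connections(requests, failing_shard=1))
```

In this model, the shared connection loses GET key2 and everything sent after it, while the pooled client loses only GET key2.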

Connection pool exhaustion blocks or fails new requests

Connection pool clients have a configurable maximum number of connections. When all connections are occupied and no idle connections are available, new requests either fail or block — depending on the blockWhenExhausted setting.

Consider a Jedis client with maxTotal set to 3 and timeout set to 2000 ms. If three GET key2 requests are issued in quick succession and the target shard is slow or unresponsive, all three connections stay occupied until their requests time out. The pool remains exhausted until the earliest of those requests times out, which can take up to 2 seconds, and any new request during this window either fails immediately or blocks until a connection is released.

Fix: Set an appropriate JedisPool size and reduce the timeout from the default 2,000 ms to 200–300 ms. A shorter timeout limits how long a request to the failing shard occupies a pooled connection, so requests to healthy shards can obtain a connection sooner.
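The effect of the timeout on pool availability can be illustrated with a small deterministic simulation. It is a simplified model rather than Jedis itself: it assumes the pool rejects requests immediately when exhausted (as with blockWhenExhausted set to false), that a request to the failing shard holds its connection for the full timeout, and that a healthy request finishes in 1 ms. The function name simulate_pool and the arrival times are hypothetical.

```python
import heapq


def simulate_pool(requests, max_total, timeout_ms, healthy_latency_ms=1):
    """Return one outcome per request for a non-blocking connection pool.

    requests: list of (arrival_ms, hits_failing_shard), sorted by arrival time.
    """
    busy_until = []  # min-heap of release times of borrowed connections
    outcomes = []
    for arrival_ms, hits_failing_shard in requests:
        # Connections whose request has finished or timed out return to the pool.
        while busy_until and busy_until[0] <= arrival_ms:
            heapq.heappop(busy_until)
        if len(busy_until) >= max_total:
            outcomes.append("rejected")  # pool exhausted, no idle connection
            continue
        hold_ms = timeout_ms if hits_failing_shard else healthy_latency_ms
        heapq.heappush(busy_until, arrival_ms + hold_ms)
        outcomes.append("timeout" if hits_failing_shard else "ok")
    return outcomes


# Three requests hit the failing shard, then three target a healthy shard.
requests = [(0, True), (100, True), (200, True),
            (500, False), (1000, False), (2300, False)]

if __name__ == "__main__":
    print(simulate_pool(requests, max_total=3, timeout_ms=2000))
    print(simulate_pool(requests, max_total=3, timeout_ms=250))
```

With a 2000 ms timeout, the healthy requests arriving at 500 ms and 1000 ms are rejected because all three connections are still waiting on the failing shard. With a 250 ms timeout, the same healthy requests all succeed, because the stuck connections have already been released by the time they arrive.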