
Re-engineering Sprinklr's ACD: Building High Availability Into Our Call-Routing Engine
At the core of Sprinklr’s Service platform lies the Automatic Call Distribution (ACD) engine: the system responsible for routing every incoming and outgoing voice call and social case to the right customer support agent.
It is the heart of our customer service platform, handling millions of interactions across thousands of enterprise clients every day.
Originally built as a single-leader system, ACD relied on in-memory state and synchronous logic to achieve high-speed, real-time call routing.
This design delivered excellent performance, but it came with natural constraints around fault tolerance and failover.
As Sprinklr’s platform grew and the bar for availability rose, we recognized the need to re-architect ACD into a distributed, highly available system without sacrificing the low-latency performance it was known for.
In this post, we’ll walk through how we evolved ACD’s architecture to meet higher availability standards, support zero-downtime resilience, and enable large-scale concurrency across multiple pods.
Along the way, we’ll explore the core design principles, trade-offs, and technologies that made this transition both scalable and production ready.
- Preface: Basics of ACD
- ACD's original architecture
- The goal: Designing for high availability and scalability
- The challenge: Distributed state without compromising speed
- The winning approach: RedissonLocalCachedMap
- Distributed consistency: Customizing for safety
- Distributed locking: Redis vs Zookeeper
- The new design: Distributed event processing
- The hidden complexity: Making it real
- The result: Horizontally scalable, highly available ACD
Preface: Basics of ACD
ACD (Automatic Call Distribution) is the core engine responsible for assigning support cases to agents in real time.
These cases can come from multiple channels: voice, email, or social. Voice calls can be either inbound (customer-initiated) or outbound (company-initiated via campaigns or callbacks).
Regardless of channel, ACD ensures that each case reaches the most suitable and available agent.
Inbound and social assignments rely on the concept of Work Queues: pending cases are added to the relevant work queues, and agents are mapped to the queues they’re qualified to handle.
ACD runs an assignment event loop that continuously picks pending cases from these queues and tries to assign them to eligible agents based on real-time factors like:
- Skills (what the agent is trained for)
- Capacity (how many cases the agent can handle concurrently)
- Priority (urgent cases must be assigned first), among several other factors
Basic overview ☝️
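To make the assignment loop concrete, here is a deliberately simplified, self-contained sketch of priority- and skill-based assignment. The classes, fields, and queue ordering below are illustrative stand-ins, not ACD's actual data model or code.

```java
import java.util.*;
import java.util.concurrent.PriorityBlockingQueue;

// Illustrative sketch only: a toy version of the assignment event loop.
public class AssignmentLoopSketch {

    record Case(String id, int priority, Set<String> requiredSkills) {}

    static class Agent {
        final String id;
        final Set<String> skills;
        final int maxCapacity;
        int activeCases = 0;

        Agent(String id, Set<String> skills, int maxCapacity) {
            this.id = id;
            this.skills = skills;
            this.maxCapacity = maxCapacity;
        }

        // Skill and capacity checks happen atomically so an agent is never over-assigned.
        synchronized boolean tryAssign(Case c) {
            if (activeCases < maxCapacity && skills.containsAll(c.requiredSkills())) {
                activeCases++;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Work queue ordered so that urgent cases come out first.
        PriorityBlockingQueue<Case> queue = new PriorityBlockingQueue<>(
                16, Comparator.comparingInt(Case::priority).reversed());
        queue.add(new Case("case-1", 10, Set.of("billing")));
        queue.add(new Case("case-2", 5, Set.of("voice")));

        List<Agent> agents = List.of(
                new Agent("agent-1", Set.of("billing", "voice"), 2),
                new Agent("agent-2", Set.of("voice"), 1));

        // One pass of the event loop: pick pending cases and try each eligible agent.
        Case pending;
        while ((pending = queue.poll()) != null) {
            for (Agent agent : agents) {
                if (agent.tryAssign(pending)) {
                    System.out.println(pending.id() + " -> " + agent.id);
                    break;
                }
            }
        }
    }
}
```

In the real engine this loop runs continuously, reacts to events, and weighs many more factors, but the core shape is the same: pull the highest-priority pending case and atomically hand it to an eligible agent.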
To enable parallel and scalable assignment, agents are grouped into “agent groups.” These are disjoint sets of agents that can be processed independently without compromising correctness. This allows multiple assignment loops to run in parallel across the system.
All of this happens in a constantly shifting landscape: new cases arrive, agents come and go, skills and capacities change, and business rules evolve.
ACD must respond to all of it instantly, making fair, accurate, and fast assignment decisions in real time.
To enable fast and accurate assignment, ACD maintains a real-time state of the system, especially the state of each agent.
With numerous operations happening in parallel, synchronization on this state is critical to uphold not just business rules like priority assignment, but also fundamental guarantees, such as ensuring a call is assigned to only one agent, or that an agent is assigned only one call at a time, and much more.
ACD's original architecture
ACD was originally designed for speed, optimized to handle real-time routing of voice calls, where every millisecond counts. The architecture followed a traditional leader-slave model:
- A single leader pod performed all the critical operations: managed an in-memory state, took in-memory locks for synchronization, and processed events via an event loop to carry out assignment
- Multiple slave pods handled client requests, ultimately calling into the leader pod to read or update the in-memory state
- They didn’t contribute to routing but were always ready to take over leadership.
Old architecture ☝️
This design enabled blazing-fast routing under normal conditions, thanks to its in-memory state and in-memory synchronization.
However, as our scale and requirements evolved, some architectural constraints became more apparent:
- Single point of failure: The leader was the brain. If it slowed down (due to GC pauses, restarts, resource exhaustion, or hardware failures), the entire routing could stall until one of the non-leader pods took over leadership
- No horizontal scalability: Regardless of how many pods we added, the leader remained the only executor of critical logic. Scaling the leader was limited to vertical scaling, which has hard limits
- Fragile recovery: While leadership failover was technically fast, issues in the leader pod could lead to degradation, even if full outages were avoided
In short, ACD was fast under typical conditions, but with growing workloads, it became clear that we needed a more resilient approach to not put the entire platform at risk.
The goal: Designing for high availability and scalability
We set out to redesign ACD with these goals in mind:
- High availability: Avoid platform-wide downtime due to the failure of a single pod
- Horizontal scalability: Allow traffic to be seamlessly distributed across multiple pods
- Reduced blast radius: Isolate critical flows (inbound vs. outbound) to minimize cross-impact during anomalies
The idea was simple but ambitious: every pod should be capable of fully handling all operations independently.
To make this possible, we needed to move from an in-memory state managed by the leader to a distributed shared state accessible by everyone.
And since it was a shared state, we needed distributed locking for synchronization. All of this needed to work while retaining as much of the original speed as possible.
The challenge: Distributed state without compromising speed
Replacing fast in-memory state with a shared distributed state typically introduces latency, and minimizing that was crucial for our use case.
After evaluating several options, we chose Redis for its combination of speed and maturity. We integrated it using Redisson, a powerful Java client that provides distributed data structures on top of Redis.
Our explorations went through multiple iterations:
Attempt 1: Redisson live object
At first, we tried Redisson's RedissonLiveObject offering, which transparently maps Java objects onto Redis, persisting every field individually.
However, every getter and setter became a Redis call, resulting in unacceptably slow performance and poor code maintainability.
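For illustration, this is roughly what the Live Object style looks like with Redisson; the entity and its fields are made up for this sketch. The key point is that the object returned by persist() is an attached proxy, so each getter and setter turns into a Redis operation.

```java
import org.redisson.Redisson;
import org.redisson.api.RLiveObjectService;
import org.redisson.api.RedissonClient;
import org.redisson.api.annotation.REntity;
import org.redisson.api.annotation.RId;
import org.redisson.config.Config;

// Hypothetical entity used only for this sketch.
@REntity
public class AgentStateEntity {
    @RId
    private String agentId;
    private int activeCases;
    private int maxCapacity;

    public String getAgentId() { return agentId; }
    public void setAgentId(String agentId) { this.agentId = agentId; }
    public int getActiveCases() { return activeCases; }
    public void setActiveCases(int activeCases) { this.activeCases = activeCases; }
    public int getMaxCapacity() { return maxCapacity; }
    public void setMaxCapacity(int maxCapacity) { this.maxCapacity = maxCapacity; }
}

class LiveObjectExample {
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RLiveObjectService service = redisson.getLiveObjectService();
        AgentStateEntity agent = new AgentStateEntity();
        agent.setAgentId("agent-42");

        // persist() returns an attached proxy; from here on, field access goes to Redis.
        AgentStateEntity attached = service.persist(agent);
        attached.setMaxCapacity(3);                 // network round trip
        int capacity = attached.getMaxCapacity();   // another round trip
        System.out.println(capacity);

        redisson.shutdown();
    }
}
```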
Attempt 2: Redisson map
Next, we experimented with RedissonMap, a Redis-backed concurrent hashmap. This was closer to our existing usage of ConcurrentHashMap, making integration smoother.
However, ACD's read-heavy workload meant that the overhead of constant Redis calls quickly became a bottleneck.
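A minimal sketch of that second attempt, with a hypothetical map name: conceptually it behaves like a ConcurrentHashMap, but every get() is a network round trip to Redis.

```java
import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;

// Sketch: a Redis-backed map in place of the old in-memory ConcurrentHashMap.
class RedissonMapExample {
    static void demo(RedissonClient redisson) {
        RMap<String, Integer> capacityByAgent = redisson.getMap("acd:agent-capacity");
        capacityByAgent.put("agent-42", 3);                  // write goes to Redis
        Integer capacity = capacityByAgent.get("agent-42");  // read also hits Redis
        System.out.println(capacity);
    }
}
```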
The winning approach: RedissonLocalCachedMap
Finally, we landed on RedissonLocalCachedMap, which extends RedissonMap and gave us the best of both worlds:
- Reads were served from a local cache, while the source of truth stayed in Redis
- Writes updated Redis and invalidated the local caches via Redis pub/sub
This approach dramatically reduced network roundtrips for read-heavy operations while ensuring eventual consistency. But eventual consistency wasn’t enough; for certain operations, we needed stronger guarantees.
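As a rough sketch, wiring up such a map with Redisson looks something like this. The exact options API varies across Redisson versions, and the map name, cache size, and TTL here are illustrative rather than our production values.

```java
import java.util.concurrent.TimeUnit;
import org.redisson.api.LocalCachedMapOptions;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RedissonClient;

// Sketch: a Redis map with a bounded near-cache on each pod.
class LocalCachedMapExample {
    static RLocalCachedMap<String, Integer> create(RedissonClient redisson) {
        LocalCachedMapOptions<String, Integer> options =
                LocalCachedMapOptions.<String, Integer>defaults()
                        .cacheSize(10_000)                                           // bound the local cache
                        .evictionPolicy(LocalCachedMapOptions.EvictionPolicy.LRU)
                        .timeToLive(10, TimeUnit.MINUTES)
                        .syncStrategy(LocalCachedMapOptions.SyncStrategy.INVALIDATE) // invalidate peers via pub/sub on writes
                        .reconnectionStrategy(LocalCachedMapOptions.ReconnectionStrategy.CLEAR); // drop local cache after reconnect

        return redisson.getLocalCachedMap("acd:agent-capacity", options);
    }
}
```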
Distributed consistency: Customizing for safety
Distributed caching introduces a new class of problems: stale data, invalidation delays, and inconsistent reads.
To address this, we built custom safeguards like:
- Lock + force refresh: After acquiring a distributed lock, we re-fetched the critical data being used directly from Redis — bypassing local cache — to ensure correctness before committing any assignment
- Double-checked locking: We first performed operations optimistically using local state, only acquiring expensive distributed locks if the operation succeeded, followed by a second validation on the authoritative Redis state
These safeguards helped us minimize network overhead, reduce contention, and maintain correctness, all without sacrificing performance too much.
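Below is a hedged sketch of how these two safeguards can fit together. It assumes two views of the same Redis map (a local-cached one for the optimistic check and a plain RMap for the authoritative re-read) plus some distributed Lock implementation; the names and the simple capacity model are illustrative, not ACD's actual code.

```java
import java.util.Objects;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RMap;

// Sketch: double-checked locking with a force refresh from Redis.
class SafeAssignmentSketch {

    static boolean tryReserveAgent(RLocalCachedMap<String, Integer> localCached,
                                   RMap<String, Integer> authoritative,
                                   Lock agentLock,
                                   String agentId,
                                   int maxCapacity) throws InterruptedException {
        // 1. Optimistic check against the (possibly stale) local cache: cheap, no lock.
        Integer cachedLoad = localCached.get(agentId);
        if (cachedLoad != null && cachedLoad >= maxCapacity) {
            return false; // clearly full, skip the expensive path entirely
        }

        // 2. Only now acquire the distributed lock.
        if (!agentLock.tryLock(200, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            // 3. Force refresh: re-read the authoritative value straight from Redis.
            int currentLoad = Objects.requireNonNullElse(authoritative.get(agentId), 0);
            if (currentLoad >= maxCapacity) {
                return false; // another pod assigned a case in the meantime
            }
            // 4. Commit while still holding the lock so only one pod reserves this slot.
            authoritative.put(agentId, currentLoad + 1);
            return true;
        } finally {
            agentLock.unlock();
        }
    }
}
```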
Distributed locking: Redis vs Zookeeper
We implemented both Redis-based and Zookeeper-based distributed locks. While Redis locking was faster in benchmarks, we chose to stick with Zookeeper for production due to its superior reliability.
Consistency is critical in a call-routing engine: the cost of accidentally assigning two calls to the same agent is far higher than a few milliseconds of latency.
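For reference, this is roughly what acquiring a ZooKeeper-backed lock looks like with Apache Curator, one common client for this purpose; the connection string, lock path, and timeouts are illustrative, not our production configuration.

```java
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: a ZooKeeper-backed distributed lock via Curator's InterProcessMutex.
class ZookeeperLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/acd/locks/agent-42");
        if (lock.acquire(500, TimeUnit.MILLISECONDS)) {
            try {
                // critical section: make the assignment decision for this agent
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}
```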
The new design: Distributed event processing
In the new architecture:
- Every pod is now fully capable of processing client requests, maintaining state, and running the assignment event loop for any agent group
- Since work is divided by agent groups — disjoint subsets of agents that can be processed independently — these groups can now be parallelized across pods instead of just threads of the same pod
- Each agent group can be picked up by any pod via distributed locks, allowing the routing load to be evenly balanced across the fleet
This design allows for:
- Horizontal scalability: just add more pods
- No single point of failure: pod failures cause minimal or no disruption
- Reduced blast radius: issues in one group don't spill over to others
New architecture 👆
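To make the group-claiming idea concrete, here is a simplified sketch. It assumes lockFor hands back a Redisson- or ZooKeeper-backed lock for a given group; the group names and the processing step are placeholders.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.function.Function;

// Sketch: each pod tries to claim agent groups via distributed locks.
class GroupClaimingSketch {

    static void processOnce(List<String> agentGroups,
                            Function<String, Lock> lockFor) throws InterruptedException {
        for (String group : agentGroups) {
            Lock groupLock = lockFor.apply("acd:group-lock:" + group);
            // Non-blocking attempt: if another pod already owns this group, move on.
            if (!groupLock.tryLock(50, TimeUnit.MILLISECONDS)) {
                continue;
            }
            try {
                // Only one pod at a time processes a given group, so correctness is preserved.
                runAssignmentLoopFor(group);
            } finally {
                groupLock.unlock();
            }
        }
    }

    static void runAssignmentLoopFor(String group) {
        // pick pending cases from this group's queues and assign them to eligible agents
    }
}
```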
The hidden complexity: Making it real
The real challenge wasn't just building a distributed system. It was doing so while maintaining compatibility with the dozens of modules and hundreds of workflows that depend on ACD, and while preserving its ability to scale in production.
We had to:
- Guarantee correctness - feature behaviour had to be identical to the old architecture
- Minimize performance degradation - speed is essential
- Avoid creating new shared infrastructure bottlenecks - Redis and Zookeeper themselves must not become single points of failure if we acquire too many distributed locks or perform too many Redis reads/writes at scale
For correctness and performance, we conducted multiple cross-module tests, including load testing, to ensure that all the guarantees and features we offer still behaved the same, with no degradation in speed. For the infrastructure impact, we developed specialized monitoring tooling to track:
- Redis calls: At different granularities, like client level or feature level
- Distributed lock/unlock frequency
- Distributed lock holder tracking
To achieve this, we went as far as overriding the Redisson library’s internal methods using Java reflection to expose detailed telemetry of Redis calls.
This deep observability helped us uncover and fix several anomalies. Some were in our own code, such as a bug that caused redundant agent-group reprocessing: it had gone unnoticed before, but generated a large number of distributed locks in the new architecture.
Others we tracked down to the Redisson library itself, like an unoptimized implementation of the computeIfAbsent() method in RedissonLocalCachedMap, which we addressed by contributing a fix back to Redisson's open-source codebase (PR #6089).
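As a much simpler illustration of the kind of per-feature call accounting we were after (and not the reflection-based instrumentation described above), a counting decorator around a map view might look like this; the feature labels are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import org.redisson.api.RMap;

// Sketch: count Redis map calls per feature so hotspots show up in dashboards.
class CountingMapView<K, V> {
    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    private final RMap<K, V> delegate;
    private final String feature; // e.g. "inbound-assignment" (hypothetical label)

    CountingMapView(RMap<K, V> delegate, String feature) {
        this.delegate = delegate;
        this.feature = feature;
    }

    V get(K key) {
        CALLS.computeIfAbsent(feature + ".get", f -> new LongAdder()).increment();
        return delegate.get(key);
    }

    V put(K key, V value) {
        CALLS.computeIfAbsent(feature + ".put", f -> new LongAdder()).increment();
        return delegate.put(key, value);
    }

    static Map<String, LongAdder> snapshot() {
        return Map.copyOf(CALLS);
    }
}
```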
The result: Horizontally scalable, highly available ACD
After months of development, testing, and iterative tuning, the new ACD architecture has been rolled out in almost all environments across multiple regions, and we have seen promising results so far.
We’ve observed complete stability in the past quarter since the rollout. This was a big technical challenge that required tons of whiteboarding even for the smallest of optimizations.
Today, the new ACD is live and thriving: handling calls, routing cases, and enabling our clients worldwide to serve their customers better.
And this is just the beginning: we are actively working to increase ACD's throughput even further!