
Re-engineering Sprinklr's ACD: Building High Availability Into Our Call-Routing Engine
At the core of Sprinklr’s Service platform lies the Automatic Call Distribution (ACD) engine: the system responsible for routing every incoming and outgoing voice call and social case to the right customer support agent.
It is the heart of our customer service platform, handling millions of interactions across thousands of enterprise clients every day.
Originally built as a single-leader system, ACD relied on in-memory state and synchronous logic to achieve high-speed, real-time call routing.
This design delivered excellent performance, but it came with natural constraints around fault tolerance and failover.
As Sprinklr’s platform grew and the bar for availability rose, we recognized the need to re-architect ACD into a distributed, highly available system without sacrificing the low-latency performance it was known for.
In this post, we’ll walk through how we evolved ACD’s architecture to meet higher availability standards, support zero-downtime resilience, and enable large-scale concurrency across multiple pods.
Along the way, we’ll explore the core design principles, trade-offs, and technologies that made this transition both scalable and production ready.
- Preface: Basics of ACD
- ACD's original architecture
- The goal: Designing for high availability and scalability
- The challenge: Distributed state without compromising speed
- The winning approach: RedissonLocalCachedMap
- Distributed consistency: Customizing for safety
- Distributed locking: Redis vs Zookeeper
- The new design: Distributed event processing
- The hidden complexity: Making it real
- The result: Horizontally scalable, highly available ACD
Preface: Basics of ACD
ACD (Automatic Call Distribution) is the core engine responsible for assigning support cases to agents in real time.
These cases can come from multiple channels: voice, email, or social. Voice calls can be either inbound (customer-initiated) or outbound (company-initiated via campaigns or callbacks).
Regardless of channel, ACD ensures that each case reaches the most suitable and available agent.
Inbound and social assignments rely on the concept of Work Queues: pending cases are added to the relevant work queues, and agents are mapped to the queues they’re qualified to handle.
ACD runs an assignment event loop that continuously picks pending cases from these queues and tries to assign them to eligible agents based on real-time factors like:
- Skills (what the agent is trained for)
- Capacity (how many cases the agent can handle concurrently)
- Priority (urgent cases must be assigned first), among several other factors
Basic overview ☝️
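To make the assignment loop concrete, here is a deliberately simplified, self-contained sketch of priority- and skill-based assignment. The classes, fields, and queue ordering below are illustrative stand-ins, not ACD's actual data model or code.

```java
import java.util.*;
import java.util.concurrent.PriorityBlockingQueue;

// Illustrative sketch only: a toy version of the assignment event loop.
public class AssignmentLoopSketch {

    record Case(String id, int priority, Set<String> requiredSkills) {}

    static class Agent {
        final String id;
        final Set<String> skills;
        final int maxCapacity;
        int activeCases = 0;

        Agent(String id, Set<String> skills, int maxCapacity) {
            this.id = id;
            this.skills = skills;
            this.maxCapacity = maxCapacity;
        }

        // Skill and capacity checks happen atomically so an agent is never over-assigned.
        synchronized boolean tryAssign(Case c) {
            if (activeCases < maxCapacity && skills.containsAll(c.requiredSkills())) {
                activeCases++;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Work queue ordered so that urgent cases come out first.
        PriorityBlockingQueue<Case> queue = new PriorityBlockingQueue<>(
                16, Comparator.comparingInt(Case::priority).reversed());
        queue.add(new Case("case-1", 10, Set.of("billing")));
        queue.add(new Case("case-2", 5, Set.of("voice")));

        List<Agent> agents = List.of(
                new Agent("agent-1", Set.of("billing", "voice"), 2),
                new Agent("agent-2", Set.of("voice"), 1));

        // One pass of the event loop: pick pending cases and try each eligible agent.
        Case pending;
        while ((pending = queue.poll()) != null) {
            for (Agent agent : agents) {
                if (agent.tryAssign(pending)) {
                    System.out.println(pending.id() + " -> " + agent.id);
                    break;
                }
            }
        }
    }
}
```

In the real engine this loop runs continuously, reacts to events, and weighs many more factors, but the core shape is the same: pull the highest-priority pending case and atomically hand it to an eligible agent.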
To enable parallel and scalable assignment, agents are grouped into “agent groups.” These are disjoint sets of agents that can be processed independently without compromising correctness. This allows multiple assignment loops to run in parallel across the system.
All of this happens in a constantly shifting landscape: new cases arrive, agents come and go, skills and capacities change, and business rules evolve.
ACD must respond to all of it instantly, making fair, accurate, and fast assignment decisions in real time.
To enable fast and accurate assignment, ACD maintains a real-time state of the system, especially the state of each agent.
With numerous operations happening in parallel, synchronization on this state is critical to uphold not just business rules like priority assignment, but also fundamental guarantees, such as ensuring a call is assigned to only one agent, or that an agent is assigned only one call at a time, and much more.
ACD's original architecture
ACD was originally designed for speed, optimized to handle real-time routing of voice calls, where every millisecond counts. The architecture followed a traditional leader-slave model:
- A single leader pod performed all the critical operations: managed an in-memory state, took in-memory locks for synchronization, and processed events via an event loop to carry out assignment
- Multiple slave pods handled client requests, ultimately calling into the leader pod to read or update the in-memory state
- They didn’t contribute to routing but were always ready to take over leadership.
Old architecture ☝️
This design enabled blazing-fast routing under normal conditions, thanks to its in-memory state and in-memory synchronization.
However, as our scale and requirements evolved, some architectural constraints became more apparent:
- Single point of failure: The leader was the brain. If it slowed down (due to GC pauses, restarts, resource exhaustion, or hardware failures), the entire routing could stall until one of the non-leader pods took over leadership
- No horizontal scalability: Regardless of how many pods we added, the leader remained the only executor of critical logic. Scaling the leader was limited to vertical scaling, which has hard limits
- Fragile recovery: While leadership failover was technically fast, issues in the leader pod could lead to degradation, even if full outages were avoided
In short, ACD was fast under typical conditions, but with growing workloads, it became clear that we needed a more resilient approach to not put the entire platform at risk.
The goal: Designing for high availability and scalability
We set out to redesign ACD with these goals in mind:
- High availability: Avoid platform-wide downtime due to the failure of a single pod
- Horizontal scalability: Allow traffic to be seamlessly distributed across multiple pods
- Reduced blast radius: Isolate critical flows (inbound vs. outbound) to minimize cross-impact during anomalies
The idea was simple but ambitious: every pod should be capable of fully handling all operations independently.
To make this possible, we needed to move from an in-memory state managed by the leader to a distributed shared state accessible by everyone.
And since it was a shared state, we needed distributed locking for synchronization. All of this needed to work while retaining as much of the original speed as possible.
The challenge: Distributed state without compromising speed
Replacing fast in-memory state with a shared distributed state typically introduces latency, and minimizing that was crucial for our use case.
After evaluating several options, we chose Redis for its combination of speed and maturity. We integrated it using Redisson, a powerful Java client that provides distributed data structures on top of Redis.
Our explorations went through multiple iterations:
Attempt 1: Redisson live object
At first, we tried Redisson's RedissonLiveObject offering, which transparently maps Java objects onto Redis, persisting every field individually.
However, every getter and setter became a Redis call, resulting in unacceptably slow performance and poor code maintainability.
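For illustration, this is roughly what the Live Object style looks like with Redisson; the entity and its fields are made up for this sketch. The key point is that the object returned by persist() is an attached proxy, so each getter and setter turns into a Redis operation.

```java
import org.redisson.Redisson;
import org.redisson.api.RLiveObjectService;
import org.redisson.api.RedissonClient;
import org.redisson.api.annotation.REntity;
import org.redisson.api.annotation.RId;
import org.redisson.config.Config;

// Hypothetical entity used only for this sketch.
@REntity
public class AgentStateEntity {
    @RId
    private String agentId;
    private int activeCases;
    private int maxCapacity;

    public String getAgentId() { return agentId; }
    public void setAgentId(String agentId) { this.agentId = agentId; }
    public int getActiveCases() { return activeCases; }
    public void setActiveCases(int activeCases) { this.activeCases = activeCases; }
    public int getMaxCapacity() { return maxCapacity; }
    public void setMaxCapacity(int maxCapacity) { this.maxCapacity = maxCapacity; }
}

class LiveObjectExample {
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RLiveObjectService service = redisson.getLiveObjectService();
        AgentStateEntity agent = new AgentStateEntity();
        agent.setAgentId("agent-42");

        // persist() returns an attached proxy; from here on, field access goes to Redis.
        AgentStateEntity attached = service.persist(agent);
        attached.setMaxCapacity(3);                 // network round trip
        int capacity = attached.getMaxCapacity();   // another round trip
        System.out.println(capacity);

        redisson.shutdown();
    }
}
```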
Attempt 2: Redisson map
Next, we experimented with RedissonMap, a Redis-backed concurrent hashmap. This was closer to our existing usage of ConcurrentHashMap, making integration smoother.
However, ACD's read-heavy workload meant that the overhead of constant Redis calls quickly became a bottleneck.
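A minimal sketch of that second attempt, with a hypothetical map name: conceptually it behaves like a ConcurrentHashMap, but every get() is a network round trip to Redis.

```java
import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;

// Sketch: a Redis-backed map in place of the old in-memory ConcurrentHashMap.
class RedissonMapExample {
    static void demo(RedissonClient redisson) {
        RMap<String, Integer> capacityByAgent = redisson.getMap("acd:agent-capacity");
        capacityByAgent.put("agent-42", 3);                  // write goes to Redis
        Integer capacity = capacityByAgent.get("agent-42");  // read also hits Redis
        System.out.println(capacity);
    }
}
```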
The winning approach: RedissonLocalCachedMap
Finally, we landed on RedissonLocalCachedMap, which extends RedissonMap and gave us the best of both worlds:
- Reads were served from a local cache, while the source of truth stayed in Redis
- Writes updated Redis and invalidated the local caches via Redis pub/sub
This approach dramatically reduced network roundtrips for read-heavy operations while ensuring eventual consistency. But eventual consistency wasn’t enough; for certain operations, we needed stronger guarantees.
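As a rough sketch, wiring up such a map with Redisson looks something like this. The exact options API varies across Redisson versions, and the map name, cache size, and TTL here are illustrative rather than our production values.

```java
import java.util.concurrent.TimeUnit;
import org.redisson.api.LocalCachedMapOptions;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RedissonClient;

// Sketch: a Redis map with a bounded near-cache on each pod.
class LocalCachedMapExample {
    static RLocalCachedMap<String, Integer> create(RedissonClient redisson) {
        LocalCachedMapOptions<String, Integer> options =
                LocalCachedMapOptions.<String, Integer>defaults()
                        .cacheSize(10_000)                                           // bound the local cache
                        .evictionPolicy(LocalCachedMapOptions.EvictionPolicy.LRU)
                        .timeToLive(10, TimeUnit.MINUTES)
                        .syncStrategy(LocalCachedMapOptions.SyncStrategy.INVALIDATE) // invalidate peers via pub/sub on writes
                        .reconnectionStrategy(LocalCachedMapOptions.ReconnectionStrategy.CLEAR); // drop local cache after reconnect

        return redisson.getLocalCachedMap("acd:agent-capacity", options);
    }
}
```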
Distributed consistency: Customizing for safety
Distributed caching introduces a new class of problems: stale data, invalidation delays, and inconsistent reads.
To address this, we built custom safeguards like:
- Lock + force refresh: After acquiring a distributed lock, we re-fetched the critical data being used directly from Redis — bypassing local cache — to ensure correctness before committing any assignment
- Double-checked locking: We first performed operations optimistically using local state, only acquiring expensive distributed locks if the operation succeeded, followed by a second validation on the authoritative Redis state
These safeguards helped us minimize network overhead, reduce contention, and maintain correctness, all without sacrificing performance too much.
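Below is a hedged sketch of how these two safeguards can fit together. It assumes two views of the same Redis map (a local-cached one for the optimistic check and a plain RMap for the authoritative re-read) plus some distributed Lock implementation; the names and the simple capacity model are illustrative, not ACD's actual code.

```java
import java.util.Objects;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RMap;

// Sketch: double-checked locking with a force refresh from Redis.
class SafeAssignmentSketch {

    static boolean tryReserveAgent(RLocalCachedMap<String, Integer> localCached,
                                   RMap<String, Integer> authoritative,
                                   Lock agentLock,
                                   String agentId,
                                   int maxCapacity) throws InterruptedException {
        // 1. Optimistic check against the (possibly stale) local cache: cheap, no lock.
        Integer cachedLoad = localCached.get(agentId);
        if (cachedLoad != null && cachedLoad >= maxCapacity) {
            return false; // clearly full, skip the expensive path entirely
        }

        // 2. Only now acquire the distributed lock.
        if (!agentLock.tryLock(200, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            // 3. Force refresh: re-read the authoritative value straight from Redis.
            int currentLoad = Objects.requireNonNullElse(authoritative.get(agentId), 0);
            if (currentLoad >= maxCapacity) {
                return false; // another pod assigned a case in the meantime
            }
            // 4. Commit while still holding the lock so only one pod reserves this slot.
            authoritative.put(agentId, currentLoad + 1);
            return true;
        } finally {
            agentLock.unlock();
        }
    }
}
```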
Distributed locking: Redis vs Zookeeper
We implemented both Redis-based and Zookeeper-based distributed locks. While Redis locking was faster in benchmarks, we chose to stick with Zookeeper for production due to its superior reliability.
Consistency is critical in a call-routing engine: the cost of accidentally assigning two calls to the same agent is far higher than a few milliseconds of latency.
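For reference, this is roughly what acquiring a ZooKeeper-backed lock looks like with Apache Curator, one common client for this purpose; the connection string, lock path, and timeouts are illustrative, not our production configuration.

```java
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: a ZooKeeper-backed distributed lock via Curator's InterProcessMutex.
class ZookeeperLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/acd/locks/agent-42");
        if (lock.acquire(500, TimeUnit.MILLISECONDS)) {
            try {
                // critical section: make the assignment decision for this agent
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}
```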
The new design: Distributed event processing
In the new architecture:
- Every pod is now fully capable of processing client requests, maintaining state, and running the assignment event loop for any agent group
- Since work is divided by agent groups — disjoint subsets of agents that can be processed independently — these groups can now be parallelized across pods instead of just threads of the same pod
- Each agent group can be picked up by any pod via distributed locks, allowing the routing load to be evenly balanced across the fleet
This design allows for:
- Horizontal scalability: just add more pods
- No single point of failure: pod failures cause minimal or no disruption
- Reduced blast radius: issues in one group don't spill over to others
New architecture 👆
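To make the group-claiming idea concrete, here is a simplified sketch. It assumes lockFor hands back a Redisson- or ZooKeeper-backed lock for a given group; the group names and the processing step are placeholders.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.function.Function;

// Sketch: each pod tries to claim agent groups via distributed locks.
class GroupClaimingSketch {

    static void processOnce(List<String> agentGroups,
                            Function<String, Lock> lockFor) throws InterruptedException {
        for (String group : agentGroups) {
            Lock groupLock = lockFor.apply("acd:group-lock:" + group);
            // Non-blocking attempt: if another pod already owns this group, move on.
            if (!groupLock.tryLock(50, TimeUnit.MILLISECONDS)) {
                continue;
            }
            try {
                // Only one pod at a time processes a given group, so correctness is preserved.
                runAssignmentLoopFor(group);
            } finally {
                groupLock.unlock();
            }
        }
    }

    static void runAssignmentLoopFor(String group) {
        // pick pending cases from this group's queues and assign them to eligible agents
    }
}
```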
The hidden complexity: Making it real
The real challenge wasn't just building a distributed system. It was doing so while maintaining compatibility with the dozens of modules and hundreds of workflows that depend on ACD, and while preserving its ability to scale in production.
We had to:
- Guarantee correctness - feature behaviour had to be identical to the old architecture
- Minimize performance degradation - speed is essential
- Avoid creating new shared infrastructure bottlenecks - Redis and Zookeeper themselves must not become single points of failure if we acquire too many distributed locks or perform too many Redis reads/writes at scale
For correctness and performance, we conducted multiple cross-module tests, including load testing, to ensure that all the guarantees and features we offer still behaved the same, with no degradation in speed. For the infrastructure impact, we developed specialized monitoring tooling to track:
- Redis calls: At different granularities, like client level or feature level
- Distributed lock/unlock frequency
- Distributed lock holder tracking
To achieve this, we went as far as overriding the Redisson library’s internal methods using Java reflection to expose detailed telemetry of Redis calls.
This deep observability helped us uncover and fix several anomalies. Some were in our own code, such as a bug that caused redundant agent-group reprocessing: it had gone unnoticed before, but generated a large number of distributed locks in the new architecture.
Others we tracked down to the Redisson library itself, like an unoptimized implementation of the computeIfAbsent() method in RedissonLocalCachedMap, which we addressed by contributing a fix back to Redisson's open-source codebase (PR #6089).
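As a much simpler illustration of the kind of per-feature call accounting we were after (and not the reflection-based instrumentation described above), a counting decorator around a map view might look like this; the feature labels are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import org.redisson.api.RMap;

// Sketch: count Redis map calls per feature so hotspots show up in dashboards.
class CountingMapView<K, V> {
    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    private final RMap<K, V> delegate;
    private final String feature; // e.g. "inbound-assignment" (hypothetical label)

    CountingMapView(RMap<K, V> delegate, String feature) {
        this.delegate = delegate;
        this.feature = feature;
    }

    V get(K key) {
        CALLS.computeIfAbsent(feature + ".get", f -> new LongAdder()).increment();
        return delegate.get(key);
    }

    V put(K key, V value) {
        CALLS.computeIfAbsent(feature + ".put", f -> new LongAdder()).increment();
        return delegate.put(key, value);
    }

    static Map<String, LongAdder> snapshot() {
        return Map.copyOf(CALLS);
    }
}
```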
The result: Horizontally scalable, highly available ACD
After months of development, testing, and iterative tuning, the new ACD architecture has been rolled out in almost all environments across multiple regions, and we have seen promising results so far.
We’ve observed complete stability in the past quarter since the rollout. This was a big technical challenge that required tons of whiteboarding even for the smallest of optimizations.
Today, the new ACD is live and thriving: handling calls, routing cases, and enabling our clients worldwide to serve their customers better.
And this is just the beginning: we are actively working to increase ACD's throughput even further!