
From Triple Hop to Single Stop: Are End-to-End Voice Models the Future of Customer Calls?
“Hello, valued customer… your call is very important to us… please stay on the line…”
If you’ve ever been serenaded by this robotic lullaby, you already know the pain points of today’s voice bots: laggy speech-to-text (ASR), a thinking pause while the LLM reasons, another pause while text-to-speech (TTS) renders the audio… and only then do you finally hear a response.
Today we’ll look at how OpenAI’s brand-new real-time Voice API tries to squash all that latency, and whether it delivers when we put it through a battery of tests.
How Voice Bots Work Today
Until now, most voice bots worked like a relay race with three runners.
- First, an Automatic Speech Recognition (ASR) engine listens to what you say and turns the sound waves into text.
- Next, a large language model (LLM) reads that text, thinks about an answer, and spits out new text.
- Finally, a Text-to-Speech (TTS) system takes the LLM’s words and converts them back into audio so you can hear the response.
Every hand-off adds an extra network trip, extra compute time, and some buffering; the sketch below traces one such turn.
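To make the latency stack-up concrete, here is a minimal sketch of one turn through the classic cascade. The model choices (whisper-1 for ASR, gpt-4.1-mini for the LLM, tts-1 for speech) are illustrative stand-ins, not a prescription; any ASR/LLM/TTS trio has the same shape.

```python
# One turn through the three-stage cascade. Each stage blocks on the
# previous one, so the caller's perceived latency is roughly the sum
# of all three round trips.
from openai import OpenAI

client = OpenAI()

def cascade_turn(audio_path: str) -> bytes:
    # 1) ASR: audio in, text out -- first network round trip.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2) LLM: text in, text out -- second round trip, plus "thinking" time.
    reply_text = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": transcript}],
    ).choices[0].message.content

    # 3) TTS: text in, audio out -- third round trip before the caller
    #    hears a single syllable.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply_text
    )
    return speech.content
```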
Realtime End-to-End Voice Bots
Voice-to-voice models bypass this traditional architecture: a single model takes audio in and produces audio out. This offers several advantages over the pipeline described above:
| Metric | Traditional Voice Bot | Realtime End-to-End Voice Bot |
|---|---|---|
| Latency | Three separate systems run in sequence, so latency is comparatively high | A single neural model operates directly on audio, so latency is much lower |
| Streaming | The LLM is naturally streamable thanks to its auto-regressive nature, but ASR and TTS are usually kept non-streaming for higher accuracy | ASR and TTS are subsumed into the neural model itself, so the whole pipeline is directly streamable |
| Development Effort | Quite high: infrastructure maintenance, model development, cost tracking, etc. must be handled separately for each component | Comparatively simpler to maintain |
| Failure Points | Three separate points of failure, but more control over the inference pipeline since each component can be tuned independently | A single point of failure, and we lose visibility into intermediate states such as the transcription, which are hidden inside the model |
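For contrast, here is a minimal sketch of a single voice turn against the Realtime API over WebSocket. The endpoint, headers, and event names (input_audio_buffer.append, response.audio.delta, and so on) follow the beta interface we tested against and may differ in newer releases, so treat them as assumptions and check the current reference before relying on them.

```python
# One voice turn against the realtime endpoint: audio in, audio out,
# no separate ASR or TTS hop. Event names follow the beta docs.
import base64, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def one_voice_turn(pcm16_audio: bytes) -> bytes:
    # Note: websockets < 14 names this kwarg `extra_headers`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Stream the caller's audio into the input buffer, then ask
        # the model to respond.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        reply = bytearray()
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                # Audio arrives incrementally, so playback can start
                # before the model has finished speaking.
                reply.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
        return bytes(reply)
```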
Benchmarking End-To-End Voice Models
As one can see, there are multiple advantages to switching to an end-to-end voice model, but the big question is: are these models actually as good as traditional systems in terms of accuracy and quality?
Hence we benchmarked these agents along two dimensions:
Instruction Following
- The ability of a model to follow the prompt/instructions given to it throughout a conversation.
Agentic Capabilities
- The ability of a model to complete a task that involves calling multiple tools, e.g., a refund agent correctly processing a refund within a given environment.
We use the Tau Bench dataset (https://github.com/sierra-research/tau-bench) to benchmark both. The methodology and results are detailed in the sections below.
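Before diving into each dimension, here is a schematic of the evaluation loop we run on top of Tau Bench. It is not the actual tau-bench runner; `UserSim`, `Agent`, and the `env` object are hypothetical stand-ins for the components described below, but the control flow (alternating turns until the simulated user says STOP, then scoring the environment) is the one both benchmarks share.

```python
# Schematic of one benchmarked conversation: a user simulator and a
# tool-calling agent alternate turns until the simulator signals the
# goal is met, then the environment checks whether the right tool
# calls actually happened. All classes here are hypothetical stand-ins.
def run_task(env, user_sim, agent, max_turns: int = 30) -> bool:
    agent.reset(tools=env.tools)
    user_msg = user_sim.first_message(env.task_instruction)
    for _ in range(max_turns):
        agent_msg = agent.respond(user_msg)     # may call env tools
        user_msg = user_sim.respond(agent_msg)  # stays in persona
        if "STOP" in user_msg.upper():          # goal satisfied
            break
    return env.evaluate()                       # pass/fail for this task

def accuracy(tasks, make_env, user_sim, agent) -> float:
    passed = sum(run_task(make_env(t), user_sim, agent) for t in tasks)
    return passed / len(tasks)
```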
Instruction Following Accuracy
Methodology
To benchmark the instruction-following skill of OpenAI’s new gpt-realtime model, we turned it into a “mock customer” and put it under a microscope while keeping every other variable stable.
System prompt (fixed persona)
- We fed gpt-realtime a detailed role prompt that told it exactly how to behave: a retail shopper with issues like exchanges, refunds, or damaged goods.
- The rules also forced it to speak only one short line at a time, reveal facts gradually, and never break character.
- This prompt never changed during the entire study, so any deviation we observed would be a failure of instruction loyalty.
Voice-only output
- gpt-realtime produced its replies as audio.
- We immediately transcribed that audio to text so the next model could read it.
- No extra logic, no extra guardrails—just a straight sample of the model’s raw obedience to its instructions.
The “agent” side (controlled baseline)
- Incoming transcripts were handed to a pre-vetted support agent built on gpt-4.1-mini.
- That agent had access to the usual Tau Bench tool set (order lookup, refund API, etc.) and was already known to solve retail tasks reliably.
- Because the agent is stable and well-tested, any wobble in task success can be traced back to the user simulation, i.e., gpt-realtime.
Scoring logic
- Tau Bench judged whether the agent completed or failed each customer-service task.
- Since the only “moving part” was our user simulation, the pass/fail rate becomes a direct proxy for how faithfully gpt-realtime followed its persona instructions and produced coherent, actionable utterances.
In short, we isolated the intelligence and prompt-adherence of gpt-realtime by making it the sole variable in a controlled retail scenario. The assistant (gpt-4.1-mini) and its tool stack were neutral ground; all we measured was how consistently the realtime model could stick to its scripted character and drive the conversation the way the prompt demanded.
The model was provided with the following instruction:
You are a USER SIMULATION interacting with a customer support agent.{instruction_display}
Important Rules:
- Just generate one voice line at a time to simulate the user's message. ALWAYS RESPOND IN ENGLISH LANGUAGE.
- Do not give away all the instructions at once. Only provide the information that is necessary for the current step.
- Do not hallucinate information that is not provided in the instruction. For example, if the agent asks for the order id but it is not mentioned in the instruction, do not make up an order id, just say you do not remember or have it.
- Do not repeat the exact instructions in the conversation. Instead, use your own words to convey the same information.
- Try to make the conversation as natural as possible and stick to the personalities in the instruction.
- IF THE INSTRUCTION GOAL IS SATISFIED, SAY 'STOP' AS A VOICE MESSAGE. BUT ONLY IF THE INSTRUCTION GOAL IS SATISFIED!!
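For completeness, this is roughly how the user-simulation side was wired: the realtime model speaks the customer's line as audio, we transcribe it, and the transcript is what the gpt-4.1-mini agent sees. `realtime_customer_reply` is a hypothetical helper standing in for a recorded realtime turn (see the WebSocket sketch earlier); whisper-1 is an illustrative choice of transcription model.

```python
# Turn the realtime model's spoken reply into the text the support
# agent reads. No extra logic or guardrails in between.
from openai import OpenAI

client = OpenAI()

def transcribe(wav_path: str) -> str:
    # Straight ASR pass, so the agent sees exactly what the simulated
    # customer "said", warts and all.
    with open(wav_path, "rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

def user_turn(persona_prompt: str, history: list[dict]) -> str:
    # Hypothetical helper: plays persona prompt + history into a
    # gpt-realtime session and returns the path of the recorded reply.
    wav_path = realtime_customer_reply(persona_prompt, history)
    return transcribe(wav_path)
```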
Results
| User Model | Assistant Model | Assistant Accuracy (%) |
|---|---|---|
| gpt-4.1-mini | gpt-4.1-mini | 46.0 |
| gpt-realtime | gpt-4.1-mini | 63.0 |
| gpt-4.1 | gpt-4.1-mini | 72.0 |
When the “customer” is simulated by the tiny gpt-4.1-mini, our well-vetted help-desk agent (also gpt-4.1-mini) completes barely 46 % of the retail tasks.
Swap that same customer persona to gpt-realtime and accuracy jumps 17 points to 63 %. In other words, the realtime model’s utterances are markedly clearer and more instruction-faithful than those produced by the smaller text model, making life much easier for the agent.
The gold standard is still a full-size gpt-4.1 user at 72 %, but gpt-realtime lands solidly in the middle. So even though it is better than the mini-series model, it still trails the cutting-edge models used to deliver top performance on customer calls!
Agentic Capabilities
Methodology
After measuring how well gpt-realtime can follow instructions as a customer, we flipped the script and asked: How good is it when it’s the agent in charge—querying tools, giving policy-compliant answers, and speaking back in natural audio?
To find out, we ran TauBench’s retail tasks under three different input conditions. Each condition isolates a different stress-point: reasoning depth, speech-handling, and robustness to noisy audio.
Mode A — Text ➜ Voice (Clean Baseline)
- User Model: gpt-4.1 (text-only).
- Conversation Flow: The user writes text; we feed this directly to gpt-realtime; it can speak back and invoke tools.
- Why Run It: Gives an “apples-to-apples” baseline against existing text-first agents—any accuracy gap here is pure reasoning, not ASR noise.
Mode B — Voice ➜ Voice (Full Realtime)
- User Model: a second instance of gpt-realtime, forced to stay in character as the shopper.
- Conversation Flow: Audio streams both ways; the assistant hears raw speech, reasons, then talks back.
- Why Run It: Mimics the real customer-service scenario callers will experience—but introduces two new error sources: ASR drift in the assistant and imperfect language in the user simulation.
Mode C — Text + TTS ➜ Voice (Clean Audio, Strong User)
- User Model: gpt-4.1 generates text; we pass that text through gpt-4o-mini-TTS to create high-quality audio.
- Conversation Flow: Assistant hears pristine, well-pronounced speech, then answers in voice.
- Why Run It: Removes “mumbly customer” uncertainty while still exercising the assistant’s speech-recognition front end. If performance rebounds relative to Mode B, the blame lies mainly with low-fidelity user speech, not the assistant’s reasoning.
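As a sketch of Mode C's front half: the gpt-4.1 user writes its turn as text, we render it to clean audio with gpt-4o-mini-tts, and that audio is what the gpt-realtime assistant hears. Feeding the audio into the realtime session is elided here (see the WebSocket sketch earlier); the voice and the exact prompt plumbing are illustrative assumptions.

```python
# Mode C: strong text-only user model + high-quality TTS, so the
# assistant's ASR front end hears pristine speech.
from openai import OpenAI

client = OpenAI()

def mode_c_user_audio(persona_prompt: str, history: list[dict]) -> bytes:
    # 1) gpt-4.1 writes the customer's next line as text.
    user_text = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "system", "content": persona_prompt}, *history],
    ).choices[0].message.content

    # 2) Render it as well-pronounced speech; any ASR errors left are
    #    now attributable to the assistant, not to a "mumbly" caller.
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts", voice="alloy", input=user_text
    )
    return speech.content
```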
Interpreting the Scores
- If gpt-realtime matches its text peers in Mode A ➜ reasoning depth is on par.
- Drop-offs in Mode B pinpoint how compounded ASR+TTS noise hurts the system end-to-end.
- Recovery in Mode C shows whether the assistant itself or the noisy user is the bottleneck.
This three-pronged approach lets us tease apart “brainpower,” “ears,” and “mouth” inside a single voice-first agent model!!
Results
| User Model | Assistant Model | Input Modality | Average Turns | Agent Accuracy |
|---|---|---|---|---|
| gpt-4.1 | gpt-4.1-mini | Text | 12 | 0.70 |
| gpt-4.1 | gpt-4.1 | Text | 14 | 0.81 |
| gpt-4.1 | gpt-realtime | Text | 16 | 0.70 |
| gpt-realtime | gpt-realtime | Voice | 20 | 0.50 |
| gpt-4.1 + TTS | gpt-realtime | Text + TTS | 17 | 0.65 |
What jumps out of the numbers is that gpt-realtime’s “brain” is basically on par with gpt-4.1-mini when the conversation comes in as clean text, but its “ears” still have some growing up to do.
In the text-only rows, gpt-realtime posts a 0.70 task-completion score—identical to gpt-4.1-mini and only 0.11 below full-size gpt-4.1—so the core reasoning stack clearly isn’t the bottleneck. The trouble appears when we let two realtime instances talk to each other in raw audio: accuracy tumbles to 0.50 and the dialogues stretch to 20 turns, a hint that ASR drift and less-precise phrasing force extra back-and-forth to get the job done.
As soon as we swap the “mumbly” realtime user for a crisp gpt-4.1 script rendered through high-quality TTS, the score rebounds to 0.65 and the conversation shortens, telling us most of the penalty came from noisy input, not from the assistant’s planning or tool-use logic.
In short, gpt-realtime can think almost as well as its text cousins, but it still mishears just enough to slow itself down when both sides are speaking live.
Conclusion
OpenAI’s gpt-realtime is undeniably exciting—shaving full seconds off the call-center dance and wrapping ASR, reasoning, and TTS into a single, streaming endpoint. In our latency tests it already feels “snappy-human,” and when it reads pristine text it solves customer-service tasks almost as well as a lite GPT-4. But the moment both sides start talking in messy, true-to-life audio, cracks appear: accuracy falls, conversations drag on, and the model occasionally mishears critical details.
What the numbers tell us is simple: the core reasoning engine is there, but the acoustic front end still trips on colloquialisms and imperfect mic quality. Until those ears get sharper—and the model learns to compensate for its own transcription wobble—gpt-realtime will shine in tightly controlled settings (IVR triage, scripted voice bots, embedded gadgets) yet remain a doubtful bet for high-stakes, free-form conversations.