Blog · AI Engineering

The architecture of voice agents

A telephone voice assistant that understands callers, drives back-office procedures, and does not sound like a robot from 2008 — the building blocks are all available today. What matters is the architecture: how speech, models, and back-end systems combine into a pipeline with sub-second response time. A fifteen-page deep dive — pipeline anatomy, VAPI as orchestrator, ElevenLabs for the voice, Anthropic and OpenAI for intelligence, telephony over SIP, EU data protection, and the bridge to government back-office procedures. The third part of our series for regulated industries, with an explicit reference to our own product CityAI.

Positioning — what a voice agent can do today

A voice agent is a programmable speech assistant that takes or initiates phone calls, participates in the conversation, and operates back-end systems — booking appointments, answering enquiries, providing information, opening tickets. What was a research demo three years ago is today a productive piece of software, built in a matter of weeks, with a response quality that on roughly 70 percent of calls is indistinguishable from a trained service representative.

At Tenvias, this is not a theoretical exercise. Our own product CityAI is a voice agent for municipal citizen telephony — a telephone service channel that handles calls about ID cards, registrations, appointments, general information, or public-transit questions, clarifies them, and either answers directly or routes them to a clerk. The content of this article is drawn from operating that platform day in, day out.

Three factors enabled the jump from demo to production fitness. First: the latency of modern streaming providers has dropped to a level that allows natural conversation — under a second from the end of a question to the first syllable of the answer. Second: modern LLMs treat function calling as a first-class feature, making them usable as the bridge between free speech and structured back-end APIs. Third: specialised TTS providers like ElevenLabs deliver synthetic voices that, in blind listening tests, are no longer reliably distinguishable from recordings — gone are the days of monotone reading-aloud acoustics from the speech portals of the early 2010s.

Here is what separates this article from the marketing variant. We show the architecture layer by layer, name concrete providers with their strengths and weaknesses, print code samples you can adopt, and address the sensitive topics openly: data protection, latency limits, the question of "direct LLM provider versus aggregator", and the integration with municipal back-office procedures using FIM keys. Anyone who wants to build their own voice service hotline will have the blueprint in hand by the end.

The pipeline — STT, LLM, TTS

Every voice agent — regardless of provider or platform — follows the same three-step architecture. Speech becomes text, text becomes response, response becomes speech again. What differentiates providers is not the concept but the depth of implementation.

STT — speech to text

The first stage turns incoming audio into text. Classically, this is a neural network that consumes audio frames as mel spectrograms and emits tokens sequentially. Modern providers such as Deepgram (Nova-3), AssemblyAI (Universal-2), Azure Speech, and Google Cloud STT deliver streaming transcripts: while the caller is still speaking, partial text fragments come back and are refined with every new word. OpenAI Whisper, by contrast, works in batch mode through its standard API (see the STT selection section). Streaming is decisive for latency: the next pipeline stage can start on the transcript as soon as the caller's utterance ends, rather than waiting for a batch transcription to finish.

Three related terms belong to the same stage. VAD (voice activity detection) recognises whether someone is currently speaking. Endpointing decides when an utterance is finished — typically after a 200–500 ms pause. Too aggressive endpointing chops sentences; too defensive endpointing makes the agent sluggish. Barge-in describes the caller's ability to interrupt the agent — a property absolutely required in productive setups because it draws the line between "I'm talking to a machine" and "I'm talking to a counterpart".
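
How endpointing follows from VAD output can be sketched in a few lines. The sketch assumes a hypothetical upstream VAD that emits one boolean per fixed-length audio frame (True means speech). The function name, the 20 ms frame size, and the default threshold are illustrative assumptions, not any provider's API.

```python
from typing import Iterable, Optional


def find_endpoint(speech_flags: Iterable[bool],
                  frame_ms: int = 20,
                  pause_ms: int = 350) -> Optional[int]:
    """Return the time in ms at which the utterance counts as finished.

    The utterance ends once speech has been heard and is followed by at
    least `pause_ms` of continuous silence. Returns None if that never
    happens within the given frames.
    """
    heard_speech = False
    silence_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0              # any speech resets the pause counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= pause_ms:
                return (index + 1) * frame_ms
    return None


# 200 ms speech, 100 ms thinking pause, 200 ms speech, 400 ms silence
frames = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 20
patient = find_endpoint(frames, pause_ms=350)
aggressive = find_endpoint(frames, pause_ms=100)
```

With the 350 ms threshold the endpoint is found after the second phrase. With 100 ms the short thinking pause already closes the turn, which is the chopped-sentence effect described above.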

LLM — the speech intelligence

The second stage receives the transcript, keeps a conversation history, generates the response, and decides whether to call a tool. The Large Language Model lives here — by default Anthropic Claude in our setups (Claude Sonnet 4.5 for most use cases, occasionally Claude Opus for particularly demanding domains), in alternative configurations OpenAI (GPT-4o, GPT-4.1) or Google Gemini. Three properties matter: German conversational quality, tool-use discipline, streaming latency.

TTS — text to speech

The third stage turns the LLM's response back into synthetic speech. Since around 2023, the market has been dominated by ElevenLabs — audio quality that is hard to distinguish from human recordings in blind tests, combined with a streaming API whose first audio frames arrive in 150–300 ms. Alternatives are OpenAI TTS (broad availability, but stylistically less flexible), Azure Neural Voices (regionally solid, but with higher latency), or the emerging Cartesia Sonic with particularly low latency.

Important: the TTS stage must be streaming-capable, otherwise it ruins the entire latency budget. A batch TTS that waits for the complete LLM output and then returns a finished MP3 file costs an additional 800 ms — unacceptable for conversation. Streaming TTS receives text fragments from the LLM and produces audio frames in parallel; the first syllables of the answer are audible while the LLM is still composing further sentences.
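
One way to feed a streaming TTS is to forward LLM tokens in clause-sized pieces rather than token by token or as one finished text. The following is a minimal sketch under that assumption. The function name and the 80-character limit are hypothetical, and providers that buffer text on their side (as the ElevenLabs WebSocket API does) may not need it.

```python
from typing import Iterable, Iterator

CLAUSE_ENDINGS = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Group LLM tokens into chunks that end at a clause boundary.

    A chunk is emitted when the buffer ends with punctuation or grows to
    `max_chars`, so synthesis can start before the response is complete.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(CLAUSE_ENDINGS) or len(buffer) >= max_chars:
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer                    # flush whatever is left at the end


chunks = list(chunk_for_tts(["One", " moment", ",", " let", " me", " check", "."]))
```

The punctuation rule is deliberately naive: a decimal number split across tokens would be cut at the point, so a production version needs a smarter boundary check.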

Latency budget — why sub-second is the benchmark

End-to-end latency is the most important technical metric of a voice agent. It decides whether the conversation feels natural or sluggish. Rules of thumb from telephony research: under 500 ms is excellent, 500–800 ms is good, 800–1200 ms is acceptable, anything above 1500 ms is perceived as disruptive — callers start speaking into the silence, which throws off recognition further.

Figure 1 — Latency waterfall of a voice-agent response from the end of the caller's utterance to the first audible syllable. Streaming-capable pipelines let STT, LLM, and TTS overlap — the TTS stage starts while the LLM is still generating. A setup without streaming sums the same stages sequentially and typically lands at 1.8–2.2 seconds.

Optimisation levers

Four knobs move the latency budget the most. First: streaming end to end. STT must deliver partials, the LLM must support token streaming, and TTS must accept text chunks. Any batch step in between loses the game. Second: aggressive endpointing. A 300 ms pause threshold instead of 500 ms saves 200 ms from the total, at the cost of occasionally chopped sentences. The configuration is application-specific: short for structured queries (appointment, address), longer for open enquiries. Third: provider locality. Anthropic, OpenAI, Deepgram, and ElevenLabs operate EU endpoints; an extra round-trip to a US region such as us-east-1 costs 80–120 ms per request. Fourth: parallel pipeline stages. The moment the LLM emits the first words, TTS synthesis begins, not after the full response is ready.
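
The effect of these levers can be made concrete with a small budget calculation. The stage timings below are illustrative assumptions chosen to fall within the ranges quoted in this article. They are not measurements, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class StageTimings:
    """Per-stage timings in milliseconds. All values are illustrative."""
    endpointing: int        # pause threshold before the turn is closed
    stt_final: int          # time to finalise the transcript after the pause
    llm_first_token: int    # LLM time to first token
    llm_full: int           # LLM time until the complete response exists
    tts_first_audio: int    # TTS time to first audio frame
    tts_full: int           # TTS time to synthesise the whole response
    network: int            # summed extra round-trips


def streaming_latency(t: StageTimings) -> int:
    """Time to the first audible syllable when every stage streams."""
    return (t.endpointing + t.stt_final + t.llm_first_token
            + t.tts_first_audio + t.network)


def batch_latency(t: StageTimings) -> int:
    """Same stages, but each waits for the previous one to finish."""
    return t.endpointing + t.stt_final + t.llm_full + t.tts_full + t.network


timings = StageTimings(endpointing=300, stt_final=100, llm_first_token=250,
                       llm_full=700, tts_first_audio=100, tts_full=800,
                       network=50)
defensive = replace(timings, endpointing=500)   # the 500 ms pause threshold

print(streaming_latency(timings), batch_latency(timings),
      streaming_latency(defensive))
```

In this toy budget, streaming reaches the first syllable in 800 ms, the sequential variant needs 1950 ms, and raising the pause threshold to 500 ms adds exactly the 200 ms mentioned above.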

When the budget is blown

Sometimes latency cannot be pushed further down — complex tool calls against slow back-office APIs are the most common cause. The fix is a two-stage response plan: the LLM emits a brief filler phrase ("one moment, let me check") before the tool call starts. Two seconds of silence become two seconds of conversation — the difference in caller perception is dramatic.
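
A variant of this idea can also live in the function bridge: speak the filler only if the tool has not answered within a short grace period. The sketch below assumes an asyncio-based bridge. `call_with_filler`, the `say` callback, and the 0.4 s grace period are hypothetical names and values, not part of any platform API.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_filler(tool_call: Awaitable[T],
                           say: Callable[[str], Awaitable[None]],
                           filler: str = "One moment, let me check.",
                           filler_after_s: float = 0.4) -> T:
    """Await a tool call and speak a filler phrase if it is not done quickly."""
    task = asyncio.ensure_future(tool_call)
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await say(filler)               # bridge the silence while the tool runs
    return await task
```

Fast back-office responses then pass through without any filler, while slow ones are covered by a spoken phrase instead of silence.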

Architecture — all the pieces in one picture

Before going into the individual components, here is the full picture. The figure below shows the typical topology of a productive voice-agent setup as CityAI deploys it in client installations.

Figure 2 — Voice-agent architecture in productive topology: the caller reaches the VAPI voice-agent platform through a SIP trunk; VAPI orchestrates the external model providers (Deepgram, Anthropic/OpenAI, ElevenLabs). Tool calls to back-office procedures run through the function bridge with dedicated authentication and tracing. All arrows show the initiating call — responses flow back along the same paths.

The pipeline runs as follows in the normal case: (1) the caller dials the service number, the carrier routes the SIP INVITE to the SIP trunk provider. (2) The SIP trunk forwards the call — via VAPI's WebSocket API — to the voice-agent platform; RTP audio packets are turned into real-time frames. (3) VAPI opens a streaming channel to the STT provider and starts pushing audio upstream; transcript partials come back. (4) Once VAPI detects the end of the caller's utterance, it hands the finalised transcript to the LLM. (5) The LLM streams back either pure response text (the default) or a tool call. (6) For a tool call, the function bridge routes the request to the back-office procedure, collects the result, and feeds it as context into the next LLM call. (7) The final response text is streamed in chunks to the TTS, which produces audio frames; VAPI mixes them into the RTP stream back to the caller.

VAPI as orchestrator

The voice-agent platform VAPI (vapi.ai) has established itself in our projects as the most pragmatic way to build a productive voice assistant in weeks rather than months. VAPI takes on the tasks any self-built voice agent eventually has to solve — and leaves the domain decisions to the developer.

What VAPI does

At its core, VAPI is a multi-provider orchestrator: the platform abstracts over STT, LLM, and TTS providers and offers a uniform configuration layer. Specifically, it solves five problems. First: SIP integration — VAPI talks natively to common SIP trunk providers (Twilio, Telnyx, Sipgate); both inbound (caller dials a number) and outbound (agent calls a citizen back) are supported. Second: audio routing with VAD, endpointing, and barge-in — three non-trivial components in any productive conversation. Third: provider abstraction — switching from Deepgram to Whisper or from Claude to GPT-4 takes a configuration change, not a code change. Fourth: function call bridge — tools are registered as webhook URLs; VAPI orchestrates the call, response handling, and re-prompting of the LLM. Fifth: call logs, recordings, and transcripts for operational analysis and audit requirements in regulated industries.

Configuration as an assistant definition

A voice agent in VAPI is defined as an "assistant" — a JSON object that describes all the pipeline's components at once:

{
  "name": "citizen-line-exampletown",
  "firstMessage": "Exampletown citizen service. How can I help you?",

  "transcriber": {
    "provider": "deepgram",
    "model":    "nova-3",
    "language": "en",
    "endpointing": 350
  },

  "model": {
    "provider": "anthropic",
    "model":    "claude-sonnet-4-5",
    "temperature": 0.3,
    "maxTokens":   400,
    "systemPrompt": "You are the telephone service of the city of Exampletown. Speak only English. Be precise and friendly. Ask targeted follow-ups if the request is unclear. Route to a human agent for legal or medical emergencies.",
    "tools": [
      { "type": "function", "function": { "name": "get_appointment_availability", ... } },
      { "type": "function", "function": { "name": "book_appointment",          ... } },
      { "type": "function", "function": { "name": "route_to_human",             ... } }
    ]
  },

  "voice": {
    "provider":               "11labs",
    "voiceId":                "EXAVITQu4vr4xnSDxMaL",
    "model":                  "eleven_flash_v2_5",
    "stability":              0.5,
    "similarityBoost":        0.75,
    "optimizeStreamingLatency": 3
  },

  "serverUrl": "https://voice-agent.tenvias.com/webhooks/vapi",
  "recordingEnabled": false,
  "maxDurationSeconds": 600,
  "endCallPhrases": ["goodbye", "thank you very much", "bye"]
}

Three properties of this configuration matter in practice. First: the system prompt language. We always pin the agent to the target language explicitly. This matters most in non-English deployments, where LLMs trained mostly on English text otherwise occasionally slip into English phrasing. Second: endpointing: 350 as the default, a midpoint between "react fast" and "let the caller finish". For applications with older audiences we raise this to 500–600 ms. Third: recordingEnabled: false as the default. Recordings are only enabled with explicit consent (see the data-protection section).

Tool definitions

Each tool is backed by an HTTPS URL that VAPI calls when the function is invoked. The definition follows the JSON Schema convention understood by both Anthropic and OpenAI, and VAPI translates between provider formats transparently. A tool specification has three parts: the name, the description (which determines how the LLM understands the tool, so its quality is decisive), and the parameters with the expected arguments.
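
The three parts can be illustrated with a small example. The tool `get_office_hours` below is hypothetical and not part of the assistant definition above, and `check_arguments` is an illustrative helper that covers only a small subset of JSON Schema (required fields, unknown fields, enum values).

```python
from typing import Any

# Hypothetical tool specification: name, description, parameters.
office_hours_tool: dict[str, Any] = {
    "name": "get_office_hours",
    "description": (
        "Returns the opening hours of a municipal office for a given weekday. "
        "Use this tool as soon as the caller asks when an office is open."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "office": {"type": "string",
                       "enum": ["citizen_office", "registry_office"]},
            "weekday": {"type": "string",
                        "enum": ["mon", "tue", "wed", "thu", "fri"]},
        },
        "required": ["office"],
    },
}


def check_arguments(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    schema = tool["parameters"]
    problems = []
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems
```

Checking arguments in the webhook before touching a back-office system keeps malformed tool calls from propagating. The list of problems can be returned to the LLM as the tool result so that it can ask the caller a follow-up question.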

ElevenLabs for TTS

Synthetic speech is, in 2025, for the first time available at a quality that is not reliably distinguishable from human recordings in double-blind listening tests. ElevenLabs is the market leader here — both in audio quality and in the streaming latency that voice agents require.

Models and their trade-offs

ElevenLabs currently offers three model families that differ in quality and latency. Flash v2.5 is the real-time model — first audio frames in 75–150 ms, slightly reduced prosody compared with the higher models, but optimised for conversation. Turbo v2.5 sits in the middle — about 200 ms latency, markedly more natural accents. Multilingual v2 is the quality model — 32 languages, best voice fidelity, but 300–500 ms time-to-first-byte. For CityAI we use Flash v2.5 by default; the few places where longer pre-generated announcements play (e.g. legal notices about call recording) are produced with the multilingual model and cached.

Voice cloning for a consistent brand voice

One of the more relevant features for CityAI: instant voice cloning can clone a voice from 60–90 seconds of speech samples. For municipalities, this is valuable because a consistent "civic service voice" can be established across all channels (voice agent, speech portal, on-hold announcements, explainer videos). Our recommendation: do not clone the voice of an employee or another identifiable private individual; work with a professional voice talent under an explicit licence agreement. That avoids the non-trivial legal questions around synthetic reproduction of a real person's voice.

Streaming API

The crucial endpoint for voice agents is the WebSocket-based streaming API. Text is streamed in chunks from the LLM, and ElevenLabs returns audio frames once enough text is available for a sensible synthesis. A minimal working client in Node.js:

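import WebSocket from 'ws';

// voiceId, llmStream and sipStream are assumed to be provided by the
// surrounding application (voice ID, LLM token stream, outbound audio sink).
const ws = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input` +
  `?model_id=eleven_flash_v2_5&optimize_streaming_latency=3` +
  `&output_format=ulaw_8000`,      // 8 kHz μ-law for G.711 telephony; the default is MP3
  { headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY } }
);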

ws.on('open', () => {
  // Initial frame: voice settings and generation config
  ws.send(JSON.stringify({
    text: ' ',                                       // mandatory leading space
    voice_settings:    { stability: 0.5, similarity_boost: 0.75 },
    generation_config: { chunk_length_schedule: [120, 160, 250, 290] }
  }));

  // Send LLM tokens as they arrive
  llmStream.on('text-chunk', (text) => {
    ws.send(JSON.stringify({ text, try_trigger_generation: true }));
  });

  // EOS signal when the LLM is done
  llmStream.on('end', () => {
    ws.send(JSON.stringify({ text: '' }));
  });
});

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.audio) {
    const audioBytes = Buffer.from(msg.audio, 'base64');
    sipStream.write(audioBytes);                     // straight into the RTP stream
  }
  if (msg.isFinal) {
    ws.close();
  }
});

Two parameters control the latency-quality trade-off. optimize_streaming_latency (0–4) — higher values prioritise lower latency at slightly reduced synthesis quality. We usually set 3. chunk_length_schedule — specifies how many characters ElevenLabs gathers before generating an audio chunk. Shorter for lower latency, longer for smoother prosody.

Pricing and data residency

ElevenLabs offers an EU endpoint (EU-Frankfurt), which is relevant for data-protection requirements in regulated setups. Pricing is character-based and works out to a fraction of a cent per second of audio; a typical voice agent lands at 2–5 cents per minute of conversation, which has so far been the smaller line item compared with LLM costs.

STT selection — Deepgram, Whisper, and the competition

On the STT side, the market is more heterogeneous than for TTS. Four providers are relevant in regulated industries.

Deepgram

Our default choice for CityAI. Deepgram Nova-3 delivers transcripts with streaming partials in 100–200 ms, has very good recognition rates even with regional accents, and offers an EU hosting option (EU-Frankfurt). Pricing is about 0.4 cents per minute, the lowest on the market for comparable quality. Native telephony codec support (μ-law, A-law) avoids transcoding.

OpenAI Whisper

Outstanding recognition quality, broad language coverage, but two hard limitations for voice agents: first, no native streaming (Whisper-1 is a batch API endpoint that waits for the complete audio recording); second, comparatively high latency (500 ms to 1 s after audio end). For use cases where streaming matters, Whisper is out as live STT — but it remains the right choice for asynchronous transcription (call evaluation, audit reviews, training material).

AssemblyAI and Azure Speech

AssemblyAI Universal-2 is optimised specifically for conversational use cases, with good streaming performance and integrated features such as speaker diarisation and sentiment analysis. These are relevant when a second analytical layer is meant to sit on top of the transcripts. Azure Speech is the hyperscaler answer: solid quality, Microsoft's own EU hosting, and good integration into the Microsoft ecosystem.

Selection criteria in practice

Four questions decide in the Tenvias method. (1) Streaming or not — voice agents demand streaming. (2) Recognition rate on real telephone recordings (not studio audio!) — we measure word error rate against an internal validation set of about 200 caller recordings. (3) Latency including network round-trip to the EU endpoint. (4) Data residency and processing agreement — all four providers listed above offer EU hosting, but contractual depth (processing for training, retention periods, sub-processors) differs substantially.
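
Word error rate, as used in criterion (2), is the word-level edit distance between reference and recognised text, divided by the number of reference words. The following is a minimal sketch of that standard definition. It is not the Tenvias evaluation tooling, and the normalisation (lowercasing, whitespace splitting) is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words processed so
    # far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One dropped word ("a") and one substitution ("id" -> "idea"): 2 errors / 6 words
wer = word_error_rate("I need a new ID card", "i need new idea card")
```

Averaged over a validation set of real telephone recordings, this gives one comparable number per provider. Punctuation and number formatting should be normalised the same way for all candidates before comparing.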

Anthropic Claude — the speech intelligence

The LLM is the heart of the voice agent. It runs the conversation, keeps context, decides on tool calls, and formulates the final response. At CityAI we use Anthropic Claude by default — for three reasons covered below.

Three reasons for Claude

First: German conversation at a high level. Claude is noticeably more natural in German B2C conversations than GPT-4 — smoother sentences, fewer anglicisms, better register in administrative language. A subjective observation that nevertheless reproduces consistently in our A/B tests. Second: tool use as a first-class feature. Claude separates text response from tool call cleanly, is disciplined in parameter extraction, and offers a clearly defined response format that is predictable to parse in voice pipelines. Third: Anthropic offers a real EU contract with a clear data-processing agreement, EU hosting on AWS Frankfurt, and defined retention periods — critical for setups in regulated industries.

Tool use with streaming

A complete call against the Anthropic API with tools and streaming in Python:

import json

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_appointment_availability",
        "description": (
            "Searches free appointments at the citizen office by purpose and date range. "
            "Use this tool as soon as the caller expresses a concrete appointment wish "
            "for a specific purpose."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "purpose": {
                    "type": "string",
                    "enum": ["id_card", "registration",
                             "driving_licence", "marriage"],
                    "description": "Type of request"
                },
                "from_date": {
                    "type": "string",
                    "format": "date",
                    "description": "Earliest acceptable date"
                }
            },
            "required": ["purpose"]
        }
    },
    # ... further tools (book_appointment, route_to_human)
]

system_prompt = (
    "You are the telephone service of the city of Exampletown. "
    "Speak English clearly and politely. "
    "Ask focused follow-ups when a request is unclear. "
    "Route to a human agent for medical emergencies."
)

# conversation_history, tts and execute_tool are provided by the surrounding
# voice pipeline (message list, streaming TTS client, tool dispatcher).
current_tool_call = None   # tool_use block currently being streamed, if any

with client.messages.stream(
    model       = "claude-sonnet-4-5",
    max_tokens  = 400,
    system      = system_prompt,
    tools       = tools,
    messages    = conversation_history,
    temperature = 0.3,
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            block = event.content_block
            if block.type == "tool_use":
                # Keep the id: the tool_result must reference it later
                current_tool_call = { "id": block.id, "name": block.name, "input": "" }

        elif event.type == "content_block_delta":
            delta = event.delta
            if delta.type == "text_delta":
                # Stream the token straight into the TTS
                tts.send_chunk(delta.text)
            elif delta.type == "input_json_delta" and current_tool_call:
                # Tool input arrives incrementally as JSON fragments
                current_tool_call["input"] += delta.partial_json

        elif event.type == "content_block_stop":
            if current_tool_call:
                # Parse the collected JSON (an empty string means no arguments)
                tool_input = json.loads(current_tool_call["input"] or "{}")
                # Execute the tool and feed the result into the next turn
                result = execute_tool(current_tool_call["name"], tool_input)
                conversation_history.append({
                    "role": "assistant",
                    "content": [{
                        "type":  "tool_use",
                        "id":    current_tool_call["id"],
                        "name":  current_tool_call["name"],
                        "input": tool_input
                    }]
                })
                conversation_history.append({
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": current_tool_call["id"],
                        "content": json.dumps(result)
                    }]
                })
                current_tool_call = None   # ready for the next content block

Four points matter in practice here. First: the description of every tool is decisive for whether the LLM picks the right tool at the right moment. We write it in complete sentences with clear triggers ("Use this tool as soon as …"); a terse description like "books appointments" leads to noticeably more mis-applications. Second: temperature=0.3 as the default, low enough for consistent answers, high enough to avoid sounding robotic. Third: the streaming event model lets you send text responses straight to TTS and collect tool calls incrementally, since both paths share one event stream. Fourth: tool results must be inserted into the history as a "user" message with type tool_result, otherwise the LLM loses context.

Prompt caching for conversation history

Anthropic supports prompt caching, which directly saves money in voice agents: the long system prompt with all tool definitions is written to the cache on the first turn of a conversation. On later turns, that prefix is billed at a reduced cache-read rate, and only the new conversation turns are billed at the full input price. On a five-minute call with 20 turns, this typically cuts token cost by 70–80 percent.
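
Caching is switched on per request by marking the end of the stable prefix with a cache_control entry. The following is a minimal sketch of such a request. The helper name `build_cached_request` is hypothetical, and the model and token limit are simply taken from the configuration used in this article.

```python
from typing import Any


def build_cached_request(system_prompt: str,
                         tools: list[dict[str, Any]],
                         conversation_history: list[dict[str, Any]]) -> dict[str, Any]:
    """Build keyword arguments for client.messages.create() or .stream().

    The cache_control marker on the system block asks the API to cache the
    prefix up to and including that block, i.e. the tool definitions and
    the system prompt. The conversation turns after it stay uncached.
    """
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 400,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": conversation_history,
    }


request = build_cached_request(
    "You are the telephone service of the city of Exampletown.",
    tools=[],
    conversation_history=[{"role": "user", "content": "I need an ID card."}],
)
# Usage with the SDK would be: client.messages.create(**request)
```

Whether a request actually hit the cache can be read from the usage fields cache_creation_input_tokens and cache_read_input_tokens of the response. Prefixes below a model-specific minimum length are not cached at all, so the saving depends on a sufficiently long system prompt.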

OpenAI Realtime API — the speech-to-speech alternative

OpenAI introduced another architecture with the Realtime API: no separate STT/LLM/TTS, but a single endpoint that takes audio in and delivers audio out. The model hears, thinks, and speaks in one step — yielding a conversational quality that beats the classical pipeline on prosody and speech flow.

Pros and cons

Pro: lower latency (no three separate API calls), more natural conversation (the model "hears" speech tempo, intonation, and emotion and responds accordingly), built-in barge-in handling, simpler programming model. Con: vendor lock-in to OpenAI, less control over individual pipeline stages, a voice limited to OpenAI's predefined voices (no custom clones), and a data-protection situation for EU customers that is more complicated than with Anthropic. This last point is the most common reason we lean towards the classical pipeline in regulated industries; for internal setups or applications with less sensitive data, however, the Realtime API is a serious candidate.

WebSocket setup

The Realtime API is a bidirectional WebSocket connection. Configuration and audio streaming in one snippet:

import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta':   'realtime=v1'
    }
  }
);

ws.on('open', () => {
  // Session configuration
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities:           ['audio', 'text'],
      voice:                'shimmer',
      input_audio_format:   'g711_ulaw',     // SIP standard
      output_audio_format:  'g711_ulaw',
      input_audio_transcription: { model: 'whisper-1' },
      instructions: 'You are the telephone service of the city of ' +
                    'Exampletown. Speak English politely.',
      turn_detection: {
        type:                'server_vad',
        threshold:           0.5,
        prefix_padding_ms:   300,
        silence_duration_ms: 500
      },
      tools: [
        {
          type: 'function',
          name: 'get_appointment_availability',
          description: 'Searches free appointments at the citizen office ...',
          parameters: { /* JSON Schema as with Anthropic */ }
        }
      ]
    }
  }));
});

// Pump incoming SIP audio directly into the realtime session
sipStream.on('audio', (audioFrame) => {
  ws.send(JSON.stringify({
    type:  'input_audio_buffer.append',
    audio: audioFrame.toString('base64')
  }));
});

// Outbound audio frames back to SIP
ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    const audioBytes = Buffer.from(event.delta, 'base64');
    sipStream.write(audioBytes);
  }
  if (event.type === 'response.function_call_arguments.done') {
    // The tool call has finished collecting arguments; they arrive as a JSON string
    const args   = JSON.parse(event.arguments);
    const result = executeToolCall(event.name, args);
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type:    'function_call_output',
        call_id: event.call_id,
        output:  JSON.stringify(result)
      }
    }));
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});

Three peculiarities of this setup. First: g711_ulaw is a standard telephony format that the API accepts natively, so no transcoding is needed when the trunk delivers μ-law, saving 20–50 ms of latency. European trunks more commonly deliver A-law, for which g711_alaw is the matching setting. Second: turn_detection: server_vad means OpenAI handles VAD, which is convenient but less configurable than your own VAD layer. Third: tool calls work analogously to Anthropic: the model "speaks" a function call, the server executes it, the result flows back, and the model continues its response.

When realtime, when classical pipeline

Our rule of thumb: Realtime API for short, dialogue-heavy applications with lower data sensitivity (demo setups, internal tools, B2C applications without administrative context). Classical pipeline for regulated setups, for cases where custom voices are required, and wherever future provider swapping should remain feasible.

Telephony and SIP integration

Before a voice agent can speak at all, it has to be reachable. The SIP/telephony layer is conceptually trivial but in practice the place where most regulatory and integration stumbling blocks live.

Provider landscape

Twilio is the international standard: broad feature set, good API, stable for years, US-centric. For CityAI in German municipalities it is not the first choice because of the data-residency question. Sipgate is the German standard: full EU hosting, good localisation, acceptable prices, a somewhat less polished API. Telnyx is the cost-effective alternative with good latency and EU presence, though without a German contractual entity. Self-hosted FreeSWITCH / Asterisk SBCs are the option for customers who source SIP trunks themselves from Deutsche Telekom or a group-owned carrier and run a session border controller in their own data centre, which is typical in banking environments with their own telephony infrastructure.

DID routing and number porting

A productive voice-agent rollout is usually delayed not by the software but by two regulatory and administrative steps: number porting of an existing service number to the SIP trunk provider (six to twelve weeks, depending on the relinquishing carrier) and the direct-inbound-dial (DID) configuration that defines which inbound calls route to which endpoint. Both belong in any rollout plan as separate workstreams; otherwise you end up with finished software and no callers.

Codec choice

Two codecs matter for SIP telephony. G.711 (μ-law / A-law) at 64 kbit/s is the classic narrowband codec with ISDN quality; it is broadly supported, so calls usually pass through without transcoding. Opus at 24–48 kbit/s is modern and offers a better trade-off between bitrate and quality, but it is not universally supported. For voice agents in Germany, G.711 is the safe choice; Opus pays off when the entire path (SIP trunk, VAPI, STT, TTS) speaks Opus end to end.
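
The 64 kbit/s figure follows directly from how G.711 samples audio, and the same arithmetic gives the payload size of each RTP packet. The sketch below is a worked calculation. The 20 ms packet interval is a common default and an assumption here, not something fixed by this setup.

```python
SAMPLE_RATE_HZ = 8000      # G.711 samples narrowband audio at 8 kHz
BITS_PER_SAMPLE = 8        # one companded byte per sample (mu-law or A-law)


def g711_bitrate_kbit() -> float:
    """Audio bitrate in kbit/s, without RTP/UDP/IP headers."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000


def g711_payload_bytes(packet_ms: int = 20) -> int:
    """Bytes of audio in one RTP packet covering `packet_ms` milliseconds."""
    samples = SAMPLE_RATE_HZ * packet_ms // 1000
    return samples * BITS_PER_SAMPLE // 8


def packets_per_second(packet_ms: int = 20) -> float:
    return 1000 / packet_ms


print(g711_bitrate_kbit(), g711_payload_bytes(), packets_per_second())
```

At 20 ms per packet, each packet carries 160 bytes of audio and 50 packets flow per second in each direction. That is the frame size an STT or TTS provider with native μ-law or A-law support receives or produces.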

OZG integration, data protection, and EU hosting

In the regulated industries Tenvias works in, the technical pipeline is only part of the task — legal and integration hardening matters at least as much. Three topics decide productive viability.

Data protection and GDPR

Call audio almost always contains personal data, and a voice can qualify as biometric data (Art. 4(14) GDPR), which falls under the special categories of Art. 9 GDPR when it is processed to uniquely identify a person. Four obligations follow. (1) Notice at the start of the conversation: information that the caller is talking to an AI-supported voice assistant, with the option to be routed to a human at any time. (2) Data-processing agreement with every pipeline provider (SIP trunk, VAPI, STT, LLM, TTS) and review of their sub-processors. (3) Data minimisation: audio frames are not persisted, transcripts are deleted automatically after 30 days, recordings only with explicit consent. (4) EU hosting: Anthropic AWS Frankfurt, OpenAI EU (where available), Deepgram EU, ElevenLabs EU-Frankfurt. All four providers have corresponding options that must be set explicitly in the API configuration.

OZG / municipal back-office integration

The bridge to the actual value of a voice agent, from "speaks politely" to "solves the request", lies in integrating with municipal back-office procedures. OZG stands for the Online Access Act (Onlinezugangsgesetz), which obliges German authorities to offer their services digitally. Three standards are relevant in the German public sector. XÖV (XML in public administration) is the family of standardised data formats for inter-agency exchange. FIM (Föderales Informationsmanagement, federal information management) provides keys and master data for administrative services; anyone working with FIM keys can identify requests uniquely across agency boundaries. BundID or a state's service account provides the citizen's authentication once a process leaves anonymous information and has to be personally attributed.

A FIM-compliant tool call looks like this in the voice agent's implementation:

import httpx
from datetime import date

FIM_KEYS = {
    "id_card":          "99036006001000",   # FIM service key
    "registration":     "99010002001000",
    "driving_licence":  "99117002001000",
    "marriage":         "99089001001000",
}

def get_appointment_availability(purpose: str,
                                 from_date: str | None = None) -> dict:
    """Tool implementation: query appointments through an XOEV-compliant API."""
    fim_key = FIM_KEYS.get(purpose)
    if fim_key is None:
        return { "error": f"Unknown purpose: {purpose}" }

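    try:
        response = httpx.get(
            f"{APPOINTMENT_API}/availability",
            params={
                "serviceKey": fim_key,
                "from":       from_date or date.today().isoformat(),
                "limit":      10
            },
            headers={
                "Authorization":        f"Bearer {service_token}",
                "X-FIM-Conformity":     "1.0",
                "X-Trace-Id":           current_trace_id()
            },
            timeout=4.0
        )
        response.raise_for_status()
    except httpx.HTTPError:
        # Timeout, connection failure or HTTP error status: return a
        # structured error so the agent can apologise or route to a human
        return { "result": "error",
                 "note":   "Appointment service currently unavailable." }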

    slots = response.json()["appointments"]
    if not slots:
        return { "result": "no_slots",
                 "note":   "No free slot in the next 4 weeks." }

    return {
        "result":         "available",
        "next_slot":      slots[0]["date"],
        "alternatives":   [s["date"] for s in slots[1:5]]
    }

Four properties of this tool code matter. First: the translation from natural speech ("I need an ID card") to a FIM service key happens in code, not in the LLM prompt — that makes the administrative classification deterministic and auditable. Second: the X-Trace-Id header propagates the call ID into the back-office procedure — in an incident, the caller context can be reconstructed cleanly (see our article on logging in the enterprise). Third: the tight timeout=4.0 ensures a slow back-office API does not push the agent past the latency limit — when exceeded, the agent absorbs it with the "one moment, please" strategy. Fourth: the return value is structured as "result" codes that the LLM translates into natural-language answers — a clean separation between data retrieval and conversational surface.
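The first property can be reinforced on the model side as well. The following is a minimal sketch, assuming the tool is declared in the name / description / input_schema form used for tool definitions in the Anthropic Messages API. The description strings are illustrative, not the production prompt, and the purpose enum is derived from FIM_KEYS so that the model can only choose purposes the code knows how to map:

```python
import json

# Same mapping as in the tool implementation (repeated so the sketch runs on its own).
FIM_KEYS = {
    "id_card":         "99036006001000",
    "registration":    "99010002001000",
    "driving_licence": "99117002001000",
    "marriage":        "99089001001000",
}

# Tool declaration handed to the LLM. Description texts are assumptions.
APPOINTMENT_TOOL = {
    "name": "get_appointment_availability",
    "description": "Look up free appointment slots for a citizen service.",
    "input_schema": {
        "type": "object",
        "properties": {
            "purpose": {
                "type": "string",
                # The model picks a purpose name; it never sees or invents FIM keys.
                "enum": sorted(FIM_KEYS),
                "description": "Service the caller needs an appointment for.",
            },
            "from_date": {
                "type": "string",
                "description": "Earliest acceptable date, ISO format YYYY-MM-DD.",
            },
        },
        "required": ["purpose"],
    },
}

if __name__ == "__main__":
    print(json.dumps(APPOINTMENT_TOOL, indent=2))
```

Deriving the enum from the dictionary keeps the declaration and the implementation in step: a new service only has to be added in one place.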

Practical note

For call recordings — to the extent they are provided for at all — the strict consent regime applies. A blanket announcement "this call may be recorded for quality purposes" is not sufficient at a municipal service line. Instead: explicit question at the beginning ("Do you agree that this call may be recorded for quality assurance? Yes or no?"), default "no" on any ambiguity, and hard retention limits (in our setups: 14 days for recordings, 30 days for transcripts, automated anonymisation after that).
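The two rules in this note, default "no" on ambiguity and hard retention limits, can be sketched in a few lines. This is an illustration under assumptions: the word lists are placeholders that a real deployment would replace with a tuned classifier per language, and the function and dictionary names are hypothetical. Only the 14-day and 30-day limits come from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention limits quoted in the note; the dict layout is an assumption.
RETENTION = {"recording": timedelta(days=14), "transcript": timedelta(days=30)}

# Illustrative word lists only.
YES_WORDS = {"yes", "yeah", "sure", "agreed", "ok", "okay"}
NO_WORDS = {"no", "nope", "not", "don't", "never"}


def recording_consent(answer: str) -> bool:
    """Return True only for an unambiguous yes; anything else counts as no."""
    words = {w.strip(".,!?") for w in answer.lower().split()}
    said_yes = bool(words & YES_WORDS)
    said_no = bool(words & NO_WORDS)
    return said_yes and not said_no


def is_expired(kind: str, created_at: datetime,
               now: Optional[datetime] = None) -> bool:
    """True once an artefact of the given kind has passed its retention limit."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION[kind]
```

A mixed answer such as "yes... no, wait" contains both a yes word and a no word and is therefore treated as a refusal, which is the conservative direction the note asks for.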

Operations, observability, and when a voice agent fits

A productive voice-agent installation is not a set-and-forget system. It carries operational requirements similar to any other platform, plus three peculiarities specific to voice.

Monitoring and logging

Three metrics matter most. End-to-end latency per conversation turn — measured from the end of the caller's utterance to the first audible syllable. Tool-call success rate — how often a call to a back-office procedure fails, how often the LLM emits a tool call it should not have. Escalation rate — how often the conversation is routed to a human agent, and for what reason. These three values land in a Grafana dashboard (see our article on Zabbix and Grafana) with drill-down by day-of-week and request category.
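How these three values might be computed from per-turn logs is sketched below. The record layout (Turn, Call) is hypothetical and not the CityAI schema. The tool metric covers only failed calls, because detecting tool calls the model should not have made requires labelled data.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Turn:
    """One conversation turn as logged by the agent (hypothetical layout)."""
    latency_ms: float          # end of caller utterance to first audible syllable
    tool_called: bool = False
    tool_failed: bool = False


@dataclass
class Call:
    turns: list
    escalated: bool = False


def summarise(calls: list) -> dict:
    """Aggregate the three headline metrics over a batch of calls."""
    latencies = [t.latency_ms for c in calls for t in c.turns]
    tool_turns = [t for c in calls for t in c.turns if t.tool_called]
    succeeded = sum(1 for t in tool_turns if not t.tool_failed)
    return {
        # 95th percentile of turn latency; quantiles() needs at least two samples.
        "latency_p95_ms": quantiles(latencies, n=20)[-1] if len(latencies) > 1 else None,
        "tool_success_rate": succeeded / len(tool_turns) if tool_turns else None,
        "escalation_rate": sum(c.escalated for c in calls) / len(calls) if calls else None,
    }
```

A percentile is used instead of the mean for latency because a few slow turns are exactly what callers notice, and an average would hide them.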

Fallback to human operators

A productive voice agent must have a clean escalation strategy, covering both explicit requests ("I'd like to talk to a person") and implicit triggers (medical emergencies, legal questions, three consecutive tool-call failures, frustration markers in the caller's tone). The handover happens via the SIP REFER mechanism: the SIP trunk provider re-routes the call, ideally passing the conversation summary along so that it appears as a screen pop-up at the operator's workstation (CRM integration).
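The trigger logic can be kept as deterministic code, separate from the LLM, in the same spirit as the FIM mapping. The sketch below is illustrative: the phrase lists are placeholders, the frustration flag is assumed to be set by an upstream tone detector, and all names are hypothetical. It only decides whether and why to escalate; the SIP REFER itself is issued elsewhere.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative phrase lists; real triggers would come from a classifier or tuned lexicon.
HUMAN_REQUEST = ("talk to a person", "speak to a human", "real person")
SENSITIVE = ("emergency", "ambulance", "lawyer", "legal advice")
MAX_TOOL_FAILURES = 3  # from the text: three consecutive tool-call failures


@dataclass
class CallState:
    consecutive_tool_failures: int = 0
    frustration_detected: bool = False  # assumed to be set by a tone detector


def escalation_reason(utterance: str, state: CallState) -> Optional[str]:
    """Return a reason code if the call should go to a human, else None."""
    text = utterance.lower()
    if any(phrase in text for phrase in HUMAN_REQUEST):
        return "explicit_request"
    if any(phrase in text for phrase in SENSITIVE):
        return "sensitive_topic"
    if state.consecutive_tool_failures >= MAX_TOOL_FAILURES:
        return "tool_failures"
    if state.frustration_detected:
        return "frustration"
    return None
```

Returning a reason code instead of a bare boolean means the same value can be logged with the call, which is what the escalation-rate metric needs for its breakdown by reason.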

When a voice agent fits

From our experience with CityAI and comparable projects, a voice agent fits when (a) there is a high call volume with relatively structured concerns such as appointments, status checks, general information, and address changes; (b) 24/7 reachability is desired but cannot be staffed economically with humans; (c) recurring questions account for a significant share of calls; (d) integration with existing back-office procedures via REST APIs or XÖV/FIM is technically possible.

It is less suitable when (a) every request is individual and counselling-intensive; (b) crisis situations make up a non-trivial share (debt counselling, addiction support, and psychosocial emergencies belong in human hands); (c) the constituency has an age profile that struggles with synthetic voices in principle; (d) the back-office procedures are not machine-accessible and every request requires manual back-end work.

Economic assessment

A realistic operation with 200 calls per day, averaging four minutes per call, runs at roughly 400–600 euros per month in provider fees (STT + LLM + TTS + SIP trunk). On top of that come the one-off setup cost and the ongoing maintenance of prompts, tools, and back-office integrations. Compared with a human-staffed 24/7 citizen service, that is economically attractive, provided the domain fit is established, which is what a careful use-case analysis at the start of any such project is for.
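The figures above imply a blended per-minute rate, and that rate is the number to check against current provider price lists before budgeting. A small worked calculation, assuming the line runs 30 days a month (the only value here that is not taken from the text):

```python
CALLS_PER_DAY = 200
MINUTES_PER_CALL = 4
DAYS_PER_MONTH = 30        # assumption: the line is reachable every day
FEES_EUR = (400, 600)      # monthly provider fees quoted in the text

minutes_per_month = CALLS_PER_DAY * MINUTES_PER_CALL * DAYS_PER_MONTH


def cents_per_minute(monthly_fee_eur: float) -> float:
    """Implied all-in provider cost per call minute, in euro cents."""
    return monthly_fee_eur * 100 / minutes_per_month


if __name__ == "__main__":
    low, high = (cents_per_minute(fee) for fee in FEES_EUR)
    print(f"{minutes_per_month} minutes/month, "
          f"{low:.1f} to {high:.1f} cents per minute")
```

This comes to 24,000 call minutes per month and roughly 1.7 to 2.5 cents per minute across all four provider categories combined.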

In a municipality we onboarded to CityAI in 2025, the share of calls fully answered by the agent stabilised at 68 percent after three months — the remaining 32 percent escalated to clerks via the routing logic, with the conversation log already containing the substantive pre-clarification. The service line was relieved of routine work without channelling demanding cases through a bot bottleneck. That balance — routine to the machine, complexity to the human — is the economic and ethical sweet spot of a productive voice agent.

Voice-agent pilot or productive rollout?

We review your caller landscape together with your team — call volume, request types, back-office integration, regulatory framing, latency requirements, escalation strategy. The result: a concrete action plan, tailored to the size and maturity of your service line, and a realistic schedule from first prototype to productive operations.
