The WebSocket API is the lowest-level integration point for XUNA AI Conversational AI. Use it when you need to build integrations outside the browser — on a server, embedded hardware, or a platform not covered by the official SDKs. You send raw PCM audio and receive transcripts, agent audio, and conversation events over a persistent connection.

Connection

Connect to the WebSocket endpoint:
wss://api.xuna.ai/v1/convai/conversation
Public agents — append ?agent_id=YOUR_AGENT_ID to the URL:
wss://api.xuna.ai/v1/convai/conversation?agent_id=YOUR_AGENT_ID
Private agents — use a signed URL instead of the base endpoint. Generate the signed URL on your server (see Authentication) and connect to it directly:
wss://api.xuna.ai/v1/convai/conversation?token=SIGNED_TOKEN
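Both URL shapes can be assembled with the standard URL API. A minimal sketch (the helper names are illustrative, not part of the API):

```javascript
// Build the WebSocket URL for a public agent (agent_id) or a
// private agent (signed token generated on your server).
const BASE_URL = 'wss://api.xuna.ai/v1/convai/conversation';

function publicAgentUrl(agentId) {
  const url = new URL(BASE_URL);
  url.searchParams.set('agent_id', agentId);
  return url.toString();
}

function privateAgentUrl(signedToken) {
  const url = new URL(BASE_URL);
  url.searchParams.set('token', signedToken);
  return url.toString();
}
```

Using the URL API rather than string concatenation ensures agent IDs and tokens are query-encoded correctly.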

Audio format

All audio sent to and received from the WebSocket uses the same format:
Encoding: PCM 16-bit signed integer
Sample rate: 16,000 Hz
Channels: Mono
Byte order: Little-endian
Transport encoding: Base64
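If your capture pipeline produces Float32 samples (as the Web Audio API does), they must be converted to this format before sending. A sketch in Node.js (your audio library may already emit Int16 samples, in which case only the base64 step applies):

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit signed little-endian
// PCM, then base64-encode for the `audio` message payload.
function floatTo16BitPcmBase64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp
    // Scale to the signed 16-bit range and write little-endian.
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf.toString('base64');
}
```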

Message types — client to server

Your client sends JSON messages to control the session and stream audio.

conversation_initiation_client_data

Send this as the first message after connecting to configure the session. All fields are optional.
{
  "type": "conversation_initiation_client_data",
  "conversation_config_override": {
    "agent": {
      "prompt": { "prompt": "You are a helpful assistant." },
      "first_message": "Hello! How can I help?",
      "language": "en"
    },
    "tts": {
      "voice_id": "your_custom_voice_id"
    }
  }
}

audio

Stream microphone audio to the server. Send chunks continuously while the user is speaking.
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "<base64-encoded PCM audio>"
  }
}
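Raw PCM can be split into these messages with a small generator. A sketch (the chunk size is a reasonable default, not an API requirement; 4,096 bytes at 16 kHz 16-bit mono is 128 ms of audio):

```javascript
// Yield `audio` messages for consecutive chunks of a raw PCM buffer.
function* audioMessages(pcm, chunkBytes = 4096) {
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    yield {
      type: 'audio',
      audio_event: {
        audio_base_64: pcm.subarray(offset, offset + chunkBytes).toString('base64'),
      },
    };
  }
}
```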

user_activity

Send this when you detect user activity (e.g., a keystroke or gesture) to signal that the user is present. Useful for non-audio interactions.
{
  "type": "user_activity"
}

pong

Respond to server ping messages to keep the connection alive.
{
  "type": "pong",
  "event_id": 123
}

Message types — server to client

The server sends JSON messages for conversation events and agent audio.

conversation_initiation_metadata

Sent immediately after the connection is established. Contains the conversation ID and the audio format the agent will use for output.
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "conv_abc123",
    "agent_output_audio_format": "pcm_16000"
  }
}

audio

Agent speech as a base64-encoded PCM chunk. Decode and play it back in order.
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "<base64-encoded PCM audio>",
    "event_id": 1
  }
}
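Because each chunk carries an event_id, you can buffer out-of-order arrivals and release audio strictly in sequence. A minimal ordering buffer (the class name is illustrative; it assumes event_ids start at 1 and increment by 1):

```javascript
// Buffers incoming agent audio chunks and releases them strictly in
// event_id order, even if messages arrive out of sequence.
class AudioChunkQueue {
  constructor() {
    this.pending = new Map(); // event_id -> decoded PCM Buffer
    this.nextId = 1;          // next event_id to release
    this.output = [];         // in-order PCM Buffers ready for playback
  }

  push(audioEvent) {
    this.pending.set(
      audioEvent.event_id,
      Buffer.from(audioEvent.audio_base_64, 'base64')
    );
    // Drain every chunk that is now contiguous with what was released.
    while (this.pending.has(this.nextId)) {
      this.output.push(this.pending.get(this.nextId));
      this.pending.delete(this.nextId);
      this.nextId++;
    }
  }

  // Call this on an `interruption` message to drop buffered audio.
  clear() {
    this.pending.clear();
    this.output = [];
  }
}
```

The `clear()` method is what an `interruption` handler would call to stop playback of stale chunks.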

agent_response

The agent’s response text. Arrives alongside or slightly before the corresponding audio chunks.
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "Hello! How can I help you today?"
  }
}

user_transcript

Transcription of the user’s speech. Use this to display what the user said in your UI.
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "What are your opening hours?"
  }
}

ping

Sent by the server periodically to verify the connection is alive. Respond with a pong message using the same event_id.
{
  "type": "ping",
  "ping_event": {
    "event_id": 123
  }
}

interruption

Sent when the user interrupts the agent mid-response. Stop playing any buffered audio chunks when you receive this message.
{
  "type": "interruption"
}

Example: Node.js client

The following example shows a minimal WebSocket client that connects, sends audio from a file, and logs transcripts and agent responses.
import WebSocket from 'ws';
import fs from 'fs';

const AGENT_ID = 'YOUR_AGENT_ID';
const ws = new WebSocket(
  `wss://api.xuna.ai/v1/convai/conversation?agent_id=${AGENT_ID}`
);

ws.on('open', () => {
  console.log('Connected');

  // Send session configuration
  ws.send(JSON.stringify({ type: 'conversation_initiation_client_data' }));

  // Stream audio from a file (replace with live microphone input)
  const audio = fs.readFileSync('input.raw');
  const chunkSize = 4096;
  let offset = 0;

  const interval = setInterval(() => {
    if (offset >= audio.length) {
      clearInterval(interval);
      return;
    }
    const chunk = audio.subarray(offset, offset + chunkSize);
    ws.send(
      JSON.stringify({
        type: 'audio',
        audio_event: { audio_base_64: chunk.toString('base64') },
      })
    );
    offset += chunkSize;
  }, 100);
});

ws.on('message', (data) => {
  const message = JSON.parse(data.toString());

  switch (message.type) {
    case 'conversation_initiation_metadata':
      console.log('Conversation ID:', message.conversation_initiation_metadata_event.conversation_id);
      break;
    case 'user_transcript':
      console.log('User:', message.user_transcription_event.user_transcript);
      break;
    case 'agent_response':
      console.log('Agent:', message.agent_response_event.agent_response);
      break;
    case 'ping':
      ws.send(JSON.stringify({ type: 'pong', event_id: message.ping_event.event_id }));
      break;
    case 'interruption':
      console.log('Interrupted — clear audio buffer');
      break;
  }
});

ws.on('close', () => console.log('Disconnected'));
ws.on('error', (err) => console.error('Error:', err));
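
In production you will likely want to reconnect after an unexpected close. A sketch of an exponential-backoff delay schedule (the base and cap are arbitrary choices, not API requirements):

```javascript
// Delay before reconnect attempt `attempt` (0-based), doubling each
// time up to a ceiling.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

A close handler would schedule `setTimeout(connect, backoffDelayMs(attempt++))` and reset `attempt` to 0 once a connection succeeds.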

Summary of message types

Direction         Type                                  Purpose
Client → Server   conversation_initiation_client_data   Configure session overrides
Client → Server   audio                                 Stream microphone audio
Client → Server   user_activity                         Signal user presence
Client → Server   pong                                  Respond to server ping
Server → Client   conversation_initiation_metadata      Conversation ID and output audio format
Server → Client   audio                                 Agent speech chunks
Server → Client   agent_response                        Agent response text
Server → Client   user_transcript                       User speech transcript
Server → Client   ping                                  Keepalive check
Server → Client   interruption                          User interrupted the agent

Next steps