The large language model (LLM) you choose determines response quality, latency, capability, and cost. XUNA AI Conversational AI supports models from OpenAI, Google, Anthropic, and XUNA AI-hosted open models, as well as custom endpoints you host yourself.
## Supported models
| Provider | Models |
|---|---|
| XUNA AI | GLM-4.5-Air, Qwen3-30B-A3B, GPT-OSS-120B |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Gemini 2.0 Flash, Gemini 2.0 Flash Lite |
| OpenAI | GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo |
| Anthropic | Claude Sonnet 4.5, Claude Sonnet 4, Claude Haiku 4.5, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3 Haiku |
Certain models are available in HIPAA-compliant environments. Contact XUNA AI support to confirm which models are covered under your Business Associate Agreement.
## Temperature
Temperature controls how creative or conservative the model’s responses are. Lower values produce more predictable, factual responses. Higher values produce more varied, creative ones.
| Range | Label | Best for |
|---|---|---|
| 0.0 – 0.3 | Low | Customer support, FAQ bots, tasks requiring accuracy |
| 0.4 – 0.7 | Medium | General-purpose assistants, balanced creativity and accuracy |
| 0.8 – 1.0 | High | Creative applications, brainstorming, entertainment |
Start at medium (0.5) and adjust based on how the agent responds in testing.
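As a sketch, temperature sits inside the same `conversation_config` structure shown in the custom LLM example later on this page. The model identifier below is a placeholder, not a confirmed ID string:

```python
# Sketch: setting temperature in an agent's conversation_config.
# The structure mirrors the custom-LLM example on this page;
# "gemini-2.5-flash" is a placeholder model identifier.
conversation_config = {
    "agent": {
        "llm": {
            "model_id": "gemini-2.5-flash",  # placeholder identifier
            "temperature": 0.5,              # medium: balanced starting point
        }
    }
}
```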
## Backup LLM
If the primary model is unavailable, the backup LLM takes over to keep conversations running.
| Option | Behavior |
|---|---|
| Default | XUNA AI automatically selects a reliable fallback model |
| Custom | You specify which model to use as the fallback |
| Disabled | No fallback — conversation fails if the primary model is unavailable |
Disabling the backup LLM is strongly discouraged for production agents. A model outage will immediately interrupt live conversations with no recovery path.
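The three options above can be sketched as failover logic. This is illustrative only — XUNA AI performs this failover server-side — but it shows the behavior each option implies:

```python
# Illustrative failover sketch. XUNA AI handles this server-side; this
# just demonstrates what each backup-LLM option means in practice.
def complete_with_fallback(primary, backup, prompt):
    """Call `primary`; on failure, fall back to `backup` if one is set.

    backup=None corresponds to the "Disabled" option: the conversation
    fails when the primary model is unavailable.
    """
    try:
        return primary(prompt)
    except Exception:
        if backup is None:
            raise  # Disabled: no recovery path
        return backup(prompt)  # Default/Custom: fallback takes over
```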
## Thinking budget
The thinking budget controls how much reasoning a model performs before generating a response. Models with extended thinking enabled can handle more complex, multi-step tasks.
| Setting | Behavior |
|---|---|
| Disabled | No extended thinking; fastest responses |
| Low | Light reasoning; suitable for most conversational tasks |
| Medium | Moderate reasoning; good for structured data tasks |
| High | Deep reasoning; best for complex analysis or workflows |
Extended thinking adds latency. For real-time voice conversations, keep this low or disabled.
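A simple way to apply that guidance is to choose the budget by channel. The setting values mirror the table above; how they map onto your agent configuration depends on your SDK's schema:

```python
# Sketch: pick a thinking budget by channel, per the guidance above.
# The returned values mirror the settings table; treat the channel
# names as illustrative.
def thinking_budget(channel: str) -> str:
    # Real-time voice: extended thinking adds latency, so disable it.
    if channel == "voice":
        return "disabled"
    # Text chat tolerates more latency; light reasoning is a safe default.
    return "low"
```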
## Reasoning effort (workflow steps)
For agents that use visual workflow steps (rather than single-turn conversation), you can set the reasoning effort per step.
| Setting | When to use |
|---|---|
| None | Real-time conversation nodes — always use this to minimize latency |
| Low | Simple classification or routing steps |
| Medium | Multi-condition logic or data transformation |
| High | Complex analysis, synthesis, or decision-making steps |
Do not enable reasoning effort for real-time conversation nodes. It introduces significant latency that degrades the voice experience.
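The table above can be sketched as a per-step lookup. The step-type names here are illustrative, not part of the XUNA AI API:

```python
# Sketch: choosing reasoning effort per workflow step, following the
# table above. Step-type keys are illustrative names, not API values.
STEP_EFFORT = {
    "conversation": "none",     # real-time nodes: always minimize latency
    "routing": "low",           # simple classification or routing
    "transformation": "medium", # multi-condition logic, data transformation
    "analysis": "high",         # complex analysis, synthesis, decisions
}

def reasoning_effort(step_type: str) -> str:
    """Return a reasonable reasoning-effort setting for a step type."""
    return STEP_EFFORT.get(step_type, "low")
```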
## Custom LLM
If you want to use a model not on the supported list — including fine-tuned or self-hosted models — you can point the agent at your own endpoint.
### Expose an OpenAI-compatible endpoint
Your endpoint must implement the OpenAI Chat Completions API contract (`POST /chat/completions`).
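A framework-agnostic sketch of that contract: a handler that accepts a Chat Completions request body and returns an OpenAI-shaped response. The echo reply is a stand-in for your actual model call; mount this behind whatever HTTP framework you use:

```python
# Minimal sketch of an OpenAI Chat Completions-compatible handler.
# The echo logic is a placeholder for your real model inference.
import time
import uuid

def chat_completions(body: dict) -> dict:
    """Handle a POST /chat/completions request body; return an
    OpenAI Chat Completions-shaped response dict."""
    # Find the most recent user message in the conversation.
    user_text = next(
        (m["content"] for m in reversed(body.get("messages", []))
         if m.get("role") == "user"),
        "",
    )
    reply = f"Echo: {user_text}"  # replace with your model's output
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "custom"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0,
                  "total_tokens": 0},
    }
```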
### Configure the agent
Provide the endpoint URL and any required credentials.
```python
from xuna_ai import XunaAI

client = XunaAI()

# Point the agent at your self-hosted, OpenAI-compatible endpoint.
agent = client.conversational_ai.agents.update(
    agent_id="your-agent-id",
    conversation_config={
        "agent": {
            "llm": {
                "model_id": "custom",
                "custom_llm": {
                    "url": "https://your-llm-endpoint.example.com/chat/completions",
                    "extra_headers": {
                        "Authorization": "Bearer YOUR_API_KEY"
                    },
                },
                "temperature": 0.5,
            }
        }
    },
)
```
## Pricing
LLM usage is billed per 1 million tokens, with separate rates for input tokens, output tokens, and cached input tokens. Pricing varies by model. Check the XUNA AI pricing page for current rates.
Cached input tokens (e.g., the system prompt portion of your context) are billed at a lower rate than uncached input tokens on supported models. Long, stable system prompts benefit automatically from caching.
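To illustrate how the three rates combine — the per-million-token rates in this example are hypothetical placeholders, not XUNA AI's actual pricing:

```python
# Illustrative cost calculation. All rates are USD per 1M tokens and are
# hypothetical; check the XUNA AI pricing page for real figures.
def llm_cost(input_tokens, cached_tokens, output_tokens,
             input_rate, cached_rate, output_rate):
    """cached_tokens is the subset of input tokens served from the
    prompt cache, billed at the lower cached rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000
```

With a long, stable system prompt, a large share of input tokens hits the cache, so the effective input cost drops accordingly.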