The large language model (LLM) you choose determines response quality, latency, capability, and cost. XUNA AI Conversational AI supports models from OpenAI, Google, Anthropic, and XUNA AI-hosted open models, as well as custom endpoints you host yourself.
## Supported models
| Provider | Models |
|---|---|
| XUNA AI | GLM-4.5-Air, Qwen3-30B-A3B, GPT-OSS-120B |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Gemini 2.0 Flash, Gemini 2.0 Flash Lite |
| OpenAI | GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo |
| Anthropic | Claude Sonnet 4.5, Claude Sonnet 4, Claude Haiku 4.5, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3 Haiku |
Certain models are available in HIPAA-compliant environments. Contact XUNA AI support to confirm which models are covered under your Business Associate Agreement.
## Temperature
Temperature controls how creative or conservative the model’s responses are. Lower values produce more predictable, factual responses. Higher values produce more varied, creative ones.
| Range | Label | Best for |
|---|---|---|
| 0.0 – 0.3 | Low | Customer support, FAQ bots, tasks requiring accuracy |
| 0.4 – 0.7 | Medium | General-purpose assistants, balanced creativity and accuracy |
| 0.8 – 1.0 | High | Creative applications, brainstorming, entertainment |
Start at medium (0.5) and adjust based on how the agent responds in testing.
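As a sketch, temperature sits inside the same `conversation_config` structure shown in the custom LLM example later on this page. The model identifier below is a placeholder, not a confirmed ID string:

```python
# Sketch: setting temperature in an agent's conversation_config.
# The structure mirrors the custom-LLM example on this page;
# "gemini-2.5-flash" is a placeholder model identifier.
conversation_config = {
    "agent": {
        "llm": {
            "model_id": "gemini-2.5-flash",  # placeholder identifier
            "temperature": 0.5,              # medium: balanced starting point
        }
    }
}
```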
## Backup LLM
If the primary model is unavailable, the backup LLM takes over to keep conversations running.
| Option | Behavior |
|---|---|
| Default | XUNA AI automatically selects a reliable fallback model |
| Custom | You specify which model to use as the fallback |
| Disabled | No fallback — conversation fails if the primary model is unavailable |
Disabling the backup LLM is strongly discouraged for production agents. A model outage will immediately interrupt live conversations with no recovery path.
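The three options above can be sketched as failover logic. This is illustrative only — XUNA AI performs this failover server-side — but it shows the behavior each option implies:

```python
# Illustrative failover sketch. XUNA AI handles this server-side; this
# just demonstrates what each backup-LLM option means in practice.
def complete_with_fallback(primary, backup, prompt):
    """Call `primary`; on failure, fall back to `backup` if one is set.

    backup=None corresponds to the "Disabled" option: the conversation
    fails when the primary model is unavailable.
    """
    try:
        return primary(prompt)
    except Exception:
        if backup is None:
            raise  # Disabled: no recovery path
        return backup(prompt)  # Default/Custom: fallback takes over
```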
## Thinking budget
The thinking budget controls how much reasoning a model performs before generating a response. Models with extended thinking enabled can handle more complex, multi-step tasks.
| Setting | Behavior |
|---|---|
| Disabled | No extended thinking; fastest responses |
| Low | Light reasoning; suitable for most conversational tasks |
| Medium | Moderate reasoning; good for structured data tasks |
| High | Deep reasoning; best for complex analysis or workflows |
Extended thinking adds latency. For real-time voice conversations, keep this low or disabled.
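A simple way to apply that guidance is to choose the budget by channel. The setting values mirror the table above; how they map onto your agent configuration depends on your SDK's schema:

```python
# Sketch: pick a thinking budget by channel, per the guidance above.
# The returned values mirror the settings table; treat the channel
# names as illustrative.
def thinking_budget(channel: str) -> str:
    # Real-time voice: extended thinking adds latency, so disable it.
    if channel == "voice":
        return "disabled"
    # Text chat tolerates more latency; light reasoning is a safe default.
    return "low"
```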
## Reasoning effort (workflow steps)
For agents that use visual workflow steps (rather than single-turn conversation), you can set the reasoning effort per step.
| Setting | When to use |
|---|---|
| None | Real-time conversation nodes — always use this to minimize latency |
| Low | Simple classification or routing steps |
| Medium | Multi-condition logic or data transformation |
| High | Complex analysis, synthesis, or decision-making steps |
Do not enable reasoning effort for real-time conversation nodes. It introduces significant latency that degrades the voice experience.
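The table above can be sketched as a per-step lookup. The step-type names here are illustrative, not part of the XUNA AI API:

```python
# Sketch: choosing reasoning effort per workflow step, following the
# table above. Step-type keys are illustrative names, not API values.
STEP_EFFORT = {
    "conversation": "none",     # real-time nodes: always minimize latency
    "routing": "low",           # simple classification or routing
    "transformation": "medium", # multi-condition logic, data transformation
    "analysis": "high",         # complex analysis, synthesis, decisions
}

def reasoning_effort(step_type: str) -> str:
    """Return a reasonable reasoning-effort setting for a step type."""
    return STEP_EFFORT.get(step_type, "low")
```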
## Custom LLM
If you want to use a model not on the supported list — including fine-tuned or self-hosted models — you can point the agent at your own endpoint.
### Expose an OpenAI-compatible endpoint
Your endpoint must implement the OpenAI Chat Completions API contract (`POST /chat/completions`).
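A framework-agnostic sketch of that contract: a handler that accepts a Chat Completions request body and returns an OpenAI-shaped response. The echo reply is a stand-in for your actual model call; mount this behind whatever HTTP framework you use:

```python
# Minimal sketch of an OpenAI Chat Completions-compatible handler.
# The echo logic is a placeholder for your real model inference.
import time
import uuid

def chat_completions(body: dict) -> dict:
    """Handle a POST /chat/completions request body; return an
    OpenAI Chat Completions-shaped response dict."""
    # Find the most recent user message in the conversation.
    user_text = next(
        (m["content"] for m in reversed(body.get("messages", []))
         if m.get("role") == "user"),
        "",
    )
    reply = f"Echo: {user_text}"  # replace with your model's output
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "custom"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0,
                  "total_tokens": 0},
    }
```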
### Configure the agent
Provide the endpoint URL and any required credentials.
```python
from xuna_ai import XunaAI

client = XunaAI()

# Point the agent at your self-hosted, OpenAI-compatible endpoint.
agent = client.conversational_ai.agents.update(
    agent_id="your-agent-id",
    conversation_config={
        "agent": {
            "llm": {
                "model_id": "custom",
                "custom_llm": {
                    "url": "https://your-llm-endpoint.example.com/chat/completions",
                    "extra_headers": {
                        "Authorization": "Bearer YOUR_API_KEY"
                    },
                },
                "temperature": 0.5,
            }
        }
    },
)
```
## Pricing
LLM usage is billed per 1 million tokens, with separate rates for input tokens, output tokens, and cached input tokens. Pricing varies by model. Check the XUNA AI pricing page for current rates.
Cached input tokens (e.g., the system prompt portion of your context) are billed at a lower rate than uncached input tokens on supported models. Long, stable system prompts benefit automatically from caching.
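To illustrate how the three rates combine — the per-million-token rates in this example are hypothetical placeholders, not XUNA AI's actual pricing:

```python
# Illustrative cost calculation. All rates are USD per 1M tokens and are
# hypothetical; check the XUNA AI pricing page for real figures.
def llm_cost(input_tokens, cached_tokens, output_tokens,
             input_rate, cached_rate, output_rate):
    """cached_tokens is the subset of input tokens served from the
    prompt cache, billed at the lower cached rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000
```

With a long, stable system prompt, a large share of input tokens hits the cache, so the effective input cost drops accordingly.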