The language model (LLM) you choose determines response quality, latency, capability, and cost. XUNA AI Conversational AI supports models from OpenAI, Google, and Anthropic, XUNA AI-hosted open models, and custom endpoints you host yourself.

Supported models

| Provider | Models |
| --- | --- |
| XUNA AI | GLM-4.5-Air, Qwen3-30B-A3B, GPT-OSS-120B |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Gemini 2.0 Flash, Gemini 2.0 Flash Lite |
| OpenAI | GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo |
| Anthropic | Claude Sonnet 4.5, Claude Sonnet 4, Claude Haiku 4.5, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3 Haiku |
Certain models are available for HIPAA-compliant environments. Contact XUNA AI support to confirm which models are covered under your Business Associate Agreement.

Temperature

Temperature controls how creative or conservative the model’s responses are. Lower values produce more predictable, factual responses. Higher values produce more varied, creative ones.
| Range | Label | Best for |
| --- | --- | --- |
| 0.0 – 0.3 | Low | Customer support, FAQ bots, tasks requiring accuracy |
| 0.4 – 0.7 | Medium | General-purpose assistants, balanced creativity and accuracy |
| 0.8 – 1.0 | High | Creative applications, brainstorming, entertainment |
Start at medium (0.5) and adjust based on how the agent responds in testing.
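As a sketch of how a starting temperature might be chosen and applied, the helper below builds the `conversation_config` fragment in the same shape as the `agents.update` example later on this page. The `TEMPERATURE_BANDS` mapping and `llm_config` helper are illustrative, not part of the XUNA AI SDK.

```python
# Illustrative starting points drawn from the table above.
TEMPERATURE_BANDS = {
    "support": 0.2,    # low: FAQ bots, accuracy-critical tasks
    "assistant": 0.5,  # medium: the recommended starting point
    "creative": 0.9,   # high: brainstorming, entertainment
}

def llm_config(model_id: str, use_case: str) -> dict:
    """Build the conversation_config fragment that sets the temperature."""
    return {
        "agent": {
            "llm": {
                "model_id": model_id,
                "temperature": TEMPERATURE_BANDS[use_case],
            }
        }
    }

config = llm_config("gpt-4o-mini", "assistant")
```

You would pass a dict like `config` as `conversation_config` when updating the agent, then tune the value up or down based on test conversations.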

Backup LLM

If the primary model is unavailable, the backup LLM takes over to keep conversations running.
| Option | Behavior |
| --- | --- |
| Default | XUNA AI automatically selects a reliable fallback model |
| Custom | You specify which model to use as the fallback |
| Disabled | No fallback; the conversation fails if the primary model is unavailable |
Disabling the backup LLM is strongly discouraged for production agents. A model outage will immediately interrupt live conversations with no recovery path.
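The fallback behavior can be pictured with a short sketch. This is conceptual only; XUNA AI performs the fallback server-side, so nothing here is SDK code.

```python
# Conceptual model of the three backup-LLM options.
def generate_with_fallback(prompt, primary, backup=None):
    try:
        return primary(prompt)
    except Exception:
        if backup is None:
            raise  # "Disabled": the conversation fails outright
        return backup(prompt)  # "Default"/"Custom": the backup takes over

def unavailable_primary(prompt):
    raise RuntimeError("primary model unavailable")

reply = generate_with_fallback("Hello", unavailable_primary,
                               backup=lambda p: "backup reply")
```

With a backup configured, an outage degrades gracefully to the fallback model; with backup disabled, the same outage surfaces directly to the caller.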

Thinking budget

The thinking budget controls how much reasoning a model performs before generating a response. Models with extended thinking enabled can handle more complex, multi-step tasks.
| Setting | Behavior |
| --- | --- |
| Disabled | No extended thinking; fastest responses |
| Low | Light reasoning; suitable for most conversational tasks |
| Medium | Moderate reasoning; good for structured data tasks |
| High | Deep reasoning; best for complex analysis or workflows |
Extended thinking adds latency. For real-time voice conversations, keep this low or disabled.
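As a rule of thumb, the choice reduces to interaction mode plus task complexity. The helper below is illustrative only, not an SDK call, and the complexity labels are assumptions for this sketch.

```python
def thinking_budget(realtime_voice: bool, complexity: str) -> str:
    """Pick a thinking-budget setting. Extended thinking adds latency,
    so real-time voice conversations always get 'disabled'."""
    if realtime_voice:
        return "disabled"
    return {"simple": "low", "structured": "medium", "complex": "high"}[complexity]
```

For example, a text-only agent doing complex analysis would get `"high"`, while any real-time voice agent stays at `"disabled"` regardless of task.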

Reasoning effort (workflow steps)

For agents that use visual workflow steps (rather than single-turn conversation), you can set the reasoning effort per step.
| Setting | When to use |
| --- | --- |
| None | Real-time conversation nodes; always use this to minimize latency |
| Low | Simple classification or routing steps |
| Medium | Multi-condition logic or data transformation |
| High | Complex analysis, synthesis, or decision-making steps |
Do not enable reasoning effort for real-time conversation nodes. It introduces significant latency that degrades the voice experience.
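The table above can be condensed into a mapping from a step's role to its effort setting. The step-kind names here are assumptions for this sketch, not actual workflow node types.

```python
# Illustrative mapping from workflow-step role to reasoning effort.
def reasoning_effort(step_kind: str) -> str:
    if step_kind == "conversation":
        return "none"    # real-time nodes: minimize latency
    if step_kind in ("classification", "routing"):
        return "low"
    if step_kind in ("logic", "transformation"):
        return "medium"
    return "high"        # analysis, synthesis, decision-making
```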

Custom LLM

If you want to use a model not on the supported list — including fine-tuned or self-hosted models — you can point the agent at your own endpoint.
1. Expose an OpenAI-compatible endpoint

   Your endpoint must implement the OpenAI Chat Completions API contract (`POST /chat/completions`).

2. Configure the agent

   Provide the endpoint URL and any required credentials.
```python
from xuna_ai import XunaAI

client = XunaAI()

agent = client.conversational_ai.agents.update(
    agent_id="your-agent-id",
    conversation_config={
        "agent": {
            "llm": {
                "model_id": "custom",
                "custom_llm": {
                    "url": "https://your-llm-endpoint.example.com/chat/completions",
                    "extra_headers": {
                        "Authorization": "Bearer YOUR_API_KEY"
                    }
                },
                "temperature": 0.5,
            }
        }
    }
)
```
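For step 1, here is a minimal sketch of what an OpenAI-compatible endpoint could look like, using only the Python standard library. The echoed reply is a stand-in for your model; a production endpoint would add authentication, streaming (SSE), and real error handling.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def chat_completion(body: dict) -> dict:
    """Build a Chat Completions-shaped response for a request body."""
    user_msg = body["messages"][-1]["content"]
    return {
        "id": "chatcmpl-demo",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "custom"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": f"Echo: {user_msg}"},
            "finish_reason": "stop",
        }],
    }

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps(chat_completion(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve: HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```

Point the `custom_llm.url` in the configuration above at wherever you deploy this handler.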

Pricing

LLM usage is billed per 1 million tokens, with separate rates for input tokens, output tokens, and cached input tokens. Pricing varies by model. Check the XUNA AI pricing page for current rates.
Cached input tokens (e.g., the system prompt portion of your context) are billed at a lower rate than uncached input tokens on supported models. Long, stable system prompts benefit automatically from caching.
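The billing arithmetic can be sketched as follows. The rates below are made up for illustration; check the pricing page for real numbers.

```python
def conversation_cost(input_tokens, cached_tokens, output_tokens,
                      input_rate, cached_rate, output_rate):
    """Cost in USD. Rates are per 1M tokens; cached_tokens is the
    cached subset of input_tokens, billed at the lower cached rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000

# e.g. 8k input tokens (6k of them a cached system prompt), 1k output,
# at hypothetical rates of $2.50 / $1.25 / $10.00 per 1M tokens:
cost = conversation_cost(8_000, 6_000, 1_000, 2.50, 1.25, 10.00)  # 0.0225
```

Note how the long cached system prompt cuts the input cost: the same 8k input tokens with no caching would cost 8,000 × $2.50 / 1M = $0.02 on input alone.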