Experiments let you route a slice of real traffic to a variant of your agent configuration — a different prompt, voice, LLM, or workflow — and measure the impact on outcomes before committing to the change. Winners can be promoted to the main branch or rolled back at any time, with full version history preserved. Experiments are built on agent versioning. Enable versioning for your agent before you create your first experiment.

How experiments work

1. Create a variant

Navigate to your agent’s Branches tab and click Create branch. The branch starts as a copy of your current configuration. Modify anything — system prompt, voice, tools, knowledge base, LLM, evaluation criteria, or language.
Example: testing a shorter system prompt
Branch name: shorter-prompt-v1
Change: Reduced system prompt from 800 words to 300 words,
        focusing on the three most common user intents.
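Conceptually, a branch is a deep copy of the current configuration with your overrides applied on top. A minimal sketch of that idea; the field names and values below are illustrative, not the platform's actual schema:

```python
from copy import deepcopy

def create_branch(base_config: dict, name: str, overrides: dict) -> dict:
    """Copy the base configuration and apply the branch's overrides.

    deepcopy ensures edits to the branch never mutate the base config.
    """
    branch = deepcopy(base_config)
    branch.update(overrides)
    branch["branch_name"] = name
    return branch

# Hypothetical main configuration (illustrative field names):
main = {
    "branch_name": "main",
    "system_prompt": "(800-word prompt)",
    "voice": "default-voice",
    "llm": "default-llm",
}

# The variant changes exactly one thing, per the example above:
variant = create_branch(main, "shorter-prompt-v1",
                        {"system_prompt": "(300-word prompt)"})
```

Everything not overridden (voice, LLM, tools) stays identical to main, which is what makes the later comparison a test of the one change you made.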
2. Route traffic to the variant

Click Edit traffic split in the Branches panel and assign a percentage of conversations to your branch. The percentages across all branches must total exactly 100%. Traffic routing is deterministic based on conversation ID, so the same user always reaches the same branch throughout an experiment. This prevents confusing experiences where a user gets a different agent on each call.
Example traffic split
main:              90%
shorter-prompt-v1: 10%
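Deterministic routing of this kind is typically implemented by hashing a stable identifier into a fixed bucket. A sketch of the general technique, not the platform's actual implementation:

```python
import hashlib

def route(conversation_id: str, split: dict) -> str:
    """Map a conversation ID to a branch deterministically.

    `split` maps branch names to integer percentages totalling 100.
    The same ID always hashes to the same bucket, so a user keeps
    the same branch for the whole experiment.
    """
    assert sum(split.values()) == 100, "traffic split must total exactly 100%"
    # Use a stable hash: Python's built-in hash() is salted per process.
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    threshold = 0
    for branch, pct in split.items():
        threshold += pct
        if bucket < threshold:
            return branch
    raise AssertionError("unreachable when percentages total 100")

split = {"main": 90, "shorter-prompt-v1": 10}
# The same conversation always resolves to the same branch:
assert route("conv-123", split) == route("conv-123", split)
```

Across many conversations the bucket counts converge to the configured percentages, while any single conversation ID is pinned to one branch.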
3. Measure the impact

Click See analytics from the Branches panel to compare metrics between branches side by side. The analytics view shows the same metrics as the main analytics dashboard.
| Metric | What to watch for |
| --- | --- |
| CSAT | Did user satisfaction improve or drop? |
| Containment rate | Is the variant resolving more or fewer conversations? |
| Conversion | Did the change affect goal completion rates? |
| Average handling time | Is the variant faster or slower? |
| Median agent response latency | Did latency change with the new LLM or tools? |
| Cost per agent resolution | Is the variant more or less expensive to operate? |
Give the experiment enough time to accumulate statistically meaningful data before drawing conclusions.
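"Statistically meaningful" can be checked with a standard two-proportion z-test on any rate metric, such as containment or conversion. A self-contained sketch using only the standard library; the conversation counts below are made up for illustration:

```python
from math import sqrt, erf

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference between two rates,
    e.g. containment rate on main vs. a variant branch."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical early result: 72% containment on main (n=900)
# vs. 78% on the variant (n=100). The gap looks large, but with
# only 100 variant conversations it is not yet significant:
z, p = two_proportion_z(648, 900, 78, 100)
```

This is why the guidance below says to wait: with a 90/10 split, the variant's sample grows ten times slower than main's, and apparent early wins often dissolve.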
4. Promote the winner

Once you have a clear result, either increase the traffic share of the winning branch or merge it into the main branch. Full version history is preserved, so you can roll back to any previous state if needed.

What you can test

| Configuration area | Examples |
| --- | --- |
| System prompt | Tone, length, persona, instructions |
| Workflow | Conversation flow logic, branching conditions |
| Voice | Different voices, speaking style, speed |
| Tools | Adding, removing, or reordering tool calls |
| Knowledge base | Different document sets or chunking strategies |
| LLM | GPT vs. Claude vs. Gemini, or different model versions |
| Evaluation criteria | Testing new success definitions before rolling them out |
| Language | Comparing performance across locales |

Best practices

- Define what you expect to happen and why before you create a branch. For example: “Shortening the system prompt will reduce average handling time without affecting CSAT, because the current prompt contains redundant instructions the model already follows by default.” A hypothesis makes it easier to interpret results and decide whether a change is worth keeping.
- Keep each experiment focused on a single variable. If you change the system prompt and the voice in the same branch, you won’t know which change drove the result.
- Configure evaluation criteria before you start. Without success scoring, you can only compare operational metrics like latency and cost, not whether the agent is actually helping users.
- Route a small percentage of traffic to your variant initially. This limits the impact if the change performs worse than expected, and lets you catch obvious regressions before scaling up.
- Don’t conclude an experiment after a handful of conversations. Wait until you have enough data to see consistent patterns, typically at least a few hundred conversations, depending on your traffic volume.
- Promote a clear winner and close the branch promptly. Long-running experiments complicate your version history and make it harder to run follow-up tests.
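The "few hundred conversations" guideline can be made concrete with a textbook sample-size approximation for comparing two rates. A sketch with illustrative numbers; the constants are the standard normal quantiles for 95% confidence and 80% power:

```python
from math import ceil

def sample_size_per_branch(p_base: float, p_variant: float,
                           z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Approximate conversations needed per branch to detect a change
    from rate p_base to rate p_variant (~95% confidence, ~80% power)."""
    effect = abs(p_variant - p_base)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = ((z_alpha + z_power) ** 2 * variance) / effect ** 2
    return ceil(n)

# Detecting a containment lift from 70% to 75% needs on the order of
# a thousand conversations per branch:
n = sample_size_per_branch(0.70, 0.75)
```

Note that this is per branch: with only 10% of traffic routed to the variant, the variant side accumulates conversations slowly, which is why small expected effects imply long experiments or larger traffic shares.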

Common use cases

Prompt optimization

Test a more concise system prompt to reduce cost and latency without sacrificing answer quality.

Voice selection

Compare two voices on the same conversation flow to find which one users respond to better.

LLM comparison

Measure the cost and quality tradeoffs between different language models for your specific use case.

Knowledge base updates

Validate that a new document set improves containment rate before replacing the existing one.

Next steps

Analytics

Understand the metrics you will use to evaluate experiment results.

Conversation analysis

Set up evaluation criteria before running your first experiment.