Experiments let you route a slice of real traffic to a variant of your agent configuration — a different prompt, voice, LLM, or workflow — and measure the impact on outcomes before committing to the change. Winners can be promoted to the main branch or rolled back at any time, with full version history preserved. Experiments are built on agent versioning. Enable versioning for your agent before you create your first experiment.

How experiments work

1. Create a variant

Navigate to your agent’s Branches tab and click Create branch. The branch starts as a copy of your current configuration. Modify anything — system prompt, voice, tools, knowledge base, LLM, evaluation criteria, or language.
Example: testing a shorter system prompt
Branch name: shorter-prompt-v1
Change: Reduced system prompt from 800 words to 300 words,
        focusing on the three most common user intents.
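Conceptually, a branch is a deep copy of the current configuration with your overrides applied on top. A minimal sketch of that idea; the field names and values below are illustrative, not the platform's actual schema:

```python
from copy import deepcopy

def create_branch(base_config: dict, name: str, overrides: dict) -> dict:
    """Copy the base configuration and apply the branch's overrides.

    deepcopy ensures edits to the branch never mutate the base config.
    """
    branch = deepcopy(base_config)
    branch.update(overrides)
    branch["branch_name"] = name
    return branch

# Hypothetical main configuration (illustrative field names):
main = {
    "branch_name": "main",
    "system_prompt": "(800-word prompt)",
    "voice": "default-voice",
    "llm": "default-llm",
}

# The variant changes exactly one thing, per the example above:
variant = create_branch(main, "shorter-prompt-v1",
                        {"system_prompt": "(300-word prompt)"})
```

Everything not overridden (voice, LLM, tools) stays identical to main, which is what makes the later comparison a test of the one change you made.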
2. Route traffic to the variant

Click Edit traffic split in the Branches panel and assign a percentage of conversations to your branch. The percentages across all branches must total exactly 100%. Traffic routing is deterministic based on conversation ID, so the same user always reaches the same branch throughout an experiment. This prevents confusing experiences where a user gets a different agent on each call.
Example traffic split
main:              90%
shorter-prompt-v1: 10%
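Deterministic routing of this kind is typically implemented by hashing a stable identifier into a fixed bucket. A sketch of the general technique, not the platform's actual implementation:

```python
import hashlib

def route(conversation_id: str, split: dict) -> str:
    """Map a conversation ID to a branch deterministically.

    `split` maps branch names to integer percentages totalling 100.
    The same ID always hashes to the same bucket, so a user keeps
    the same branch for the whole experiment.
    """
    assert sum(split.values()) == 100, "traffic split must total exactly 100%"
    # Use a stable hash: Python's built-in hash() is salted per process.
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    threshold = 0
    for branch, pct in split.items():
        threshold += pct
        if bucket < threshold:
            return branch
    raise AssertionError("unreachable when percentages total 100")

split = {"main": 90, "shorter-prompt-v1": 10}
# The same conversation always resolves to the same branch:
assert route("conv-123", split) == route("conv-123", split)
```

Across many conversations the bucket counts converge to the configured percentages, while any single conversation ID is pinned to one branch.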
3. Measure the impact

Click See analytics from the Branches panel to compare metrics between branches side by side. The analytics view shows the same metrics as the main analytics dashboard.
| Metric | What to watch for |
| --- | --- |
| CSAT | Did user satisfaction improve or drop? |
| Containment rate | Is the variant resolving more or fewer conversations? |
| Conversion | Did the change affect goal completion rates? |
| Average handling time | Is the variant faster or slower? |
| Median agent response latency | Did latency change with the new LLM or tools? |
| Cost per agent resolution | Is the variant more or less expensive to operate? |
Give the experiment enough time to accumulate statistically meaningful data before drawing conclusions.
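"Statistically meaningful" can be checked with a standard two-proportion z-test on any rate metric, such as containment or conversion. A self-contained sketch using only the standard library; the conversation counts below are made up for illustration:

```python
from math import sqrt, erf

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference between two rates,
    e.g. containment rate on main vs. a variant branch."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical early result: 72% containment on main (n=900)
# vs. 78% on the variant (n=100). The gap looks large, but with
# only 100 variant conversations it is not yet significant:
z, p = two_proportion_z(648, 900, 78, 100)
```

This is why the guidance below says to wait: with a 90/10 split, the variant's sample grows ten times slower than main's, and apparent early wins often dissolve.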
4. Promote the winner

Once you have a clear result, either increase the traffic share of the winning branch or merge it into the main branch. Full version history is preserved, so you can roll back to any previous state if needed.

What you can test

| Configuration area | Examples |
| --- | --- |
| System prompt | Tone, length, persona, instructions |
| Workflow | Conversation flow logic, branching conditions |
| Voice | Different voices, speaking style, speed |
| Tools | Adding, removing, or reordering tool calls |
| Knowledge base | Different document sets or chunking strategies |
| LLM | GPT vs. Claude vs. Gemini, or different model versions |
| Evaluation criteria | Testing new success definitions before rolling them out |
| Language | Comparing performance across locales |

Best practices

- Define what you expect to happen and why before you create a branch. For example: “Shortening the system prompt will reduce average handling time without affecting CSAT, because the current prompt contains redundant instructions the model already follows by default.” A hypothesis makes it easier to interpret results and decide whether a change is worth keeping.
- Keep each experiment focused on a single variable. If you change the system prompt and the voice in the same branch, you won’t know which change drove the result.
- Configure evaluation criteria before you start. Without success scoring, you can only compare operational metrics like latency and cost, not whether the agent is actually helping users.
- Route a small percentage of traffic to your variant initially. This limits the impact if the change performs worse than expected, and lets you catch obvious regressions before scaling up.
- Don’t conclude an experiment after a handful of conversations. Wait until you have enough data to see consistent patterns, typically at least a few hundred conversations, depending on your traffic volume.
- Promote a clear winner and close the branch promptly. Long-running experiments complicate your version history and make it harder to run follow-up tests.
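The "few hundred conversations" guideline can be made concrete with a textbook sample-size approximation for comparing two rates. A sketch with illustrative numbers; the constants are the standard normal quantiles for 95% confidence and 80% power:

```python
from math import ceil

def sample_size_per_branch(p_base: float, p_variant: float,
                           z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Approximate conversations needed per branch to detect a change
    from rate p_base to rate p_variant (~95% confidence, ~80% power)."""
    effect = abs(p_variant - p_base)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = ((z_alpha + z_power) ** 2 * variance) / effect ** 2
    return ceil(n)

# Detecting a containment lift from 70% to 75% needs on the order of
# a thousand conversations per branch:
n = sample_size_per_branch(0.70, 0.75)
```

Note that this is per branch: with only 10% of traffic routed to the variant, the variant side accumulates conversations slowly, which is why small expected effects imply long experiments or larger traffic shares.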

Common use cases

Prompt optimization

Test a more concise system prompt to reduce cost and latency without sacrificing answer quality.

Voice selection

Compare two voices on the same conversation flow to find which one users respond to better.

LLM comparison

Measure the cost and quality tradeoffs between different language models for your specific use case.

Knowledge base updates

Validate that a new document set improves containment rate before replacing the existing one.

Next steps

Analytics

Understand the metrics you will use to evaluate experiment results.

Conversation analysis

Set up evaluation criteria before running your first experiment.