Anatomy of a system prompt
A strong system prompt covers four areas:
- Identity: Who the agent is and what it represents.
- Goal: The primary task the agent should accomplish.
- Constraints: What the agent should and should not do.
- Tone: How the agent should communicate.
Prioritize Clarity and Brevity
Write instructions using direct, actionable language with minimal excess wording. Focus only on the information the agent needs to perform correctly, and avoid repetitive or unnecessary phrasing.
Why this improves performance: Short, well-structured instructions are easier for AI systems to interpret consistently. Reducing unnecessary language lowers the chance of confusion, conflicting behaviors, or unintended outputs.
Reinforce High-Priority Instructions
Draw attention to mission-critical actions by clearly labeling them as essential or high priority within the prompt structure. Repeating the most important operational rules in more than one section can improve consistency and reduce the likelihood of them being ignored.
Why this improves reliability: In larger prompts, AI models can shift focus toward newer context as conversations evolve. Strategic emphasis and selective repetition help maintain adherence to the most important instructions throughout execution.
Text Formatting & Speech Optimization
Voice synthesis systems generally perform best on fully written language rather than raw symbols, numbers, or special characters. Inputs such as “#”, “@”, currency symbols, phone numbers, or numeric strings can lead to inaccurate pronunciation, distorted speech output, or unintended voice behavior. To improve spoken accuracy, XUNA converts complex text patterns into speech-friendly language before audio generation occurs. For example:
- 2500 → “two thousand five hundred”
- support@xuna.ai → “support at xuna dot ai”
- $99 → “ninety nine dollars”
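To make the idea of speech-friendly conversion concrete, here is a minimal sketch in Python. It is not XUNA's normalizer. The function names and the three rules (emails, whole-dollar amounts, integers up to 999,999) are assumptions chosen only to reproduce the examples above.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 without hyphens or 'and'."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def _email_to_words(match: re.Match) -> str:
    local, domain = match.group(1), match.group(2)
    return local.replace(".", " dot ") + " at " + domain.replace(".", " dot ")


def _dollars_to_words(match: re.Match) -> str:
    amount = int(match.group(1))
    return number_to_words(amount) + (" dollar" if amount == 1 else " dollars")


def normalize_for_speech(text: str) -> str:
    """Hypothetical normalizer: rewrite emails, $ amounts, and integers as words."""
    # Emails are handled first, before the number rules run.
    text = re.sub(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b", _email_to_words, text)
    # Whole-dollar amounts of up to six digits, such as $99.
    text = re.sub(r"\$(\d{1,6})\b", _dollars_to_words, text)
    # Remaining standalone integers of up to six digits.
    text = re.sub(r"\b\d{1,6}\b", lambda m: number_to_words(int(m.group(0))), text)
    return text
```

Calling normalize_for_speech("$99") returns "ninety nine dollars". Longer digit strings, decimals, and phone numbers are left unchanged by this sketch; a production normalizer would need rules for those as well.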
Available Normalization Modes
XUNA supports configurable normalization behavior through the text_normalization_type setting in the agent configuration panel.
system_prompt (Default)
This mode instructs the language model to rewrite symbols, abbreviations, and numerical values into spoken-word format before sending content to the voice engine.
Benefits
- No additional processing delay
- Lightweight and fast
- Works well for most conversational use cases
Limitations
- AI-generated normalization may occasionally be inconsistent
- Conversation transcripts display fully written phrases instead of original formatting
  - Example: “five hundred dollars” instead of “$500”
xuna_normalizer
This mode applies XUNA’s dedicated speech optimization layer after AI generation and before voice rendering.
Benefits
- Higher consistency and pronunciation accuracy
- Preserves natural transcript formatting
  - Example: “$500” remains visible in transcripts
- Keeps system prompts cleaner and less instruction-heavy
Limitations
- Adds a small amount of processing latency before speech playback
Where to Configure This
Inside the XUNA platform, navigate to: Agent Settings → Voice Configuration → Advanced Voice Options. From there, locate the text normalization settings near the bottom of the voice configuration panel and select the strategy that best matches your use case.
Structured Input Handling for Tools
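When the system_prompt normalization option is enabled, the AI may convert symbols, special characters, and numeric values into spoken-language equivalents during conversation generation. For example:
- john@gmail.com may be interpreted as “john at gmail dot com”
- 404-555-1212 may appear as “four zero four five five five one two one two”
A tool that expects the original structured value can reject or mishandle these spoken forms, so state the required format in each tool parameter description.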
Guardrails
Clearly define all mandatory behavioral rules the model must always follow in a dedicated # Guardrails section. Language models are specifically trained to treat this heading as high-priority instruction space, making it more effective for enforcing critical constraints and compliance requirements. This section should include any non-optional rules related to:
- Safety and policy compliance
- Restricted or prohibited actions
- Response formatting requirements
- Privacy and security expectations
- Escalation or fallback behavior
- Tool usage limitations
- Brand or tone restrictions
Tool Configuration for Reliability
Agents designed to manage transactional or multi-step workflows become significantly more effective when they have access to external tools. These tools allow the agent to retrieve real-time information, interact with third-party systems, and complete actions on the user’s behalf.
The way tools are configured and described is just as important as prompt design. Well-defined, action-focused tool descriptions help the model:
- Select the correct tool for a task
- Supply accurate parameters
- Understand expected outcomes
- Recover more effectively from failures or invalid responses
Describe tools precisely with detailed parameters
When creating a tool, add descriptions to all parameters. This helps the LLM construct tool calls accurately.
Tool description: “Looks up customer order status by order ID and returns current status, estimated delivery date, and tracking number.”
Parameter descriptions:
- order_id (required): “The unique order identifier, formatted as written characters (e.g., ‘ORD123456’)”
- include_history (optional): “If true, returns full order history including status changes”
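The description above could be captured in a tool definition along the following lines. This is a sketch in a JSON-Schema-like shape, not XUNA's actual tool schema: the tool name get_order_status, the field names, and the helper undocumented_parameters are assumptions for illustration.

```python
# Hypothetical tool definition. The keys ("name", "description", "parameters")
# follow a common JSON-Schema-style convention and are assumptions here.
ORDER_STATUS_TOOL = {
    "name": "get_order_status",
    "description": (
        "Looks up customer order status by order ID and returns current status, "
        "estimated delivery date, and tracking number."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": (
                    "The unique order identifier, formatted as written "
                    "characters (e.g., 'ORD123456')"
                ),
            },
            "include_history": {
                "type": "boolean",
                "description": "If true, returns full order history including status changes",
            },
        },
        "required": ["order_id"],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """Return the names of parameters that lack a non-empty description."""
    properties = tool["parameters"]["properties"]
    return [
        name
        for name, spec in properties.items()
        if not spec.get("description", "").strip()
    ]
```

A check such as undocumented_parameters can run in review or CI so that no parameter ships without a description.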
Explain when and how to use each tool in the system prompt
Clearly define in your system prompt when and how each tool should be used. Don’t rely solely on tool descriptions. Provide usage context and sequencing logic as well.
Specify Expected Formats in Tool Parameter Descriptions
When a tool expects structured inputs such as email addresses, phone numbers, account IDs, confirmation codes, or dates, define the required format clearly in the parameter description. Including an example helps the model provide correctly formatted values consistently. This is especially important in voice and conversational systems, where speech-to-text normalization may convert structured data into spoken-language forms. For example:
- "john dot smith at gmail dot com" instead of john.smith@gmail.com
- "five five five one two three four" instead of 5551234
State the expected format explicitly, for example:
- Email format: user@example.com
- Phone number format: +1-555-123-4567
- Date format: YYYY-MM-DD
- Order ID format: ORD-12345
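A tool backend can also check incoming values against the same formats before acting on them. The sketch below is one possible way to do that in Python. The patterns are deliberately simple assumptions derived from the example formats above, not complete validators (the email pattern in particular is a rough shape check).

```python
import re
from datetime import datetime

# Hypothetical patterns matching the example formats listed above.
FORMATS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+1-\d{3}-\d{3}-\d{4}"),
    "order_id": re.compile(r"ORD-\d{5}"),
}
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid(kind: str, value: str) -> bool:
    """Return True if value matches the expected structured format for kind."""
    if kind == "date":
        # Check the YYYY-MM-DD shape first, then that it is a real calendar date.
        if not DATE_SHAPE.fullmatch(value):
            return False
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            return False
        return True
    return FORMATS[kind].fullmatch(value) is not None
```

With this in place, a spoken-form value such as "john dot smith at gmail dot com" fails validation, and the tool can return a clear error that prompts the agent to ask again instead of proceeding with bad input.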
Handle Tool Call Failures Gracefully
External tools may occasionally fail due to network interruptions, invalid parameters, service outages, authentication issues, or missing data. Your system prompt should include explicit recovery instructions so the agent can respond safely and consistently when failures occur.
Without clear failure-handling behavior, models may attempt to guess missing information, fabricate successful outcomes, or continue a workflow using incorrect assumptions. Defining recovery behavior helps prevent hallucinations and improves production reliability. Recommended guardrails include:
- Never invent tool results when a request fails
- Acknowledge the failure clearly and transparently
- Retry when appropriate and safe to do so
- Request clarification if required inputs are missing or invalid
- Offer fallback actions or alternative paths when possible
- Preserve workflow context so the interaction can continue smoothly after recovery
Example prompt instructions:
- “If a tool returns an error, do not fabricate a response.”
- “Explain the failure briefly and ask the user how they would like to proceed.”
- “Retry transient failures once before escalating.”
- “If required data is unavailable, request the missing information explicitly.”
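The same recovery rules can be mirrored in the code that executes tool calls. The following is a minimal sketch, assuming a hypothetical TransientToolError raised by the tool layer for retryable failures such as timeouts. It retries once and otherwise returns a structured failure, so that no result is ever invented.

```python
from typing import Any, Callable


class TransientToolError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a timeout)."""


def call_tool_safely(
    tool: Callable[..., Any], *args: Any, retries: int = 1, **kwargs: Any
) -> dict:
    """Call a tool, retry transient failures, and report errors explicitly."""
    attempts = 0
    while True:
        attempts += 1
        try:
            result = tool(*args, **kwargs)
            return {"ok": True, "result": result, "attempts": attempts}
        except TransientToolError as exc:
            # Retry up to `retries` times, then hand off instead of guessing.
            if attempts > retries:
                return {
                    "ok": False,
                    "error": str(exc),
                    "attempts": attempts,
                    "action": "escalate",
                }
        except Exception as exc:
            # Non-transient failures are not retried; ask the user how to proceed.
            return {
                "ok": False,
                "error": str(exc),
                "attempts": attempts,
                "action": "ask_user",
            }
```

The "action" values are illustrative labels. The point of the design is that the agent always receives an explicit failure object it can describe to the user, never a silent gap it might fill with a fabricated answer.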
Architecture Patterns for Enterprise Agents
Strong prompts and reliable tools are essential, but enterprise-grade agents also need a well-designed architecture. In production environments, agents often manage workflows that are too complex for a single, all-purpose prompt to handle effectively.
Keep Agents Specialized
Avoid giving one agent too many responsibilities. Broad instructions, oversized context windows, and large knowledge scopes can increase latency, reduce accuracy, and make behavior harder to predict. Each agent should have:
- A focused purpose
- A clearly defined knowledge base
- A limited set of tools
- Specific success criteria
- Well-scoped responsibilities
Use Dispatcher and Specialist Patterns
Architecture pattern:
- Dispatch Agent: Routes incoming requests to appropriate specialist agents based on intent classification
- Functional or Specialist Agents: Handle domain-specific tasks (billing, scheduling, technical support, etc.)
- Human in the Loop: Defined handoff criteria for complex or sensitive cases
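The pattern above can be sketched as a small routing function. This is an illustration only: the keyword lists, the Route type, and the rule that ambiguous requests go to a human are assumptions, and a production dispatcher would more likely use an LLM or a trained classifier for intent detection.

```python
from dataclasses import dataclass

# Hypothetical keyword lists standing in for real intent classification.
INTENT_KEYWORDS = {
    "billing": ("invoice", "charge", "refund", "payment"),
    "scheduling": ("appointment", "reschedule", "book", "calendar"),
    "technical_support": ("error", "crash", "not working", "bug"),
}
# Hypothetical terms that should always trigger a human handoff.
SENSITIVE_TERMS = ("lawyer", "legal", "fraud", "complaint")


@dataclass
class Route:
    target: str  # a specialist agent name, or "human"
    reason: str


def dispatch(message: str) -> Route:
    """Route a message to one specialist agent, or to a human when unsure."""
    text = message.lower()
    if any(term in text for term in SENSITIVE_TERMS):
        return Route("human", "sensitive topic")
    matches = [
        intent
        for intent, words in INTENT_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    if len(matches) == 1:
        return Route(matches[0], "single intent matched")
    return Route("human", "ambiguous intent" if matches else "no clear intent")
```

Keeping the handoff rules in one place makes them easy to audit: every path out of dispatch names both a target and the reason it was chosen.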
Define Clear Handoff Criteria
In multi-agent systems, clearly define when control should transfer between agents or escalate to a human operator. Handoff rules should be based on specific conditions such as user intent, task complexity, missing information, failed tool calls, security concerns, or low-confidence responses.
Well-defined handoff logic improves workflow reliability, prevents unnecessary loops between agents, and ensures sensitive or high-risk interactions are handled appropriately. It also helps maintain context continuity as requests move across different parts of the system.
Large Language Model Selection for Enterprise Reliability
Choosing the right model depends on the specific performance requirements of your application, especially around latency, accuracy, reasoning depth, and tool-calling consistency. Different models provide different tradeoffs between response speed, operational cost, contextual understanding, and workflow reliability.
Larger models typically offer stronger reasoning, better instruction adherence, and more reliable tool usage, making them well suited for complex workflows and high-accuracy tasks. Smaller models often provide lower latency and reduced operational costs, making them a good fit for high-volume or real-time interactions. Enterprise systems should evaluate models based on:
- Response latency requirements
- Tool-calling reliability
- Instruction-following consistency
- Context window needs
- Reasoning complexity
- Cost efficiency at scale
Understand the Tradeoffs
When selecting a model for production systems, it’s important to balance latency, reasoning capability, cost, and tool-calling performance based on the needs of your workflow.
- Latency: Smaller models typically respond faster, making them better suited for high-frequency interactions, lightweight workflows, and real-time experiences.
- Accuracy: Larger models generally provide stronger reasoning, better instruction adherence, and improved performance on complex or multi-step tasks, though they often come with increased latency and operational cost.
- Tool-Calling Reliability: Models vary in how consistently they handle structured outputs and function calls. Some models perform well with minimal guidance, while others may require stricter prompting and more explicit parameter definitions to achieve reliable execution.
Model Recommendations by Use Case
Based on large-scale enterprise deployments, different models perform better depending on the balance between latency, reasoning capability, tool-calling reliability, and operational cost.
- GPT-4o / GLM 4.5 Air: Balanced Enterprise Performance
  Recommended as a strong default for general-purpose enterprise agents where speed, accuracy, and cost efficiency all matter. These models provide reliable tool-calling performance with moderate latency, making them well suited for customer support, scheduling, order management, and general inquiry workflows.
- Gemini 2.5 Flash Lite: Ultra-Low Latency Workloads
  Best suited for lightweight, high-frequency interactions where response speed is the primary requirement. This model is highly cost-effective at scale and works well for routing, triage, simple FAQs, appointment confirmations, and basic information collection, though it may be less effective for complex reasoning or advanced tool orchestration.
- Claude Sonnet 4 / 4.5: Advanced Reasoning and Orchestration
  Designed for complex workflows that require deeper reasoning, nuanced decision-making, and reliable multi-step tool execution. These models typically deliver stronger performance on technically challenging or high-risk tasks, including troubleshooting, compliance-sensitive operations, financial guidance, and advanced escalation handling, with the tradeoff of higher latency and cost.
Benchmark With Your Actual Prompts
Model performance can vary significantly depending on prompt design, workflow complexity, and tool usage patterns. Before selecting a production model, evaluate multiple candidates using the exact prompts and workflows your system will run in production. A reliable benchmarking process should include:
- Testing 2–3 candidate models against the same system prompt
- Evaluating real user interactions or high-quality synthetic test cases
- Measuring latency, response accuracy, and tool-calling success rates
- Comparing operational cost against workflow reliability
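A benchmarking loop along these lines might look like the sketch below. It assumes each candidate model is wrapped in a plain Python callable that takes the system prompt and a user input and returns text. That wrapper, the case format, and the report fields are assumptions for illustration, not a XUNA API.

```python
import time
from statistics import mean
from typing import Callable

Model = Callable[[str, str], str]          # (system_prompt, user_input) -> output
Case = tuple[str, Callable[[str], bool]]   # (user_input, check on the output)


def benchmark(
    models: dict[str, Model], system_prompt: str, cases: list[Case]
) -> dict[str, dict[str, float]]:
    """Run every candidate against the same prompt and the same test cases."""
    if not cases:
        raise ValueError("at least one test case is required")
    report = {}
    for name, model in models.items():
        latencies = []
        passed = 0
        for user_input, check in cases:
            start = time.perf_counter()
            output = model(system_prompt, user_input)
            latencies.append(time.perf_counter() - start)
            passed += bool(check(output))
        report[name] = {
            "pass_rate": passed / len(cases),
            "mean_latency_s": mean(latencies),
        }
    return report
```

Because every model sees identical prompts and cases, differences in pass rate and latency can be attributed to the model rather than to the test setup. Cost per request can be added as another field once real API clients are plugged in.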
A/B Testing
Production reliability comes from continuous iteration and structured testing. Even well-designed prompts can fail in real-world scenarios, so long-term performance depends on identifying weaknesses, refining behavior, and validating improvements over time.
Configure Evaluation Criteria
Define measurable evaluation criteria for each agent to track performance and detect regressions as workflows evolve. Common metrics include:
- Task Completion Rate: Percentage of requests successfully resolved
- Escalation Rate: Percentage of interactions requiring human intervention
- Tool Success Rate: Reliability of tool execution and structured outputs
- Response Accuracy: Consistency and correctness of generated responses
- User Satisfaction: Feedback scores or resolution quality indicators
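Several of these metrics reduce to simple ratios over conversation records. The sketch below assumes each record is a dictionary with the hypothetical keys completed, escalated, tool_calls, and tool_failures. The real export format will depend on your platform.

```python
def summarize(conversations: list[dict]) -> dict[str, float]:
    """Compute completion, escalation, and tool success rates as fractions."""
    total = len(conversations)
    if total == 0:
        raise ValueError("no conversations to evaluate")
    tool_calls = sum(c["tool_calls"] for c in conversations)
    tool_failures = sum(c["tool_failures"] for c in conversations)
    return {
        "task_completion_rate": sum(bool(c["completed"]) for c in conversations) / total,
        "escalation_rate": sum(bool(c["escalated"]) for c in conversations) / total,
        # With no tool calls at all, report 1.0 rather than dividing by zero.
        "tool_success_rate": (tool_calls - tool_failures) / tool_calls if tool_calls else 1.0,
    }
```

Tracking the same three numbers before and after each prompt change gives a quick signal of whether a revision helped or introduced a regression.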
Analyze Failure Patterns
When an agent underperforms, review failed or low-satisfaction interactions to identify recurring issues and behavioral gaps. Examples:
- Incorrect responses → Strengthen or clarify prompt instructions
- Poor intent recognition → Add examples or simplify wording
- Edge-case failures → Introduce additional guardrails
- Frequent tool errors → Improve parameter definitions and recovery logic
Make Targeted Refinements
Avoid broad prompt rewrites whenever possible. Instead, isolate and improve the specific components causing failures. Best practices:
- Identify the exact prompt section or tool definition responsible
- Test updates against known failure cases
- Make one change at a time to isolate impact
- Re-run the same evaluations to confirm improvements
- Monitor for unintended regressions after deployment
Configure Data Collection
Configure your agent to capture structured summaries and key metadata from each conversation. Collecting interaction data makes it easier to identify recurring user intents, analyze workflow performance, detect failure patterns, and improve prompts based on real production usage. Useful data points may include:
- User intent categories
- Task completion outcomes
- Escalation events
- Tool usage patterns
- Failed interactions
- User satisfaction indicators
- Common edge cases or unsupported requests
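One way to keep these data points consistent is to define a single record type per conversation. The following sketch uses a Python dataclass. The field names are assumptions chosen to mirror the list above and do not reflect a XUNA data schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ConversationSummary:
    """Hypothetical structured summary captured at the end of a conversation."""

    conversation_id: str
    intent: str
    completed: bool
    escalated: bool
    tools_used: list[str] = field(default_factory=list)
    failed_tools: list[str] = field(default_factory=list)
    satisfaction_score: Optional[int] = None  # None when no feedback was given
    notes: str = ""  # free text for edge cases or unsupported requests

    def to_json(self) -> str:
        """Serialize with sorted keys so stored records are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)
```

For example, ConversationSummary("c-1", "billing", True, False, tools_used=["get_order_status"]).to_json() produces one JSON object per conversation that can be appended to a log or loaded into an analytics table.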
Use Simulation for Regression Testing
Before deploying prompt updates to production, test changes against a predefined set of known scenarios to identify regressions early. Simulation-based testing helps verify that new prompt modifications improve behavior without breaking existing workflows or introducing unintended side effects. Your regression test set should include:
- Common user requests
- Previously failed interactions
- Edge cases and ambiguous inputs
- Tool-calling workflows
- Escalation scenarios
- Safety and guardrail checks
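A regression suite of this kind can be as simple as a list of named scenarios, each with an input and a check on the agent's reply. The sketch below assumes the agent under test is exposed as a Python callable from user text to reply text. That wrapper and the scenario format are assumptions for illustration.

```python
from typing import Callable

# (scenario name, user input, check applied to the agent's reply)
Scenario = tuple[str, str, Callable[[str], bool]]


def run_regression(agent: Callable[[str], str], scenarios: list[Scenario]) -> list[str]:
    """Return the names of scenarios whose check fails for this agent."""
    failures = []
    for name, user_input, check in scenarios:
        try:
            ok = bool(check(agent(user_input)))
        except Exception:
            ok = False  # a crash in the agent or the check counts as a regression
        if not ok:
            failures.append(name)
    return failures
```

Running the same scenario list against the current and the updated prompt shows exactly which known cases a change breaks, and an empty list is a reasonable gate for promoting the update.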
Production Considerations
Enterprise agents need safeguards that go beyond prompt design. Production systems should include clear error handling, compliance controls, monitoring, and fallback behavior so agents can continue operating safely when workflows fail, tools return incomplete data, or sensitive cases require human review.
Handle Errors Across All Tool Integrations
Every external tool or API call introduces a potential point of failure. Production agents should include explicit error-handling instructions so that failures are communicated clearly, safely, and consistently. Common failure scenarios and example responses include:
- Network Failures: “I’m having trouble connecting to the system right now. Let me try again.”
- Missing Data: “I’m unable to find that information. Please verify the details and try again.”
- Timeout Errors: “The request is taking longer than expected. I can retry or escalate this to a specialist.”
- Permission or Access Errors: “I don’t have access to that information. Let me connect you with someone who can assist further.”
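In an integration layer, the scenarios above can be mapped from error types to user-facing messages. The sketch below uses Python's built-in exception classes as stand-ins. Which exceptions your tools actually raise, and the default message for unknown errors, are assumptions.

```python
# Hypothetical fallback for errors that match none of the known categories.
DEFAULT_MESSAGE = (
    "Something went wrong on my side. Let me connect you with someone who can help."
)

# Built-in exception types used as stand-ins for each failure scenario.
FALLBACK_MESSAGES = {
    ConnectionError: "I'm having trouble connecting to the system right now. Let me try again.",
    LookupError: "I'm unable to find that information. Please verify the details and try again.",
    TimeoutError: "The request is taking longer than expected. I can retry or escalate this to a specialist.",
    PermissionError: "I don't have access to that information. Let me connect you with someone who can assist further.",
}


def user_message_for(error: Exception) -> str:
    """Pick the user-facing message for a tool error, with a safe default."""
    for error_type, message in FALLBACK_MESSAGES.items():
        if isinstance(error, error_type):
            return message
    return DEFAULT_MESSAGE
```

Because the lookup uses isinstance, subclasses such as KeyError (a LookupError) or ConnectionResetError (a ConnectionError) resolve to the matching message without extra entries.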
Example Prompts
The following examples demonstrate how the reliability principles covered in this guide can be applied to real-world enterprise workflows. Each example highlights key concepts such as prompt structure, guardrails, tool configuration, escalation handling, and multi-agent coordination to show how reliable production systems are designed in practice.
Example 1: Billing and Subscription support - Functional agent
Example 2: Refund and Support - Functional agent
Demonstrated Principles
- ✓ Specialized agent scope focused exclusively on refund handling
- ✓ Structured workflow instructions defined in the # Goal section
- ✓ Reinforced critical policies such as verification and refund approval limits
- ✓ Clear tool configuration with explicit usage requirements and validation steps
- ✓ Structured parameter formatting guidance for emails and order IDs
- ✓ Dedicated error-handling instructions for tool failures and recovery
- ✓ Clearly defined escalation rules for high-risk or policy-sensitive requests
Formatting Best Practices
Prompt formatting plays a major role in how effectively a language model interprets instructions and prioritizes behavior. Recommended best practices include:
- Use Markdown headings: Organize prompts using # for primary sections and ## for subsections
- Prefer bulleted lists: Break instructions into smaller, scannable steps for better readability and instruction parsing
- Use whitespace intentionally: Separate sections and logical instruction groups with blank lines to improve structure
- Keep headings in sentence case: Use formats like # Goal instead of # GOAL
- Maintain consistent formatting: Apply the same heading styles, spacing, and list patterns throughout the prompt
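These conventions are easy to apply mechanically when prompts are assembled in code. The following sketch builds a prompt from a mapping of section names to instruction lists. The function name build_prompt and the input shape are assumptions for illustration.

```python
def build_prompt(sections: dict[str, list[str]]) -> str:
    """Render sections as '# Heading' blocks with bulleted instructions."""
    blocks = []
    for heading, instructions in sections.items():
        # Sentence case: first letter upper, the rest lower ("GOAL" -> "Goal").
        # Note that this would also lowercase acronyms inside a heading.
        title = heading.strip().capitalize()
        bullets = "\n".join(f"- {line.strip()}" for line in instructions)
        blocks.append(f"# {title}\n\n{bullets}")
    # Blank lines between blocks keep the sections visually separated.
    return "\n\n".join(blocks) + "\n"
```

For example, build_prompt({"GOAL": ["Resolve billing questions."], "guardrails": ["Never share account data without verification."]}) yields a prompt with a "# Goal" section followed by a "# Guardrails" section, each containing one bullet.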
Voice-specific writing tips
System prompts for voice agents differ from those for text chatbots. Keep these in mind:
Keep instructions action-oriented
Tell the agent what to do, not what to avoid. “Confirm the customer’s order number before proceeding” is clearer than “Don’t skip order confirmation.”
Use pronunciation hints for tricky terms
If your brand name or product has unusual pronunciation, spell it phonetically in the prompt. For example: “Our product is called Qwirl (pronounced ‘kwerl’).”
Specify response length
Voice responses should be short. Add an instruction like: “Keep each response to 2-3 sentences. Ask one follow-up question at a time.”
Define escalation behavior explicitly
Agents work best when they know exactly when to stop trying. State the escalation condition clearly: “If you cannot answer after two attempts, transfer the call using the transfer tool.”
Using dynamic variables in the prompt
You can inject per-session data into the prompt using {{variable_name}} placeholders. These are resolved at conversation start before the agent processes anything.
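To illustrate what placeholder resolution involves, here is a minimal sketch in Python. It is not XUNA's implementation: the function render_prompt, the variable names in the example, the tolerance for spaces inside the braces, and the choice to raise an error on missing variables are all assumptions.

```python
import re

# Matches {{variable_name}}, optionally with spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its per-session value."""
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template) if name not in variables}
    )
    if missing:
        # Failing loudly avoids sending a prompt with unresolved placeholders.
        raise KeyError(f"missing variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)
```

For example, rendering "You are speaking with {{customer_name}} about order {{order_id}}." with customer_name set to "Ada" and order_id set to "ORD-12345" gives "You are speaking with Ada about order ORD-12345."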
Prompt size limit
The maximum system prompt size is 2 MB. If you need to supply large amounts of reference information, use the knowledge base instead. It is retrieved dynamically and does not count toward the prompt limit.
Setting the prompt via API
Changes to the system prompt take effect immediately for new conversations. Conversations already in progress continue with the prompt that was active at their start.
Frequently Asked Questions
How can I maintain consistency across multiple agents?
Use shared prompt templates for common sections such as guardrails, error handling, formatting standards, and response behavior. Centralizing reusable components helps maintain consistent behavior across specialist agents while simplifying updates and long-term maintenance.
What should every production prompt include?
At a minimum, production prompts should define:
- The agent’s role or personality
- The primary workflow or objective
- Core guardrails and behavioral restrictions
- Tool instructions and error-handling behavior when tools are used
How should tool deprecation be handled?
Introduce replacement tools before removing existing ones. Update prompts to prioritize the new tool while keeping the older version available as a temporary fallback. Monitor usage and remove deprecated tools only after confirming they are no longer actively used.
Do prompts need to change across different models?
Most well-structured prompts transfer effectively across modern models, but model-specific tuning can still improve performance. Differences in reasoning style, latency, and tool-calling behavior may require adjustments to formatting, examples, or instruction clarity.
How long should a system prompt be?
There is no recommended length beyond the 2 MB platform maximum, but excessively large prompts increase latency, cost, and complexity. Keep prompts focused and intentional. If a prompt becomes too large, consider splitting responsibilities across multiple specialized agents or moving reference material into an external knowledge source.
How do I balance consistency with adaptability?
Keep core instructions, goals, and guardrails stable while allowing flexibility in tone and response style based on the user’s behavior or communication style. Conditional instructions can help agents adapt dynamically without compromising reliability.
Can prompts be updated after deployment?
Yes. Production prompts should evolve over time as new edge cases, workflows, and failure patterns emerge. Test prompt updates in staging environments before deploying changes to production systems.
How can hallucinations be reduced when tools fail?
Include explicit recovery instructions for every tool integration. Reinforce policies such as “never guess or fabricate information” in both the guardrails section and tool-specific error handling. Testing failure scenarios during development is critical for validating safe recovery behavior.
Next Steps
This guide provides the foundation for building reliable enterprise agents through structured prompting, tool configuration, testing, and architectural design. From here, continue expanding your system with:
- Workflow Design: Build multi-agent orchestration and specialist routing logic
- Evaluation Systems: Configure success metrics and performance monitoring
- Data Collection: Capture structured insights from production conversations
- Testing Pipelines: Implement regression testing and simulation workflows
- Guardrails: Strengthen moderation and behavioral safety systems
- Privacy & Compliance: Enforce secure data handling and regulatory requirements
- Case Studies: Analyze real production agent implementations and deployment patterns

