Guide: Building an in-app AI agent
A practical roadmap for teams building an AI agent (or assistant, or copilot) into their product: the side-panel assistant, like Notion AI or Breeze AI, that takes actions for users or explains what's on screen.
This guide tells you: what to build, what to maintain, what breaks, and what most teams underestimate. We share what we learned building this ourselves at Tandem, so you know what's ahead before you start.
This guide is 100% hand-written. We wrote it because we believe all software will have AI agents users can prompt, whether built in-house or via Tandem. We are on the same mission, and this is just the start of the category. Your feedback to make it better is welcome.
You can read it at two levels:
• General: For product (ICs or leaders), founders, CEOs. It covers scope, user expectations, and operational overhead. You can skip the technical callouts without losing the thread.
• Technical: For engineers and CTOs. Collapsed Technical sections throughout go deeper on architecture, performance, and implementation.
Introduction
You are working on the right thing.
ChatGPT has 800 million weekly active users, processing over 2.5 billion prompts per day. Claude has around 20 million users. Everyone is conditioned to prompting.
Your users already type questions into ChatGPT about your product. They paste error messages, ask how to configure features, try to understand dashboards. But ChatGPT fails them: it can't see their screen, doesn't know their account, can't click a button, can't fill a form, has no idea what page they're on. It guesses. They want that same interaction, inside your app, where it can actually do something.
Building a production-ready, scalable in-app AI agent is harder than it looks, and the fact that shipping an MVP is surprisingly easy only reinforces the illusion. Here are two examples:
Example 1. User types: "fill this form for me." Behind that, the agent must:
See the page: find the form, enumerate fields, required flags, and validation.
Know the user: pull profile, company, and history from backend and memory.
Decide: autofill vs. confirm, and what to ask for.
Act: fill fields and handle async UI (dropdowns, date pickers, autocomplete).
Recover: read validation errors and retry or ask.
Stay fast: seconds matter.
Example 2. On every message, the agent picks what to do: answer from docs? Read the screen? Call an API? Click something? Ask a clarifying question? If you've used Claude or ChatGPT with tools, you've seen it pick wrong: code interpreter instead of .docx, web search when the answer was in the file. Your agent faces that routing decision on every turn, in your customer's product, in front of their users.
What your users expect
Before architecture, before code, what does the user actually want?
You do not need all of this for v1. But you need to know where the bar is, because your users have already been trained by ChatGPT and Claude to expect it. Start with one capability done well. Know that the rest is coming.
What the agent needs to see and know
Your users expect the agent to see what they see. And to know some information about them.
They expect your agent to know:
The current page. Not a description of the page. The actual content: form fields, their current values, buttons and their states (enabled, disabled, loading), error messages, table rows, tabs, modals. If the user is looking at a billing dashboard with an overdue invoice, the agent should know there's an overdue invoice.
Who they are. Their plan, their role, their permissions, their company. The user shouldn't have to re-introduce themselves.
What they've done. If they asked a question yesterday, the agent should remember. If they completed half a setup flow last week, the agent should know where they left off. Context shouldn't reset every session.
Your data. When the user asks "what's my usage this month?", they expect a real number. Not a link to a dashboard. The number.
What the agent needs to do
Users don't separate "answering questions" from "doing things." They expect both in the same interaction.
Answer questions using real context. Answers that reference their data, their page, their situation.
Do things on the page. Fill fields, click buttons, toggle settings, select options. Not instructions. Actions.
Do things in the backend. "Cancel my subscription" shouldn't require navigating to a settings page. The agent should call the API directly.
Navigate for them. "Take me to billing" should take them to billing.
Follow through across pages. "Set up invoicing" might require 4 pages and 12 steps. The user expects the agent to know the whole path, not just the current step.
What the agent should never feel like
The moment your agent recites steps the user can already see on screen, or asks "what page are you on?" when it's embedded in the page, trust is gone and hard to recover. The bar: the agent should feel like a colleague who can see your screen and act on it, not a search engine that talks.
Speed is the other killer. A 3-second response feels broken. A 5-second response and they've already closed the chat. Your agent is competing with "I'll just figure it out myself," and that takes about 4 seconds of patience.
What the agent needs to access
The agent is only as useful as the context it has. Here's every source of context and what makes each one hard.
Current page context
Without page context, the agent is blind. It can't answer "what's this error?" if it can't see the error. It can't fill a form if it can't find the fields.
The agent runs on a live web page. It sees the DOM, not your source code. Not your component tree. The rendered page, same as the browser.
Finding elements. To act on a button, the agent needs to find it. That means generating a selector that survives the next deploy, the next A/B test, the next design system update. CSS classes change. IDs get removed. DOM structure shifts. The agent needs selectors robust enough to handle this, and fallbacks when they break: matching by visible text, by ARIA attributes, by position relative to other elements.
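One way to picture that fallback chain is as a pure function over flattened element descriptors. This is a sketch, not a production matcher: the dict fields (`css`, `text`, `aria`) are illustrative, and a real agent would query the live DOM instead of a list of dicts.

```python
def find_element(elements, target):
    """Resolve a target description against page elements, falling back
    from precise selectors to fuzzier signals as each strategy fails."""
    strategies = [
        # 1. Exact selector match: most precise, breaks on redeploys.
        lambda el: target.get("css") is not None and el.get("css") == target["css"],
        # 2. Visible text: survives class renames, breaks on copy changes.
        lambda el: target.get("text") is not None
                   and el.get("text", "").strip().lower() == target["text"].lower(),
        # 3. ARIA attributes: often the most stable signal of intent.
        lambda el: target.get("aria") is not None and el.get("aria") == target["aria"],
    ]
    for matches in strategies:
        hits = [el for el in elements if matches(el)]
        if len(hits) == 1:          # accept only unambiguous matches
            return hits[0]
    return None
```

The "only unambiguous matches" rule matters: two hits means the strategy was too fuzzy, so the chain keeps going rather than guessing.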
Handling invisible elements. Not everything in the DOM is real. Elements hidden by CSS, behind a modal, not yet rendered, inside collapsed sections, off-screen. The agent must filter these out. Acting on a hidden element is worse than acting on nothing.
Dynamic content. Forms don't render all at once. A field appears after a dropdown selection. A section loads via AJAX 2 seconds after the page. A button becomes enabled after validation. The agent needs to watch the DOM in real time, detecting when elements appear, disappear, or change state. All while running on a customer's production page without degrading performance.
Disambiguation. A table with 20 rows has 20 "Edit" buttons. When the user says "click Edit," which one? The agent must detect repeating patterns, determine whether matched elements are interchangeable (dropdown options) or distinct (different row actions), and pick the right one using contextual signals: what the user just interacted with, spatial proximity, which section is currently visible. This is a scoring problem, not a lookup.
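The scoring idea can be sketched in a few lines. The signal names and weights below are made up for illustration; a real system tunes them against logged decisions.

```python
def pick_candidate(candidates, context):
    """Score interchangeable-looking matches (e.g. 20 'Edit' buttons)
    using contextual signals, and return the best one."""
    def score(el):
        s = 0.0
        if el.get("visible"):
            s += 2.0                               # hidden elements should never win
        if el.get("section") == context.get("visible_section"):
            s += 1.5                               # element is in the section on screen
        if el.get("row") == context.get("last_interacted_row"):
            s += 3.0                               # user just touched this row
        s -= 0.01 * el.get("distance_px", 0)       # spatial proximity to last focus
        return s
    return max(candidates, key=score)
```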
Technical details: IFrames, DOM Observation, Form state tracking
Iframes. Many enterprise apps embed content in iframes. The agent must reach inside them, handle cross-origin restrictions, and coordinate actions across frame boundaries.
DOM observation. Typically uses MutationObserver for structural changes and polling for computed style changes. The performance budget is tight: a badly tuned observer causes visible jank on the host page. Debounce aggressively, batch mutations, and never read layout properties (offsetHeight, getBoundingClientRect) inside a mutation callback.
Form state tracking. The agent must distinguish pre-populated defaults from values the user or agent entered. This matters for deciding what to fill and what to leave alone.
Why not just take a screenshot?
Some agents use screenshots instead of reading the DOM. Send a screenshot to a vision model, let it figure out what's on screen. It works for demos. It doesn't scale. You get pixel coordinates, not a selector, and those break the moment the window resizes. Every screenshot is a vision model call (expensive, slow). And you lose structured information: a screenshot can't tell you a field is required, that a value failed validation, or that a dropdown has 47 options. DOM reading is harder to build. It's the only approach that works in production.
User data and history
The agent that forgets who you are between sessions is just a search bar with extra steps.
Identity and attributes. Plan, role, permissions, company, segment. Structured data you already have. You pass it through your integration script. The straightforward part.
Conversation history. What the user asked before, what the agent did, what worked, what didn't. This grows every interaction. You can't send all of it to the model each turn. But cutting it naively loses critical context: the user mentioned their goal three messages ago and now the agent asks again. You need compression that keeps decisions and outcomes and drops the mechanics.
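A minimal sketch of that compression, assuming each stored turn carries optional `decision` and `outcome` fields (an illustrative schema, not a prescribed one): recent turns stay verbatim, older turns collapse to one line each, and turns with no decision are dropped entirely.

```python
def compress_history(turns, keep_recent=4):
    """Keep the last few turns verbatim; collapse older turns to their
    decision and outcome, dropping the step-by-step mechanics."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = [
        {"role": "system",
         "content": f"Earlier: {t['decision']} -> {t['outcome']}"}
        for t in old
        if t.get("decision")          # turns with no decision carry no context worth keeping
    ]
    return summary + recent
```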
Session state. Where the user is in a workflow. What they completed. What they skipped. This state must persist across page transitions, tab closures, and session breaks. A user who left mid-setup on Tuesday and returns Thursday expects the agent to know exactly where they stopped.
Knowledge base
Users ask questions the page can't answer. "How do I set up SSO?" "What's the difference between the Pro and Enterprise plan?" The agent needs a source of truth beyond the screen.
Help docs, FAQs, product guides. You index them, embed them, retrieve relevant chunks when the user asks. This is RAG, and it's a solved problem at the infrastructure level. Services like Pinecone, Weaviate, AWS Bedrock Knowledge Bases, or OpenAI's built-in retrieval handle the core pipeline.
The real work is in two places.
Content management. Your docs change. Features get renamed. Pricing pages get updated. You need a pipeline that re-indexes when content changes, not a one-time import. If your knowledge base is a static export from six months ago, the agent gives confident wrong answers. Worse than no answer.
Knowing when to use it. The knowledge base is one source among several. The agent also has the current page, the user's data, its own reasoning. When the knowledge base says one thing and the page shows something different (because the docs are stale), which wins? This priority logic lives in the orchestration layer, not in the retrieval system. We cover it in section 04.
Retrieval quality matters at least as much as model quality for knowledge-grounded answers. Bad retrieval feeds wrong chunks to the model, which produces a confident wrong answer.
Technical details: Retrieval pipeline
Chunk your docs by semantic section, not fixed token count. Embed with a model trained on retrieval tasks, not a general-purpose embedder. Rerank results before injecting into context. Test retrieval quality independently from generation quality. A common mistake: evaluating the final answer without checking whether the right chunks were retrieved in the first place. If retrieval is wrong, no model will save you.
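A retrieval-only metric is cheap to compute. This sketch checks recall@k against hand-labeled relevant chunk IDs, so retrieval quality can regress visibly without ever calling a model:

```python
def retrieval_recall(retrieved_ids, relevant_ids, k=5):
    """Fraction of known-relevant chunks that appear in the top-k results.
    Evaluates retrieval on its own, before any answer is generated."""
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)
```

Run it over a labeled set of real user questions; if recall drops after a re-index, the problem is in the pipeline, not the model.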
Backend data (APIs, MCP)
When the user asks "what's my usage this month?", they expect a number. That means the agent needs to call your backend and return real data. Two approaches.
Direct API integration. The agent calls your existing endpoints. You define which ones it can use, handle auth, and map them to tool definitions the LLM can invoke. Simple and fast to ship. The limitation: every new capability means wiring up another endpoint manually.
MCP (Model Context Protocol). An open standard, originally from Anthropic, now under the Linux Foundation. MCP is a standardized wrapper around your APIs. You build an MCP server (a lightweight service, typically a few hundred lines of code in Python or TypeScript) that describes your capabilities as "tools" the agent can discover and call at runtime. OpenAI and Google have announced support; adoption is growing but still early in production.
The practical difference: direct integration is simpler when you have a small, fixed set of actions. MCP pays off when you have many capabilities, when you want the agent to discover tools dynamically, or when you want the same integration to work across different AI platforms.
Either way: the agent makes authenticated requests to your backend on behalf of the user. Auth tokens must flow through your agent's backend layer. The LLM never sees raw credentials. Every request needs proper scoping.
One thing to decide early: what can the agent do? Read-only queries ("what's my usage?") are low risk. Write operations ("cancel my subscription," "add a team member") need confirmation flows, audit logs, and clear boundaries on what's automated versus what requires user approval.
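A sketch of that gate, with hypothetical tool names. The point is that the read/write split lives in explicit data, not scattered if-statements, so it can be reviewed and extended without touching the agent loop:

```python
READ_ONLY = {"get_usage", "get_invoices", "list_team"}          # low risk: execute freely
WRITE     = {"cancel_subscription", "invite_member", "update_email"}  # need confirmation

def gate_tool_call(tool, user_confirmed=False):
    """Route a requested backend call through a simple risk policy:
    reads pass, writes need explicit confirmation, unknown tools are refused."""
    if tool in READ_ONLY:
        return "execute"
    if tool in WRITE:
        return "execute" if user_confirmed else "ask_confirmation"
    return "refuse"   # anything not explicitly allowed is denied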
Security: where the data flows
What does the LLM see? Every piece of context you send to the model is processed by a third-party LLM provider. You need to know exactly what's being sent, log it, and ensure it complies with your data handling policies.
What stays client-side? DOM observation, element finding, action execution can run entirely in the browser. Minimize what gets sent to the server. Send descriptions of page state, not raw DOM trees. Summarize user data, don't dump it.
Authentication. The agent needs the user's auth context (tokens, session), but the LLM should never see raw credentials. Auth flows must be handled in your agent's backend layer, not in prompts.
PII in conversations. Users will type personal information into the chat. They'll paste error messages with identifiers, share screenshots. Your conversation storage and LLM interactions need to account for this.
What the agent needs to do
Front-end actions
The user said "do it." Now the agent needs to actually interact with the page.
Click. Buttons, links, tabs, checkboxes, menus. Dispatching a click event sounds trivial. Some frameworks intercept native events. Some buttons need a mousedown, mouseup, click sequence. Some require a hover first to reveal a dropdown. The agent simulates realistic interaction patterns, not just fires events.
Fill. Text inputs, dropdowns, date pickers, file uploads, rich text editors. Each has its own input method. A dropdown might be a native <select>, a custom component that opens a searchable list, or a combobox that filters as the user types. The agent detects the type and adapts.
Fuzzy matching. The user says "France." The dropdown shows "FR - France." Or "🇫🇷 France (Metropolitan)." Or the options haven't loaded yet. The agent matches intent to available options with tolerance for format differences, abbreviations, and async loading.
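A first approximation of that matching is normalization plus token overlap. This is a sketch; real implementations layer on abbreviation tables and fuzzy string distance:

```python
import re
import unicodedata

def match_option(intent, options):
    """Match user intent ('France') to rendered dropdown options
    ('FR - France', '🇫🇷 France (Metropolitan)') despite formatting noise."""
    def normalize(s):
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if c.isascii())            # strip flag emoji etc.
        return re.sub(r"[^a-z0-9 ]", " ", s.lower()).split()
    want = set(normalize(intent))
    best, best_overlap = None, 0
    for opt in options:
        overlap = len(want & set(normalize(opt)))
        if overlap > best_overlap:
            best, best_overlap = opt, overlap
    return best   # None when nothing overlaps, e.g. options not yet loaded
```

Returning `None` instead of a weak guess is deliberate: "no match yet" routes back to the wait-and-retry path rather than selecting the wrong country.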
Navigate. Go to the right page. URL changes in traditional apps, route transitions in SPAs where the URL updates but the page doesn't reload, plus redirects, auth gates, and loading states in between.
Validate. After acting, the agent checks its own work. Did the field accept the value? Did a validation error appear? Acting without verifying is how agents create more problems than they solve.
Each action involves latency. The agent clicks, then waits for the page to react. A dropdown opens, options load asynchronously, the agent selects one, a dependent field appears, the agent fills that too. Every step is a wait, check, act cycle. Chain five of those together and timing becomes the real engineering problem.
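The wait, check, act cycle reduces to two small helpers. Timeouts and intervals here are placeholders; in practice they are tuned per action type:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll until the page reaches the expected state or the budget runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def act_and_verify(do_action, verify, timeout=5.0):
    """One step of the cycle: perform the action, then poll until the
    page confirms it took effect (or report failure for recovery)."""
    do_action()
    return wait_for(verify, timeout=timeout)
```

Chaining five `act_and_verify` calls makes the timing budget explicit: each step's timeout is latency the user feels, which is why the budgets have to be aggressive.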
Back-end actions
Users don't always want to watch the agent click through five pages. "Add a new team member" shouldn't mean: navigate to Settings, click Team, click Invite, fill the email field, select a role, click Send. It should mean: call the invitation API, confirm with the user, done.
Backend actions skip the DOM entirely. The agent calls your APIs or MCP servers to create, update, or fetch data directly. Faster, more reliable, no selector breakage.
What this requires:
An API surface the agent can call. Either your existing endpoints with proper scoping, or a dedicated agent API. Auth handling: the agent makes requests on behalf of the user, using their session context, without the LLM ever seeing raw credentials.
MCP servers, or direct integration. MCP (covered in section 02) lets agents discover and call your backend tools through a standardized interface.
Trade-offs:
API vs MCP: MCP is more flexible and portable. The alternative is direct API integration, where you wire each endpoint manually. Direct integration is simpler to start but harder to maintain as capabilities grow. Either way, you're giving the agent authenticated access to your backend, so security review matters early.
Clear rules for when to use backend vs. front-end. The user asks to change their email. Does the agent call the update API silently, or navigate to the profile page and fill the field so the user sees it? Backend is faster. Front-end builds trust and teaches the UI. You need a deliberate policy per action type, not a default.
Resources on MCP and backend integration
MCP official spec and docs: modelcontextprotocol.io/introduction covers the protocol, architecture, and concepts. SDKs available in Python, TypeScript, Java, C#, Go, and others.
Build your first MCP server: modelcontextprotocol.io/docs/develop/build-server is the official quickstart.
OpenAI's MCP integration guide: platform.openai.com/docs/mcp covers building MCP servers for ChatGPT Apps and API integrations.
Pre-built MCP servers: github.com/modelcontextprotocol/servers lists hundreds of community and official servers you can reference as patterns.
Why you need both
Front-end actions are visible. The user sees the agent clicking, typing, navigating. This builds trust and teaches them the product. It's also the only option for UI flows you don't control via API.
Backend actions are reliable. No DOM fragility, no timing issues, no selector breakage. Faster and deterministic.
The best agents blend both. Use backend actions for data operations (fetch, create, update). Use front-end actions for guided walkthroughs where the user needs to learn the UI. Let the agent decide based on context.
A note on testability
Every action the agent takes is something you'll need to reproduce in a test environment. Think about this now, not after launch. For front-end actions: record page snapshots during development. They become your test fixtures. For backend actions: mock your APIs from the start. The agent calling a live API in a test environment that actually cancels a subscription is a mistake you only make once. For the routing layer: log every decision with its full input context. This is your regression dataset.
Escalation and handoff
The agent will hit limits. What matters is what happens next.
"User needs help" is useless to your support team. "User tried to configure SSO, the agent filled the SAML endpoint field, validation failed with 'invalid URL format', user has been trying for 3 minutes" saves everyone 10 minutes of back-and-forth.
What gets passed along: The full conversation transcript. What the agent tried and what failed. Current page and URL. User attributes (plan, role, segment). Any error messages. The user shouldn't repeat anything.
When to escalate: Some topics always go to a human: billing disputes, account deletion, legal questions. Some patterns trigger it automatically: the agent failed the same action twice, the user said "talk to someone," confidence dropped below a threshold. These rules need to be configurable, not buried in code.
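Those triggers stay maintainable when they are plain data plus one function. The topics and thresholds below are placeholders to be configured per product, not recommendations:

```python
ALWAYS_HUMAN = {"billing_dispute", "account_deletion", "legal"}   # configurable, not hardcoded

def should_escalate(topic, failed_attempts, user_asked_for_human, confidence,
                    min_confidence=0.4, max_failures=2):
    """Return True when any escalation trigger fires."""
    if topic in ALWAYS_HUMAN:
        return True          # some topics always go to a human
    if user_asked_for_human:
        return True          # "talk to someone" is never argued with
    if failed_attempts >= max_failures:
        return True          # same action failed repeatedly
    if confidence < min_confidence:
        return True          # the agent itself is unsure
    return False
```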
Where it goes: The handoff should create a ticket or conversation in your existing support system, with full context attached. Even better: keep the handoff inside the same chat window. The user doesn't get bounced to a different interface. A human just joins the conversation.
Workflows across pages
Real tasks don't live on one page. "Set up invoicing" means: create a company profile in Settings, add a payment method in Billing, configure a template in Templates, send a first invoice in Invoices. Four pages, twelve steps, and the user expects the agent to know the whole path.
The path is not a list. It's a graph. Some steps depend on earlier choices. Enterprise users skip the payment page. Users in certain countries see an extra tax configuration step. Some steps repeat: "add another team member" loops back. The agent needs to tell an intentional loop from a user going in circles, and a user going back to fix something from a user who is stuck.
Users leave. They close the tab mid-flow, come back tomorrow. The agent should know where they left off. State has to survive page loads, SPA transitions, and session breaks.
Technical details: SPA detection
SPA detection is non-trivial. You can't rely on popstate or hashchange events alone. A robust approach combines URL polling, History API interception, and MutationObserver on the document body to detect full page swaps. Track navigation intent separately from navigation outcome: the user clicked a link, but did the page actually change?
When the agent fails
The agent will fail. A selector breaks. The API returns an error. The model picks the wrong tool. What happens next determines whether users trust the agent or abandon it.
Acknowledge, don't hide. "I tried to click the Save button but couldn't find it on this page" is far better than silence or a vague "Something went wrong." Users forgive mistakes. They don't forgive agents that pretend nothing happened.
Show what was attempted. If the agent tried two approaches and both failed, say so. This gives the user enough to decide what to do next, and gives your support team context if the conversation escalates.
Offer a next step, always. Even when the agent can't complete the task, it should propose something: a different approach, or a handoff to support with everything it's tried so far. Dead ends kill trust faster than errors do.
Undo and rollback. For actions the agent already took before failing mid-workflow, the user needs to know what changed and whether it can be reversed.
How the agent decides and what controls it
You have the context. You have the actions. Now the hard part: deciding what to do on every single turn, and controlling what the model sees while it decides.
The routing problem
Every user message requires a decision.
Act or ask? "Fill this form" could mean fill it immediately or confirm first. Auto-filling a search form: go ahead. Auto-submitting a payment form: ask first. The agent needs rules, and the rules need context.
Which tool? The agent has multiple capabilities: read the page, search knowledge, call an API, navigate, fill a field. For any given request, several could apply. "What's my usage?" could be answered from the dashboard (read the page), from your API (fetch data), or from the knowledge base (explain what "usage" means). The agent picks one. It picks wrong more often than you'd expect.
Which source? When answering a question, should the agent prioritize page content, knowledge base, API, or its own reasoning? These sources can contradict each other. The knowledge base says one thing, the page shows another (because the docs are outdated). The agent needs a priority order, and the order changes depending on the question type.
Chat or screen? The user asks "change my email." Should the agent update via API and confirm in chat? Or navigate to the profile page and fill the new email? Chat is faster. Screen teaches the user where to find it. You need an opinion on this, and the agent needs to reflect it.
Even the most capable models regularly choose the wrong action. You've seen this yourself: ask Claude to help with a document and it opens a code interpreter instead of writing a .docx. Same problem for your agent. This is a routing problem, not an intelligence problem.
Model differences matter here. Claude, GPT-4, and Gemini behave meaningfully differently in tool selection, instruction following, and how they handle ambiguity. Claude tends to be more conservative and literal with tool calls. GPT-4 is more aggressive about acting but sometimes overreaches. Gemini handles long context well but can be inconsistent with structured tool output. Your routing logic and prompts will need tuning per model. Swapping a model is never just changing an API key.
The conversation
Streaming. Users won't wait. You need real-time token streaming end-to-end: model to backend to UI.
Memory. History grows fast. You can't send everything each turn, but naive truncation loses key facts. Compress by keeping decisions and outcomes and dropping the mechanics.
Recovery. The agent must detect mistakes, tell transient issues from real breakage, and adapt instead of blindly retrying. A useful trick: add a no-op planning step (like Claude Code's TodoWrite) that forces the agent to state its plan and use it to reorient after errors.
Context engineering
The model has a limited context window. Everything the agent knows (page state, user data, conversation history, knowledge base results, system instructions, tool definitions) competes for that space.
Giving it everything does not work. Google's pitch is "2 million token context windows, let the model figure it out." Every team actually shipping agents has found that performance degrades well before hitting the token limit. Research from Liu et al. (Lost in the Middle, TACL 2024) shows a U-shaped attention curve: models perform best when relevant information is at the beginning or end of the input. Information in the middle gets lost, even with long-context models.
The pattern that works is progressive disclosure. Give the model only what it needs right now, load more on demand. Start with the minimum: system instructions, user message, current page summary. Then load context on demand. The agent decides to search docs, results get injected. The agent needs page detail, specific element descriptions are loaded. Compress aggressively: old turns get summarized, page state gets reduced to what's relevant.
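A sketch of that assembly, using a crude 4-characters-per-token estimate and hypothetical context block names. The required core always ships; optional blocks are added in priority order until the budget runs out:

```python
def assemble_context(state, budget_tokens=4000):
    """Start minimal, then add optional context blocks in priority
    order until the token budget is spent."""
    def cost(text):
        return len(text) // 4                      # rough ~4 chars/token heuristic
    required = [state["system"], state["user_message"], state["page_summary"]]
    optional = [state.get("retrieved_docs", ""),   # present only if the agent searched
                state.get("element_detail", ""),   # present only if acting on the page
                state.get("old_turn_summary", "")]
    context, spent = list(required), sum(cost(t) for t in required)
    for block in optional:
        if block and spent + cost(block) <= budget_tokens:
            context.append(block)
            spent += cost(block)
    return context
```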
The numbers back this up. Cursor's lazy tool loading (Dynamic Context Discovery, Jan 2026) reduced token usage by 46.9% in runs that called MCP tools. Vercel removed 80% of their agent's tools and watched token usage drop from 145,000 to 67,000, steps from 100 to 19, and latency from 724 to 141 seconds. The agent went from failing to succeeding.
Technical details: Core agent loop
The core agent loop is simple: while (model returns tool calls): execute tool, capture result, append to context, call model again. Claude Code, Cursor, Manus all run this loop.
The engineering is in what surrounds it: what context enters the loop, how tool results are processed, what gets compressed, what gets injected as reminders after every tool call.
Anthropic found that repeating instructions after every tool execution achieves higher adherence than system-prompt-only instructions. For KV-cache: changing tool definitions at the front of the context invalidates cache for all subsequent tokens. Some systems keep all tools permanently loaded and control availability by constraining output probabilities. Others lazy-load on demand. The right approach depends on your token economics.
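The loop itself fits in a dozen lines. In this sketch `model` is any callable that either requests a tool or returns an answer; the step cap and the reminder text are illustrative, not prescribed values:

```python
def run_agent(model, tools, context):
    """Core agent loop: while the model returns tool calls, execute
    them, append results to context, and call the model again."""
    for _ in range(20):                            # hard cap to stop runaway loops
        step = model(context)
        if "answer" in step:
            return step["answer"]
        result = tools[step["tool"]](**step["args"])
        context.append({"tool": step["tool"], "result": result})
        # Re-inject key instructions after each tool call: adherence drops
        # when they appear only once at the top of a long context.
        context.append({"reminder": "confirm before any write action"})
    return "step limit reached"
```

Everything interesting happens around this loop, which is exactly the point: the loop is trivial, the harness is not.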
Speed, cost, reliability
Every user interaction involves at least one LLM call. Often several.
A typical latency breakdown for a single agent turn:
• DOM read and page state extraction
• Context assembly (history, user data, compression): 20–100ms
• Network round-trip to LLM provider: 100–300ms
• LLM time-to-first-token: 200–800ms (fast model) / 1–4s (reasoning model)
• LLM full generation: 500ms–3s
• Action execution on page (click, fill, navigate): 100–500ms per action
• Post-action verification (DOM settled, validation check): 200–500ms
A simple Q&A turn can hit sub-2 seconds. A multi-step form fill chains 3 to 5 actions, easily pushing to 5 to 10 seconds total.
Usually, the biggest wins are in: model selection (use a fast model for execution, reserve reasoning models for planning), context size (less context = faster inference), and dropping the thinking mode.
Cost. Every interaction costs tokens. A complex multi-step workflow might make 10 to 20 LLM calls. At scale, this adds up fast. Progressive disclosure is both a performance optimization and a cost control mechanism.
Reliability. Your agent depends on an LLM provider. A single-provider architecture means every OpenAI or Anthropic incident becomes your incident. If you're shipping to production at scale, multi-provider routing with automatic failover matters. For v1, pick one provider and move fast. But plan for multi-provider early.
Harness engineering
Your users don't care which model you use. They care if the agent works. And whether it works depends far more on the scaffolding around the model than on the model itself.
The scaffolding (the "harness") is everything that wraps the LLM: the execution loop, tool definitions, context assembly, error recovery, information flow. The model decides what to do. The harness decides what the model can see, what tools it can use, and what happens when it fails.
The evidence is consistent. On CORE-Bench (Kapoor et al.), Claude Opus 4.5 scored 42% with one scaffold and 78% with another. The only variable was the harness. LangChain (Improving Deep Agents, Feb 2026) improved their coding agent from 52.8% to 66.5% on Terminal Bench 2.0 by changing only prompts, tools, and middleware. Zero model changes. Top 30 to Top 5. Manus rebuilt their agent framework five times. Each rewrite removed complexity. Their biggest performance gains came from removing things (Manus: Context Engineering for AI Agents).
The model is the engine. The harness is the car.
One pattern worth internalizing: the harness should get simpler over time, not more complex. As models improve, they need less scaffolding. Fewer hardcoded rules, less prompt engineering, more general instructions. Anthropic, Cursor, and Manus all report this. If your agent keeps getting more complicated while models get better, something is off. (Anthropic: Effective Harnesses for Long-Running Agents, Phil Schmid: Context Engineering Part 2)
Prompt engineering
Your agent’s behavior is defined by prompts. System prompts, tool descriptions, playbook instructions. Every word is a product decision. And the most common mistake is writing too much.
Keep it simple. Smart models are their own worst enemy. Claude, GPT-4, Gemini follow instructions extremely literally. Long prompts create contradictions, and you get inconsistent behavior because the model is alternating between conflicting rules. Start with ~10 lines. Add only when something specific breaks. Every new line can conflict with every other line.
Use LLMs to write your prompts. Paste your system prompt into Claude or Gemini and ask: “What’s ambiguous or contradictory? What would confuse you?” Gemini is good at auditing long prompts; Claude is good at rewriting them to be shorter and clearer.
Reasoning models need less instruction, not more. OpenAI’s prompt engineering guide nails this: treat reasoning models like senior engineers (goal + constraints), standard models like juniors (more step-by-step). If your prompt is 2000 tokens of micro-rules, you’re usually hurting performance.
Few-shot examples beat long explanations. Instead of paragraphs of formatting rules, show 1–3 canonical examples. Anthropic recommends this in their context engineering guide: anthropic.com/engineering/effective-context-engineering-for-ai-agents. Examples don’t contradict themselves; rules do.
Test your prompt the dumb way first. Give the system prompt to a colleague. If they can’t quickly explain what the agent should do, the model won’t either. Then automate with tools like LangSmith or Langfuse to track regressions over time.
Resources worth reading
Anthropic's prompting docs: docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
Anthropic on context engineering for agents: anthropic.com/engineering/effective-context-engineering-for-ai-agents
OpenAI's prompt engineering guide: developers.openai.com/api/docs/guides/prompt-engineering
PromptHub's agent prompting guide: prompthub.us/blog/prompt-engineering-for-ai-agents — 20+ real system prompts from open-source agents like Bolt and Cline.
Prompting Guide: promptingguide.ai — comprehensive reference for techniques.
Testing the decision layer
The routing layer is the hardest part of the agent to test and the most likely to regress.
From day one, capture page state snapshots alongside agent decisions. Every time the agent picks a tool, log: what tools were available, what the user said, what page context was present, what the agent chose, and whether it was correct. This gives you a regression dataset that grows with usage.
You can't unit test "did the agent pick the right button?" without reproducing the exact page state and conversation history. Build the infrastructure to capture and replay those states early. Retrofitting this onto a running agent is painful. Starting with it is cheap.
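A minimal sketch of what one captured decision could look like. The field names and in-memory store are assumptions for illustration; in production the snapshot would go to durable storage so it can be replayed later.

```typescript
// Hypothetical shape for a decision snapshot: everything needed to replay
// one routing decision later. Field names are illustrative.
export interface DecisionSnapshot {
  timestamp: string;
  userMessage: string;
  availableTools: string[];
  pageContext: Record<string, unknown>; // serialized page state
  chosenTool: string;
  wasCorrect?: boolean; // filled in later by review or automated evals
}

// In-memory stand-in for durable storage.
export const snapshots: DecisionSnapshot[] = [];

export function captureDecision(
  userMessage: string,
  availableTools: string[],
  pageContext: Record<string, unknown>,
  chosenTool: string,
): DecisionSnapshot {
  const snap: DecisionSnapshot = {
    timestamp: new Date().toISOString(),
    userMessage,
    availableTools,
    pageContext,
    chosenTool,
  };
  snapshots.push(snap);
  return snap;
}
```

Each snapshot is self-contained: given the page context and message, you can re-run the routing layer and compare its choice against `chosenTool`.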
Building and controlling it
A runtime agent is useless without a way to create, configure, and maintain what it does. Without writing code.
Authoring: visual builder
The people who know what the agent should do (product managers, support leads, customer success) are rarely the people who can write code. If every new workflow or change requires an engineer, the agent becomes a permanent line item on your engineering roadmap. Every tweak waits for a sprint.
Option 1: Engineers build everything. Every workflow is code. Every change is a pull request. This works if you have 5 workflows. It stops working at 50.
Option 2: A visual builder. Non-engineers create and edit workflows directly. Faster iteration, more coverage, fewer bottlenecks. But now you need to build the builder, which is its own product.
If you go with a builder, here's what it needs:
Define steps by interacting with the live page. The admin points at a button on the actual website, clicks it, and that becomes a step. The system generates a robust selector behind the scenes.
Multi-page flow editing. A visual graph: pages as nodes, transitions as edges. Drag to reorder. Add branches. Assign navigation actions to each transition.
Versioning. Draft and live, always. Edits in draft never touch production until published.
Breakage detection. Every change to the customer's website can break existing workflows. A renamed CSS class. A reorganized nav. A redesigned form. The builder must surface broken selectors proactively, not silently fail in production.
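The "robust selector" idea above can be sketched as a preference order: stable attributes first, brittle class chains last. This is a simplified illustration under assumed element metadata, not a full implementation; a real builder would also verify uniqueness against the live DOM.

```typescript
// Minimal sketch of robust selector generation: prefer stable attributes
// (data-testid, id, aria-label) over class chains that break on redesigns.
export interface ElementInfo {
  tag: string;
  id?: string;
  testId?: string; // data-testid attribute
  ariaLabel?: string;
  classes?: string[];
}

export function robustSelector(el: ElementInfo): string {
  if (el.testId) return `[data-testid="${el.testId}"]`;
  if (el.id) return `#${el.id}`;
  if (el.ariaLabel) return `${el.tag}[aria-label="${el.ariaLabel}"]`;
  // Last resort: tag plus classes, the selector most likely to break.
  const cls = (el.classes ?? []).map((c) => `.${c}`).join("");
  return `${el.tag}${cls}`;
}
```

The same preference order doubles as a breakage signal: selectors generated from the "last resort" branch are the ones worth flagging for proactive review.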
Controlling: tone, rules, and guardrails
The agent speaks on behalf of your product. Every message it sends is your brand talking. If it sounds generic, robotic, or off-tone, users notice. If it says something wrong about pricing or policy, that's your company saying it.
You need control over:
Branding and tone. Formal or casual? Short or detailed? Configurable per product, per segment, per page. Your onboarding agent can be warm. Your billing agent should be precise. Both should sound like your company, not like a generic AI.
Playbooks. Structured instructions for specific scenarios. "When a user asks about pricing, respond with these talking points and link to the pricing page." Playbooks are how domain knowledge gets into the agent without fine-tuning a model. They're also how you keep the agent consistent across thousands of conversations.
Guardrails. What the agent is allowed to do and not do. Can it submit forms without confirmation? Can it call destructive APIs? Can it discuss competitor products? These boundaries need to be explicit and enforceable, not implied by prompt phrasing.
Escalation rules. What topics always go to a human. What failure patterns trigger handoff. What context gets passed along. Configurable, not hardcoded.
Conditions and segments. Free users get basic guidance. Enterprise users get white-glove interactions. Admins see configuration help. Different behaviors for different user types, configurable at the workflow level.
Language. If your users are multilingual, the agent detects language from input and responds accordingly. Your knowledge base content, playbooks, and element references all need localization.
Every change you make to tone, playbooks, or guardrails is effectively a prompt change. And prompt changes ripple. A tweak to make the agent "more proactive" on one page can make it overstep on every other page. You need to test changes against your full set of scenarios, not just the one you're fixing.
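"Explicit and enforceable, not implied by prompt phrasing" can be sketched as a policy object checked in code before any action runs. The policy fields and action shapes here are assumptions for illustration.

```typescript
// Sketch of enforceable guardrails: a policy checked in code before an
// action executes, independent of whatever the prompt says. Names are
// illustrative assumptions.
export interface GuardrailPolicy {
  allowFormSubmitWithoutConfirmation: boolean;
  destructiveApisAllowed: boolean;
  blockedTopics: string[]; // e.g. competitor products
}

export type Action =
  | { kind: "submit_form"; confirmed: boolean }
  | { kind: "call_api"; destructive: boolean }
  | { kind: "reply"; topic: string };

export function isAllowed(policy: GuardrailPolicy, action: Action): boolean {
  switch (action.kind) {
    case "submit_form":
      return action.confirmed || policy.allowFormSubmitWithoutConfirmation;
    case "call_api":
      return !action.destructive || policy.destructiveApisAllowed;
    case "reply":
      return !policy.blockedTopics.includes(action.topic);
  }
}
```

Because the check runs outside the model, a prompt regression cannot bypass it: even if the model decides to submit an unconfirmed form, the action layer refuses.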
Operating it
Building the agent is a project. Operating it is a permanent job.
Measuring
If you can't see what the agent is doing, you can't know if it's working.
Log everything. Every agent session, every conversation, every decision. What context was available, what tool was selected, what action was taken, what happened next. Without that record, a bug report like "it doesn't work on our settings page" gives you nothing to investigate.
Three layers of measurement:
Agent-level metrics. Completion rates, drop-off points, error rates per step, latency per interaction, cost per session. And what users actually type into the agent. That last one is underrated: it reveals gaps in your product, not just in the agent. You'll find 30% of messages are the same 5 questions, and another 30% are things the agent can't handle yet. For agent-specific tracing: LangSmith, LangWatch, or Langfuse capture full interaction trajectories and run automated evaluations.
Product-level analytics. Did activation improve? Did time-to-value drop? Are users reaching features they weren't reaching before? Are support tickets going down? Tools like Amplitude, Mixpanel, or PostHog track funnels, retention, and feature adoption. PostHog also includes LLM-specific analytics.
Session recordings. You want to see what actually happened. Not just the agent's logs, but the user's full experience. FullStory, LogRocket, or PostHog's built-in replay. When the agent clicks the wrong button, a session recording shows you exactly what the user experienced.
Show value to leadership. Before launch, pick one metric that maps to what your product team already tracks. Activation lift. Time-to-value reduction. Ticket deflection rate. Instrument it from day one. When someone asks "is the agent working?", you need a number, not a story.
Technical detail: Build your metrics pipeline as a first-class system.
You'll need agent tracing for debugging, product analytics for business impact, and session replay for qualitative investigation. Keep them connected: when you spot a drop-off in your funnel (Amplitude/Mixpanel), you should be able to jump to the agent trace (LangSmith/Langfuse) and the session recording (FullStory/LogRocket) for the same user.
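One way to keep the three systems connected is a single trace id attached to every event in every system. The emitter below is a stand-in sketch; in production each event would go through the relevant SDK (analytics, tracing, replay), all tagged with the same id.

```typescript
// Sketch: one shared trace id carried across all three measurement
// systems, so a funnel drop-off can be joined to the agent trace and the
// session replay for the same user. Emitters are stand-ins for SDK calls.
export interface MetricEvent {
  traceId: string;
  system: "analytics" | "agent_trace" | "session_replay";
  name: string;
  payload: Record<string, unknown>;
}

export function emitAll(
  traceId: string,
  name: string,
  payload: Record<string, unknown>,
): MetricEvent[] {
  const systems: MetricEvent["system"][] = [
    "analytics",
    "agent_trace",
    "session_replay",
  ];
  // In production, each would dispatch to its own backend (e.g. a product
  // analytics tool, an LLM tracing tool, a replay tool) with this traceId.
  return systems.map((system) => ({ traceId, system, name, payload }));
}
```

With the id in place, "jump from the funnel to the trace to the recording" becomes a join on one key instead of a manual timestamp hunt.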
Maintaining
The agent doesn't stay working by itself. Three things break it constantly.
Prompt ripple effects. You update a system prompt to handle an edge case. It fixes that case. It breaks 4 others you didn't test. A single prompt change touches every conversation, every workflow, every page. There is no such thing as a "small prompt change."
Model updates. Your LLM provider releases a new version. Your agent's behavior shifts. Subtly. Phrasing changes. Tool selection patterns drift. An edge case that worked before now fails.
Customer app changes. A page gets redesigned. CSS classes change. DOM structure shifts. Selectors break. Workflows fail silently. Users hit broken experiences before you know something changed.
Eval pipelines. Traditional software has deterministic tests. An LLM agent acting on a live DOM does not. What you need: a library of real scenarios. For each one, save the page state, the user message, the conversation history, the expected agent behavior. Every time you change a prompt, update a model, or a customer page changes, replay these scenarios. Compare decisions against expected outcomes. Automate this. Run it on every change.
Without it, you find regressions when users report them. With it, you find them before deploy. Building this pipeline is a standalone engineering investment. Most teams underestimate it until a production incident forces the build.
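The replay loop described above can be sketched in a few lines. `decideTool` stands in for your real routing layer; the scenario shape is an assumption kept minimal for illustration.

```typescript
// Minimal replay harness sketch: saved scenarios are re-run against the
// current agent and decisions compared to expected behavior.
export interface Scenario {
  name: string;
  pageState: Record<string, unknown>;
  userMessage: string;
  expectedTool: string;
}

export type DecideFn = (
  pageState: Record<string, unknown>,
  userMessage: string,
) => string;

export function replayScenarios(
  scenarios: Scenario[],
  decideTool: DecideFn,
): { passed: number; failed: string[] } {
  const failed: string[] = [];
  for (const s of scenarios) {
    const actual = decideTool(s.pageState, s.userMessage);
    if (actual !== s.expectedTool) {
      failed.push(`${s.name}: expected ${s.expectedTool}, got ${actual}`);
    }
  }
  return { passed: scenarios.length - failed.length, failed };
}
```

Wire this into CI so every prompt change, model swap, or selector update runs the full library, and a regression shows up as a named failing scenario instead of a user report.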
Improving
You want to release something and then get better. Not release and move on.
Collect bad interactions. Thumbs up/down on every agent response. Simple, low friction. Pair it with full session logs: the user message, the page state, the agent's decision, the outcome. That combination tells you what went wrong and why.
Someone reviews them. Weekly. Pull the thumbs-down sessions. Categorize: wrong tool selected? Bad knowledge base answer? Broken selector? Missing capability? Each category has a different fix.
Then update accordingly. Wrong tool selection → adjust routing logic or prompt. Bad KB answer → update the article or re-chunk it. Broken selector → fix in the builder. Missing capability → add to the roadmap. Then replay your scenario library to make sure the fix didn't break something else.
This is a loop. Continuous. The agent's quality is the quality of this loop.
Upgrading
New models release constantly. Better reasoning, larger context, lower cost. Each one is an opportunity and a risk.
Regression test first. Run your scenario library against the new model. Compare completion rates, tool selection accuracy, response quality, latency, cost. Don't switch in production until you've validated against your specific workflows.
Multi-model from the start. You'll end up using multiple models: a fast one for simple responses, a capable one for complex reasoning, a specialized one for embeddings. Build for this from day one. Single-model architectures become bottlenecks the moment you need to optimize for speed, cost, or capability separately.
The harness should get simpler. As models improve, they need less scaffolding. If a model upgrade requires you to add complexity, something is off. Plan for simplification.
Conclusion
Building an in-app AI agent is building a product inside your product. It has its own users, its own bugs, its own metrics, and its own operational overhead. The AI model is a component. An important one, but a component. The rest is engineering.
Your users are already prompting ChatGPT about your product. They're just getting bad answers because ChatGPT can't see their screen, doesn't know their account, and can't do anything. The opportunity is giving them that same interaction where it actually works: inside your app, with real context, real actions, real data.
Start with one page or flow. Make it work reliably. Then expand. We'll keep updating this guide as the category evolves. If you want to talk about what you're building, or if you'd rather not build all of this yourself, reach out.