AI Support Sandbox for Regulated Teams
The Constraint
Support volume tripled in Q3 2024. Our agents were drowning in repetitive triage tasks, but we operate in healthtech, so dumping records into a chatbot was not an option. We needed AI assistance with zero chance of leaking PHI, breaking audit trails, or confusing regulatory reviewers. The answer was a sandbox: an isolated environment where AI models generate suggestions, humans stay in control, and every action is logged. This write-up explains how we designed that sandbox, the guardrails that keep it safe, and the rollout plan that made adoption painless.
High-Level Architecture
- Ticket Intake. Support tickets flow into Zendesk. A webhook emits redacted payloads (conversation summary, metadata, anonymized identifiers) to our sandbox API.
- Sandbox Service. A Cloud Run service fetches full context from Supabase (under strict row-level security), assembles it with playbook instructions, and calls OpenAI's GPT-4o mini model with bounded prompts.
- Human Review UI. Agents see AI-generated summaries, suggested responses, and compliance call-outs inside a Notion-based desk. They can accept, edit, or reject with a single shortcut.
- Audit + Feedback Loop. Every AI suggestion and human action is logged in BigQuery. We run weekly evaluations and feed agent corrections back into the prompt library.
All secrets stay in HashiCorp Vault; the sandbox uses short-lived tokens for every API call.
Guardrails We Enforced
- Data Minimization. Prompts contain only the fields needed to answer a question. No PHI, no backup contact info, no open-ended attachments.
- Response Limits. The model is instructed to stay under 220 words, cite the specific policy it references, and include a "confidence" field. If confidence is below 0.7, we default to "needs manual handling."
- Tools Registry. The model can call functions (KB lookup, policy reference, prior ticket search) but every tool is whitelisted and audited.
- Human-in-the-Loop. Nothing is sent to a customer without a human pressing "Send." Agents can edit or request a new draft in one click.
- Rate Limits + Kill Switch. If rejection rate exceeds 20% in a rolling hour, the sandbox disables itself and notifies the on-call engineer.
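The response limits above can be sketched as a small gate that runs before a draft reaches an agent. This is a minimal illustration, not our production code: the Draft fields and the gate function are hypothetical names, and routing over-length or uncited drafts to manual handling is an assumption (the guardrail only specifies that behavior for low confidence).

```python
from dataclasses import dataclass
from typing import List

MAX_WORDS = 220        # drafts must stay under this many words
MIN_CONFIDENCE = 0.7   # below this, fall back to manual handling


@dataclass
class Draft:
    text: str
    confidence: float
    policy_refs: List[str]  # hypothetical field: policy clauses the model cited


def gate(draft: Draft) -> str:
    """Return 'show' if the draft may be shown to an agent, else 'needs_manual_handling'."""
    too_long = len(draft.text.split()) >= MAX_WORDS
    uncited = not draft.policy_refs
    # Assumption: length and citation failures are treated like low confidence.
    if draft.confidence < MIN_CONFIDENCE or too_long or uncited:
        return "needs_manual_handling"
    return "show"
```

Keeping the gate outside the prompt means the limits hold even when the model ignores its instructions.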
Isolation Patterns That Passed Audit
Regulators were less interested in the AI model and more interested in how we isolated it. We split the architecture into a data plane (Zendesk, Supabase, Vault) and a control plane (sandbox API, prompt store, evaluation jobs). The control plane can orchestrate workflows but never stores raw transcripts. Service accounts have single-purpose scopes, and every call uses short-lived OAuth tokens minted by Vault so there are no long-lived secrets to leak.
Key patterns that repeatedly impressed auditors:
- Dual redaction. Zendesk webhooks strip obvious identifiers; the sandbox service runs a second scrub plus deterministic replacements. If the two systems disagree, we quarantine the ticket.
- Immutable artifacts. Prompt inputs, outputs, and evaluation results are written to BigQuery tables with append-only access. Investigations can replay any suggestion with zero mutation risk.
- Air-gapped evals. Offline evaluation jobs run on a separate project with no production network access, so a rogue eval script can't touch live tickets.
- Config-as-code. Prompt versions, tool registries, and policy references live in Git with CODEOWNERS. Compliance reviews the diffs the same way it reviews infrastructure changes.
By treating isolation as a product feature (not an afterthought), we sped up legal approvals and avoided endless review cycles.
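One way to read the dual-redaction pattern is that the second scrub re-checks text the webhook has already redacted, and any leftover identifier counts as a disagreement. The sketch below assumes that interpretation. The two regex patterns and the function names are hypothetical stand-ins for a real detector set.

```python
import re
from typing import List

# Hypothetical second-pass patterns; a real deployment would use its own detectors.
SECOND_PASS_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def second_scrub(webhook_text: str) -> List[str]:
    """Return the kinds of identifiers still present in already-redacted text."""
    return [
        kind
        for kind, pattern in SECOND_PASS_PATTERNS.items()
        if pattern.search(webhook_text)
    ]


def route(webhook_text: str) -> str:
    """Quarantine the ticket when the second pass finds what the first pass missed."""
    leftovers = second_scrub(webhook_text)
    return "quarantine" if leftovers else "sandbox"
```

For example, route("Patient [EMAIL_1] asked about billing") sends the ticket on to the sandbox, while text that still contains a raw email address is quarantined.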
Training the Prompt Library
We treated prompt development like software:
- Dataset Assembly. Pulled 1,200 historical tickets, redacted identifiers, and tagged them with intent, resolution quality, and compliance considerations.
- Prompt Iterations. Wrote modular prompts referencing the company playbook ("If FDA clause present, highlight evidence requirement"). Each prompt lives in prompts/ with version history.
- Evaluator Scripts. Built a small eval harness that scores AI responses on accuracy, tone, and policy alignment. CI fails if evals regress.
- Agent Feedback Loop. When agents edit suggestions they check a box (missing context, tone, compliance). A nightly job clusters that feedback and opens GitHub issues if any prompt underperforms.
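The "CI fails if evals regress" rule can be sketched as a comparison between stored baseline scores and the scores for a candidate prompt version. The metric names follow the three dimensions listed above. The tolerance value, the score scale, and the function names are assumptions for illustration.

```python
from typing import Dict, List

METRICS = ("accuracy", "tone", "policy_alignment")
TOLERANCE = 0.01  # hypothetical allowance for run-to-run noise


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = TOLERANCE,
) -> List[str]:
    """List the metrics where the candidate scores worse than baseline minus tolerance."""
    return [m for m in METRICS if candidate[m] < baseline[m] - tolerance]


def ci_exit_code(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Return 1 (fail the build) if any metric regressed, else 0."""
    failed = regressions(baseline, candidate)
    for metric in failed:
        print(f"regression in {metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f}")
    return 1 if failed else 0
```

A CI step would pass the result of ci_exit_code to sys.exit so that a regression in any single dimension blocks the merge.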
Deployment Plan
- Week 1: Sandbox in shadow mode. AI suggestions invisible to agents; we evaluated drift and logging.
- Week 2: Limited pilot with 3 agents handling low-risk categories (general inquiries). Daily standups captured friction points.
- Week 3: Expanded to 50% of tickets. Introduced the confidence threshold automation.
- Week 4: Full rollout, followed by a "support hack day" where agents helped refine prompts and UI.
At every stage, compliance and legal reviewed logs to confirm privacy requirements were met.
Monitoring and SLOs
- Suggestion accuracy: 85% of agent approvals must require <10% editing.
- Turnaround time: AI draft available within 8 seconds p95.
- Deflection rate: 75% of tier-1 tickets handled without escalation.
- Confidence alerts: If three low-confidence suggestions appear in a row, trigger an incident channel.
We display these metrics in Grafana; anomalies page the on-call support engineer via PagerDuty.
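The confidence alert can be sketched as a streak counter. This assumes "low-confidence" means the same 0.7 threshold used by the response guardrail and that the counter resets after firing so one streak raises one alert. The class name is hypothetical.

```python
class LowConfidenceStreak:
    """Signal when several low-confidence suggestions arrive in a row."""

    def __init__(self, threshold: float = 0.7, streak_limit: int = 3) -> None:
        self.threshold = threshold        # assumption: same cutoff as the response guardrail
        self.streak_limit = streak_limit  # three in a row, per the SLO list
        self.streak = 0

    def observe(self, confidence: float) -> bool:
        """Record one suggestion; return True when an alert should fire."""
        if confidence < self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.streak_limit:
            self.streak = 0  # reset so a single streak produces a single alert
            return True
        return False
```

A caller would post to the incident channel whenever observe returns True.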
Incident Response and Chaos Drills
The kill switch only works if people practice using it. Every quarter we run a "sandbox chaos hour" where we simulate three scenarios: mass low-confidence responses, Vault token expiration, and knowledge-base mismatch. Agents, compliance, and engineering walk through the incident runbook together. We measure mean time to disable the sandbox, mean time to identify the root cause, and the quality of the customer comms template we keep on file.
If a real issue triggers the kill switch, the process is already muscle memory:
- Sandbox API flips to maintenance mode and returns canned responses advising agents to draft manually.
- PagerDuty notifies on-call plus a compliance liaison.
- A Notion incident doc is created with pre-filled sections for timeline, affected tickets, and regulatory follow-up.
- Once resolved, the runbook requires a retro with both engineering and support leads before re-enabling AI suggestions.
These drills sound heavy-handed, but they kept us calm the one time GPT latency spiked and started timing out half the queue.
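The automatic trigger behind the kill switch (a rejection rate above 20% in a rolling hour) can be sketched with a sliding window of review outcomes. The class name and the min_samples floor are assumptions; the floor is only there so a handful of early rejections does not trip the switch.

```python
from collections import deque
from typing import Deque, Tuple


class KillSwitch:
    """Disable the sandbox when the rolling rejection rate exceeds a limit."""

    def __init__(
        self,
        max_rejection_rate: float = 0.20,
        window_seconds: float = 3600.0,
        min_samples: int = 20,  # hypothetical floor to avoid tripping on tiny samples
    ) -> None:
        self.max_rejection_rate = max_rejection_rate
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self.events: Deque[Tuple[float, bool]] = deque()
        self.enabled = True

    def record(self, timestamp: float, rejected: bool) -> bool:
        """Record one agent decision; return whether the sandbox is still enabled."""
        self.events.append((timestamp, rejected))
        # Drop decisions that have fallen out of the rolling window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()
        if len(self.events) >= self.min_samples:
            rate = sum(1 for _, r in self.events if r) / len(self.events)
            if rate > self.max_rejection_rate:
                self.enabled = False  # stays off until someone re-enables it
        return self.enabled
```

The switch never turns itself back on, which matches the runbook requirement of a retro before AI suggestions are re-enabled.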
Agent Experience
Agents work inside a Notion-powered desk with the following flow:
- View ticket summary, sentiment score, and relevant playbooks.
- Read AI suggestion, highlighted risks, and references (KB ID, policy clause).
- Accept, edit, or mark "needs escalation." We capture keyboard shortcuts to keep interactions under two seconds.
- Resend sends the final email using the component library discussed earlier.
Agents loved that the desk also surfaces contextual data (MRR, plan tier, health score). The AI suggestion becomes a starting point, not an order.
Change Management and Enablement
We underestimated how much enablement work it takes to convince agents that AI is a partner, not a threat. Before rollout we ran live clinics showing how the sandbox logs every action, how to override a suggestion, and how we track individual credit for great edits. Support leadership also rewrote the performance rubric so humans are evaluated on coaching signal and escalation quality, not on blindly accepting AI drafts. Finally, compliance has a standing slot in the weekly support meeting to share audit findings, which keeps the partnership collaborative rather than adversarial.
Results
- Resolution time: -52% for tier-1 tickets.
- Agent satisfaction: 4.5/5 average in monthly surveys.
- Deflection: 87% of low-complexity inquiries resolved without engineering escalation.
- Compliance incidents: zero. Audit logs satisfied both SOC 2 and HIPAA reviewers.
Lessons Learned
- Sandbox first. Legal won't approve AI unless you can demonstrate isolation and auditability from day one.
- Measure rejection reasons. We learned more from "why agents rejected a draft" than from aggregate accuracy stats.
- Security isn't optional. Pulling secrets from Vault, using short-lived tokens, and logging every prompt completion kept our security team onboard.
- Invest in UI polish. Adoption skyrocketed after we reduced the number of clicks to accept/edit a suggestion.
Frequently Asked Questions
Q: How do you redact PHI reliably?
We built a lightweight PII detection service using deterministic regexes plus spaCy NER models tuned on our ticket data. Detected spans are replaced with stable placeholders so agents can rehydrate them after AI generation.
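A minimal sketch of the stable-placeholder idea, using regexes only (the spaCy NER pass is omitted here). The two patterns, the placeholder format, and the function names are assumptions for illustration, not the service's actual rules.

```python
import re
from typing import Dict, Tuple

# Hypothetical patterns; the real service combines regexes with tuned NER models.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace detected spans with placeholders; the same value always gets the same one."""
    mapping: Dict[str, str] = {}            # placeholder -> original value
    seen: Dict[Tuple[str, str], str] = {}   # (label, value) -> placeholder

    for label, pattern in PATTERNS.items():
        def replace(match: "re.Match[str]", label: str = label) -> str:
            key = (label, match.group(0))
            if key not in seen:
                index = sum(1 for k in seen if k[0] == label) + 1
                placeholder = f"[{label}_{index}]"
                seen[key] = placeholder
                mapping[placeholder] = match.group(0)
            return seen[key]

        text = pattern.sub(replace, text)
    return text, mapping


def rehydrate(text: str, mapping: Dict[str, str]) -> str:
    """Restore original values in a draft after AI generation."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

The mapping stays on the trusted side of the boundary; only the redacted text is placed in a prompt.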
Q: Can compliance review specific prompts?
Yes. Every prompt/response pair is stored with a hash of the inputs, model version, and agent outcome. Compliance can search by ticket ID, customer, or keyword and replay the entire interaction.
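A stored record of this kind might look like the sketch below. The field names are hypothetical, and hashing a canonical JSON form of the inputs with SHA-256 is one reasonable choice rather than a description of our exact scheme.

```python
import hashlib
import json
from typing import Any, Dict


def input_hash(prompt_inputs: Dict[str, Any]) -> str:
    """Hash the prompt inputs in a canonical form so key order does not matter."""
    canonical = json.dumps(prompt_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(
    ticket_id: str,
    prompt_inputs: Dict[str, Any],
    response: str,
    model_version: str,
    agent_outcome: str,
) -> Dict[str, Any]:
    """Build one append-only row describing a suggestion and what the agent did with it."""
    return {
        "ticket_id": ticket_id,
        "input_hash": input_hash(prompt_inputs),
        "response": response,
        "model_version": model_version,
        "agent_outcome": agent_outcome,  # e.g. accepted, edited, rejected
    }
```

Because the hash is computed over sorted keys, replaying the same inputs yields the same hash, which lets a reviewer confirm that a replay matches the logged interaction.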
Q: What about multilingual tickets?
We detect language at intake and route non-English tickets through a translation microservice before they hit the sandbox. Agents reviewing the draft see both the translated suggestion and the original text for context.
What's Next
We plan to train lightweight fine-tuned models on our redacted dataset, reducing API latency and cost. We are also experimenting with automated knowledge-base updates: when AI detects a missing article, it drafts a skeleton doc for the documentation team. If you operate in a regulated industry and want to pilot a similar sandbox, contact us; we're happy to share the Terraform configs, prompt library, and evaluator scripts that kept us compliant.