Insight · Dec 23, 2024 · 7 min read

Prompt Ops for Growth: How I Version Prompts

My system for versioning prompts, monitoring outputs, and avoiding hallucinations in growth flows.

Practical insight for running prompt operations: repos, linters and review rituals.

By Marina Álvarez
#AI #Process


I treat prompts like code: tests, PRs and success metrics.

Context

It all started with a broken nurture campaign. A well-intentioned but poorly edited prompt sent a confusing, off-brand email to thousands of our users. The fallout was immediate: a spike in support tickets, a drop in engagement, and a general loss of trust in our AI-powered communications.

This incident was a wake-up call. We realized that we couldn't treat our prompts like simple copy. They were a critical part of our infrastructure, and they needed to be managed with the same rigor and discipline as our code.

That's when I established Prompt Ops, a function that brings engineering-grade standards to prompt management. Our growth squads now treat prompts like code: version-controlled in Git, covered by automated tests, shipped through a formal release process, and subject to incident management. Marketing still owns the copy, but within a structured framework that ensures quality, consistency, and safety: they merge their own pull requests, but only after a review process that combines automated checks with human oversight.

Stack I leaned on

  • Git repo (prompts/) with structured JSON + markdown context: We use a dedicated Git repository to store all of our prompts. Each prompt is a markdown file with a YAML front matter that contains metadata like the prompt's owner, purpose, and constraints. This gives us version control, a clear history of changes, and the ability to manage our prompts like any other codebase.
  • Playwright + custom eval harness, plus OpenAI Evals for regression tests: We use a combination of Playwright, a custom evaluation harness, and OpenAI Evals to test our prompts. Playwright is used for functional tests that call the prompt via an API and ensure that the response is valid. Our custom eval harness and OpenAI Evals are used for regression tests that track the tone, accuracy, and hallucination rate of our prompts over time.
  • Linear workflow with reviewers from growth, product, legal: All prompt changes go through a formal review process in Linear. This process includes reviewers from the growth, product, and legal teams, who ensure that the prompt is effective, on-brand, and compliant with all relevant regulations.
  • Datadog metrics + PostHog analytics for prompt usage/outcome tracking: We use Datadog and PostHog to track the usage and outcome of our prompts. This allows us to see how our prompts are performing in the real world and to identify any issues that need to be addressed.
  • Supabase storing prompt versions and metadata for runtime lookup: We use Supabase to store our prompt versions and metadata. This allows our runtime services to fetch the correct version of a prompt at runtime and to enforce any constraints that are defined in the prompt's metadata.

Prompt Lifecycle

  1. Proposal: marketing/product opens Linear ticket with business goal, inputs, outputs, KPIs.
  2. Design doc: Notion template describing context, persona, risks, guardrails.
  3. Implementation: create prompt file with metadata header (owner, expiry, dataset).
  4. Testing: run npm run prompt:test <slug> to execute the Playwright/regression suite.
  5. Review: CODEOWNERS require AI lead + growth PM approval. Legal reviews sensitive flows.
  6. Release: merge to main, GitHub Action publishes prompt to Supabase, updates changelog, notifies #prompt-ops.
  7. Monitoring: Datadog dashboard tracks success rate, tokens, fallback usage. Alerts fire when drift > threshold.
  8. Retire: prompts expire after 60 days unless renewed; script flags stale prompts weekly.
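The expiry check in step 8 can be sketched as a small helper. This is a minimal illustration with hypothetical names (our actual script also posts results to Slack):

```typescript
// Sketch of the weekly stale-prompt check (hypothetical helper names).
// A prompt is stale once its metadata expiry date has passed.
interface PromptMeta {
  id: string;
  owner: string;
  expiry: string; // ISO date, e.g. "2025-03-01"
}

function flagStale(prompts: PromptMeta[], today: Date): PromptMeta[] {
  return prompts.filter((p) => new Date(p.expiry).getTime() < today.getTime());
}

const stale = flagStale(
  [
    { id: "nurture_onboarding_v3", owner: "@marina", expiry: "2025-03-01" },
    { id: "escalation_summary_v2", owner: "@sam", expiry: "2024-11-01" },
  ],
  new Date("2024-12-23"),
);
// stale contains only the prompt whose expiry has passed
```

A cron-triggered GitHub Action runs this weekly and opens renewal tickets for the owners of anything flagged.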

The Core Principles of Prompt Ops

  • Prompts are code: Treat your prompts with the same rigor and discipline as your code. This means version control, automated testing, and a formal release process.
  • Collaboration is key: Prompt Ops is a team sport. It requires close collaboration between marketing, product, engineering, and legal.
  • Data is everything: Use data to track the performance of your prompts and to identify any issues that need to be addressed.
  • Automation is your friend: Automate as much of the Prompt Ops lifecycle as possible. This will free up your team to focus on more strategic work.
  • Safety first: Put safeguards in place to prevent your prompts from generating harmful or off-brand content.

Repo Structure

prompts/
  onboarding/
    nurture_onboarding_v3.md
    nurture_onboarding_v3.tests.ts
  support/
    escalation_summary_v2.md
datasets/
  onboarding_es.jsonl
  onboarding_en.jsonl
scripts/
  publish.ts
  eval.ts

Markdown files include YAML front matter (owner, purpose, constraints) followed by template text. Tests live next to prompts so contributors see everything in one place.
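As a rough illustration, the front matter can be split from the template body in a few lines. This is a sketch that only handles flat key: value pairs; in practice you would use a real YAML parser:

```typescript
// Minimal sketch: split "---"-delimited front matter from the template body.
// Handles only flat key: value pairs; a YAML library is the real answer.
function parsePromptFile(source: string): { meta: Record<string, string>; template: string } {
  const match = source.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/);
  if (!match) throw new Error("missing front matter");
  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, template: match[2] };
}

const file = `---
owner: @marina
purpose: Post-signup activation email
---
Hi {{user_name}}, here is what you unlocked this week...`;

const parsed = parsePromptFile(file);
// parsed.meta.owner is "@marina"; parsed.template is the prompt text
```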

Runtime API

At runtime, services hit a lightweight API:

const prompt = await promptStore.fetch({
  id: "nurture_onboarding_v3",
  locale: "es-MX",
  version: "latest"
});
const response = await aiClient.generate(prompt.render({ user_name, usage_summary }));

Versions are immutable; rollbacks simply switch the pointer to a previous version.
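The immutable-versions-plus-pointer idea can be sketched in memory (this stands in for the Supabase table; class and method names are mine, not our production API):

```typescript
// Sketch of "versions are immutable; rollback moves a pointer".
// In-memory stand-in for the Supabase-backed store; names are hypothetical.
class PromptVersionStore {
  private versions = new Map<string, string>(); // "id@version" -> template
  private latest = new Map<string, number>();   // id -> current version pointer

  publish(id: string, version: number, template: string): void {
    const key = `${id}@${version}`;
    if (this.versions.has(key)) throw new Error("versions are immutable");
    this.versions.set(key, template);
    this.latest.set(id, version);
  }

  rollback(id: string, version: number): void {
    if (!this.versions.has(`${id}@${version}`)) throw new Error("unknown version");
    this.latest.set(id, version); // no content changes, just the pointer
  }

  fetchLatest(id: string): string {
    return this.versions.get(`${id}@${this.latest.get(id)}`)!;
  }
}

const store = new PromptVersionStore();
store.publish("nurture_onboarding", 2, "v2 template");
store.publish("nurture_onboarding", 3, "v3 template");
store.rollback("nurture_onboarding", 2);
// fetchLatest("nurture_onboarding") now returns the v2 template
```

Because publish refuses to overwrite an existing version, a rollback is always a safe, instant pointer flip rather than a redeploy.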

Changelog

  • Every merge triggers scripts/publish.ts which updates CHANGELOG.md.
  • Entries include prompt id, intent, datasets touched, success metrics, and fallback instructions.
  • Growth + support read the changelog before campaigns; trust improved dramatically because people know what changed.
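A simplified sketch of the entry scripts/publish.ts appends (the field names are my shorthand for the list above, not the script's actual types):

```typescript
// Simplified sketch of a changelog entry; field names are hypothetical shorthand.
interface Release {
  promptId: string;
  intent: string;
  datasets: string[];
  successMetric: string;
  fallback: string;
}

function changelogEntry(r: Release, date: string): string {
  return [
    `## ${date} · ${r.promptId}`,
    `- Intent: ${r.intent}`,
    `- Datasets: ${r.datasets.join(", ")}`,
    `- Success metric: ${r.successMetric}`,
    `- Fallback: ${r.fallback}`,
  ].join("\n");
}

const entry = changelogEntry(
  {
    promptId: "nurture_onboarding_v3",
    intent: "Post-signup activation email",
    datasets: ["onboarding_en", "onboarding_es"],
    successMetric: "activation CTR",
    fallback: "revert pointer to v2",
  },
  "2024-12-23",
);
```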

Prompt Schema Example

id: nurture_onboarding_v3
owner: @marina
purpose: "Post-signup activation email"
inputs:
  - user_name
  - product_usage_summary
  - plan_type
constraints:
  length: "<200 words"
  tone: "confident, warm"
  required_sections:
    - value_statement
    - single_cta
eval_sets:
  - onboarding_es
  - onboarding_en
expiry: 2025-03-01

Runtime services fetch metadata alongside the prompt text to enforce constraints (e.g., a transformer verifies the CTA section exists).
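A minimal sketch of that constraint check, assuming a made-up convention where the CTA is wrapped in [cta]...[/cta] markers (our real transformer is richer):

```typescript
// Hedged sketch of the runtime constraint check against the schema above.
// Assumes a hypothetical convention: the CTA is wrapped as [cta]...[/cta].
interface Constraints {
  maxWords: number; // the schema's "<200 words"
}

function checkConstraints(output: string, c: Constraints): string[] {
  const violations: string[] = [];
  const words = output.trim().split(/\s+/).length;
  if (words >= c.maxWords) violations.push(`too long: ${words} words`);
  const ctas = output.match(/\[cta\]/g)?.length ?? 0;
  if (ctas !== 1) violations.push(`expected exactly one CTA, found ${ctas}`);
  return violations;
}

const ok = checkConstraints(
  "You unlocked analytics this week. [cta]Invite your team[/cta]",
  { maxWords: 200 },
);
// ok is an empty array: both constraints hold
```

Outputs that fail the check never reach the user; the service falls back to static copy instead.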

Testing & Evals

  • Unit tests: snapshots verifying placeholders, markdown structure, guardrail text.
  • Functional tests: Playwright flows calling the prompt via API, ensuring responses include CTA and no banned phrases.
  • Regression evals: monthly dataset of real user contexts; we track tone score, accuracy, hallucinations.
  • Red-team prompts: legal/compliance create adversarial inputs (PII requests, policy violations) to ensure safeguards hold.

CI fails if tests or evals drop below thresholds; reports attach to the PR for reviewers.
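The CI gate itself is simple. A sketch of the comparison (metric names mirror the eval command below; the threshold values here are made up):

```typescript
// Sketch of the CI gate: fail when any eval metric drops below its threshold.
// Metric names are illustrative; thresholds here are invented for the example.
type Metrics = Record<string, number>;

function gate(results: Metrics, thresholds: Metrics): { pass: boolean; failures: string[] } {
  const failures = Object.entries(thresholds)
    .filter(([name, min]) => (results[name] ?? 0) < min)
    .map(([name, min]) => `${name}: ${results[name] ?? 0} < ${min}`);
  return { pass: failures.length === 0, failures };
}

const verdict = gate(
  { tone: 0.92, cta_presence: 0.97, hallucination_free: 0.99 },
  { tone: 0.85, cta_presence: 0.95, hallucination_free: 0.98 },
);
// verdict.pass is true here; in CI a failing gate exits non-zero and blocks the merge
```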

Eval Harness Snippet

npm run prompt:eval nurture_onboarding_v3 \
  --dataset datasets/onboarding_en.jsonl \
  --metrics tone,length,cta_presence

Results persist in Supabase so we can graph drift over time.

Incident Response

If a prompt misbehaves (e.g., hallucinated pricing):

  1. PagerDuty alert triggered from Datadog metric (success rate < threshold).
  2. On-call flips feature flag to fallback prompt (stored in Supabase).
  3. Incident doc auto-creates with offending inputs/outputs.
  4. Root cause analysis, fix/new version, release notes, and communication to stakeholders.

We treat incidents like service outages; response time averages 12 minutes.

Governance & Training

  • Monthly “prompt council” reviews upcoming experiments, approves new data sources, and audits incidents.
  • Designers + PMMs attend workshops on writing structured instructions and negative examples.
  • Onboarding includes a “prompt spelunking” exercise: new hires submit a PR, run evals, and present results.
  • Every prompt owner rotates through on-call so they feel the impact of sloppy edits.

Metrics & Telemetry

  • Reduced incidents: Incidents caused by prompt edits have been reduced by 80%.
  • Faster approval time: The time it takes to get a prompt update approved has been reduced from 3 days to 8 hours.
  • Increased team confidence: Team confidence in our AI-guided flows has increased to 9/10.
  • Full test coverage: 100% of our prompts now have automated evaluation coverage.
  • High changelog adoption: 95% of our launches are now referenced in our weekly growth call, thanks to the changelog.
  • High AE satisfaction: Account Executive satisfaction with our AI-generated copy is 4.6/5, according to our latest survey.
  • Cost tracking: We now track the cost per prompt edit (including tokens and reviews), which helps our Product Marketing Managers understand the tradeoffs of different approaches.

Monitoring Dashboard

  • Success rate (user took intended action).
  • Escalation rate (prompt fell back to human copy).
  • Token usage per prompt (cost visibility).
  • Drift vs. baseline metrics (tone, length, key phrases).
  • Recent releases and incident statuses.

Ops + marketing watch the same dashboard, so no one wonders “is AI behaving?”.
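The drift line on that dashboard is just relative change against a baseline. A sketch, with an assumed 10% alert threshold (the real thresholds vary per metric):

```typescript
// Sketch of the dashboard's drift metric: relative change vs a baseline.
// The 10% alert threshold is an assumption for illustration.
function drift(baseline: number, current: number): number {
  return Math.abs(current - baseline) / baseline;
}

function driftAlerts(
  baseline: Record<string, number>,
  current: Record<string, number>,
  threshold = 0.1,
): string[] {
  return Object.keys(baseline).filter((k) => drift(baseline[k], current[k]) > threshold);
}

const alerts = driftAlerts(
  { tone: 0.9, avg_length: 150 },
  { tone: 0.88, avg_length: 190 },
);
// avg_length drifted ~27% while tone moved ~2%, so only avg_length alerts
```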

Cost of Skipping Prompt Ops

Before this system, a bad prompt blasted 4k customers with outdated pricing and cost us two enterprise deals. That single incident cost more than the engineer time we now invest in governance.

What stuck with me

  • Ownerless prompts become Frankenstein monsters; assign module captains.
  • Documentation needs negative examples so people know what to avoid.
  • Treat prompts as config; smaller files, more reusability.
  • Bundle prompts with datasets; drift detection matters more than clever wording.

Cost Snapshot

  • OpenAI usage: ~$320/mo (we cut tokens by enforcing constraints).
  • Datadog dashboards: part of existing plan; incremental $0.
  • Supabase: $25/mo for metadata API + eval logs.

Cheap compared to lost revenue when a prompt hallucinates pricing.

Case Study: Support Escalation

Support wanted AI summaries of ongoing incidents. Prompt Ops ensured:

  1. Prompt stored next to dataset with legal-approved language.
  2. Tests verified no sensitive PII leaked.
  3. Monitoring tracked hallucination rate (target <2%).
  4. When a vendor outage caused drift, alerts fired; on-call reverted to previous version within 8 minutes.

Without Prompt Ops we would have let the AI email customers inaccurate information.

Implementation Timeline

  • Week 1: create prompt repo, add schema + lint rules, migrate top 5 prompts.
  • Week 2: build eval harness, connect Datadog metrics, set up changelog + automation.
  • Week 3: onboard marketing to PR flow, run first red-team session, configure PagerDuty incidents.
  • Week 4: expand coverage to support + success prompts, run retro, iterate.

Tooling Cost

  • GitHub + Actions: existing plan.
  • Playwright + eval tokens: ~$60/mo.
  • Datadog dashboards: part of company contract.
  • PagerDuty + Slack: reused from other teams.

Total incremental spend <$100/mo; savings from fewer incidents far outweigh it.

Prompt Ops also boosted morale—marketing trusts AI again because they can see exactly how outputs are tested and monitored.

Next up: auto-generate pull requests when eval drift surpasses thresholds so the system proposes fixes before humans even look.

FAQ

  • Do copywriters still matter? Absolutely—they write the tone, examples, and acceptance criteria; Prompt Ops just gives them safer tooling.
  • How do you handle vendor changes? Prompt configs reference abstraction layer; swapping model providers is a config change, not a repo-wide edit.

What I'm building next

I'm shipping a repo template with linting built in, plus a Datadog dashboard starter. Want early access? Send me your GitHub handle.


Want me to help you replicate this module? Drop me a note and we’ll build it together.
