Insight · Jan 6, 2025 · 10 min read

AI Ops War Room: Live Control Board for Operations

I built a war room that blends metrics, alerts, and AI assistants so the team can react in minutes.

Insight on centralizing critical metrics, alerts and actionable playbooks with AI.

By Marina Álvarez
#Operations #AI #Dashboards


The war room lives in a dark dashboard where AI summarizes and suggests actions.

Context

Before the AI Ops War Room, our operational meetings were a chaotic blend of disparate tools and manual processes. Imagine a typical ops meeting: someone would be fumbling through a spreadsheet, another trying to refresh a stale dashboard, and critical alerts would be lost in a flurry of Slack messages. This fragmented approach led to slow decision-making, missed incidents, and a constant state of reactive firefighting. The lack of a unified view meant that our team spent more time correlating information than actually solving problems. We needed a solution that could cut through the noise, centralize critical information, and empower our operations team to make rapid, informed decisions.

I envisioned a digital "war room" – a single pane of glass that would unify metrics, alerts, and actionable playbooks, all augmented by AI copilots. The goal was to transform our ops meetings from hours-long correlation exercises into minutes-long decision sprints. This war room, built as a Notion + Metabase hub, backed by Snowflake and PostHog, and powered by OpenAI function calls, was designed to summarize complex situations and proactively suggest actions, ensuring our team could react to incidents in minutes, not hours.

Architecture Overview

The AI Ops War Room is built on a robust, layered architecture designed for scalability, real-time insights, and actionable intelligence.

  • Data sources: Our operational intelligence begins with comprehensive data ingestion. We pull financial and operational data from Snowflake, product usage insights from PostHog, customer support interactions from Zendesk, and billing information from Stripe. This diverse data landscape provides a 360-degree view of our operations.
  • Processing: Data from various sources is transformed and validated using dbt (data build tool), ensuring data quality and consistency. Metaplane is integrated for continuous data observability, detecting anomalies and schema changes before they impact our dashboards. All processed metrics are then published to Metabase, our business intelligence platform.
  • War room UI: The central interface of our war room is a Notion page. This flexible environment allows us to embed dynamic Metabase dashboards directly, providing real-time visualizations of our KPIs. Custom React widgets (microfrontends built with Next.js and served via Vercel embeds) enhance the Notion page with interactive elements, such as CTA buttons for triggering playbooks and a chat-like action log.
  • AI Copilot: The intelligence layer is powered by OpenAI function calling with GPT-4o mini. The AI copilot leverages real-time metrics and comprehensive playbook metadata to summarize complex situations, identify root causes, and draft recommended actions. Crucially, guardrails are implemented to ensure AI suggestions align with our operational policies and require human confirmation for critical actions.
  • Automation hooks: To translate insights into action, we've integrated n8n workflows and PagerDuty incidents. These are triggered directly via interactive buttons within the Notion UI, allowing our ops team to launch automated playbooks or escalate incidents with a single click.
Snowflake/dbt ---> Metabase ---> Notion embeds
      |                          |
PostHog ---> Supabase (anomalies) ---> OpenAI Copilot ---> n8n playbooks

This architecture ensures that data flows seamlessly from raw sources to actionable insights, with AI augmenting human decision-making and automation streamlining response.

Core Components

The AI Ops War Room is composed of several interconnected components, each playing a vital role in centralizing information and enabling rapid response:

  1. Mission KPIs: We identified three critical Key Performance Indicators (KPIs) for each operational track (e.g., MRR growth for finance, lead velocity for growth, support backlog for customer service). Each KPI is displayed with clear budgets (e.g., churn must stay <1.5% monthly) and trendlines, providing an immediate pulse on operational health. Escalation policies are documented alongside each budget.
  2. Anomaly Feed: This component is the early warning system. Metaplane and PostHog webhooks continuously monitor our data streams for unusual patterns or deviations. Detected anomalies are pushed into a Supabase database, where the AI copilot summarizes their potential impact and suggests relevant playbooks from the library.
  3. Playbook Library: A comprehensive collection of pre-defined operational procedures, stored as a Notion database and versioned in Git (exported as YAML files). Each playbook details the trigger conditions, step-by-step actions, responsible owners, and associated automation hooks (e.g., n8n workflow IDs, PagerDuty service IDs).
  4. Action Log: A Supabase table that meticulously records every action taken within the war room. This includes who triggered what, the outcome of the action, and any follow-up tasks. This log provides a complete audit trail, crucial for post-incident reviews and accountability.

Playbook for Building the War Room

Building the AI Ops War Room was a structured, iterative process, focusing on integrating data, AI, and automation into a cohesive operational hub.

1. Define Signals & Budgets

  • Collaborate with executives: Work closely with executive leadership to identify and prioritize 3-5 critical KPIs for each major domain (Growth, Product, Support, Finance). These should be the metrics that truly move the needle for the business.
  • Document budgets and escalation policies: For each defined KPI, establish clear operational budgets (e.g., "churn must stay <1.5% monthly," "lead velocity must not drop below 100 leads/day"). Crucially, document the escalation policies: what happens when a budget is breached, who is notified, and what are the immediate next steps. This sets clear expectations and triggers for action.
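Budgets and escalation policies can live next to the playbooks in the same versioned format. A minimal sketch of one KPI definition (the field names here are illustrative, not our exact schema):

```yaml
kpi: churn_rate
domain: finance
budget:
  operator: "<"
  value: 0.015          # churn must stay under 1.5% monthly
  window: monthly
escalation:
  warn_at: 0.012        # amber state, post to #ops-war-room
  breach_owner: "Finance Lead"
  playbook_id: churn_spike_review
```

Keeping budgets in Git alongside playbooks means a breached threshold and the response it triggers are reviewed in the same pull request.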

2. Build the Data Backbone

  • Consolidate metrics in Snowflake: Centralize all relevant operational metrics within Snowflake. Utilize dbt models to transform raw data into clean, aggregated tables, specifically fct_kpi_daily for daily performance tracking and fct_anomaly_events for capturing unusual data patterns.
  • Implement data quality tests: Integrate dbt tests to ensure the freshness, accuracy, and completeness of your data. Configure these tests to trigger alerts (e.g., page the ops guard) if data quality falls below predefined thresholds, preventing stale or incorrect data from reaching the war room.
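The dbt side of this can stay small. A sketch of the tests and source freshness checks (model name `fct_kpi_daily` is from the article; the column names, source names, and thresholds are assumptions):

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_kpi_daily
    columns:
      - name: kpi_date
        tests: [not_null, unique]
      - name: kpi_value
        tests: [not_null]

sources:
  - name: posthog
    loaded_at_field: ingested_at   # assumed ingestion timestamp column
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: events
```

`dbt source freshness` failing the `error_after` threshold is what pages the ops guard before stale numbers ever reach Metabase.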

3. Layer the UI

  • Embed Metabase dashboards in Notion: Create dynamic Metabase dashboards for each KPI and operational area. Embed these dashboards directly into the Notion war room page using secure tokens, providing real-time, interactive visualizations.
  • Develop custom Next.js widgets: Enhance the Notion UI with custom React widgets (microfrontends). These can include interactive CTA buttons for triggering playbooks, a chat-like action log for real-time updates, or custom data displays that Notion/Metabase don't natively support. Serve these via Vercel embeds for seamless integration.

4. Wire AI Copilot

  • Craft prompt templates: Develop sophisticated prompt templates for the OpenAI copilot. These templates should summarize KPIs, highlight anomalies, and recommend specific playbooks based on the current operational context.
  • Define function calling actions: Configure OpenAI's function calling capabilities to enable the AI to trigger specific actions, such as trigger_playbook, open_incident, or summarize_status.
  • Implement guardrails: Crucially, build guardrails around the AI copilot. Ensure that the AI cannot trigger actions above a certain severity level without explicit human confirmation, preventing unintended consequences.
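The tool schema and the severity guardrail from the steps above can be sketched in a few lines. Tool names come from the article; the JSON-schema details, the confirmation threshold, and the helper name are illustrative assumptions, and the actual `chat.completions.create` call is omitted:

```python
# Tool definitions in the OpenAI function-calling format; these would be
# passed as tools= to the chat completion call that powers the copilot.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "trigger_playbook",
            "description": "Launch an n8n playbook by id.",
            "parameters": {
                "type": "object",
                "properties": {
                    "playbook_id": {"type": "string"},
                    "severity": {"type": "integer", "description": "1 = most severe"},
                },
                "required": ["playbook_id", "severity"],
            },
        },
    },
    # open_incident and summarize_status are defined the same way.
]

CONFIRM_AT_OR_BELOW = 2  # severity 1-2 always needs a human


def needs_human_confirmation(tool_name: str, args: dict) -> bool:
    """Guardrail: critical playbooks are drafted by the AI but never auto-run."""
    if tool_name != "trigger_playbook":
        return False
    return args.get("severity", 1) <= CONFIRM_AT_OR_BELOW
```

The guardrail runs server-side on every tool call the model emits, so a hallucinated high-severity action can only ever produce a confirmation prompt, never an execution.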

5. Automate Playbooks

  • Store playbooks as YAML: Each operational playbook should be stored as a version-controlled YAML file within your Git repository. This YAML defines the playbook's ID, severity, trigger conditions, input parameters, and the n8n workflow ID it should invoke.
  • Integrate n8n workflows: Design n8n workflows that correspond to each YAML-defined playbook. Buttons within the Notion war room UI should call an API endpoint that launches the relevant n8n workflow, with its status and outcome fed back into the Action Log.
id: support_surge_plan
severity: 2
triggers:
  - metric: support_backlog
    condition: "> 150"
steps:
  - type: n8n
    workflow_id: support-surge
  - type: slack
    channel: "#support-war-room"
    template: surge_notice.md
approvals:
  required_roles: ["IC","Support Lead"]

Every button references a definition like this so we can audit what changed between versions.
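Before a button fires, the server validates the parsed definition and checks approvals. A stdlib-only sketch (YAML parsing, e.g. with PyYAML, is left out; the function names are illustrative):

```python
# Validate a parsed playbook definition and gate execution on approvals.
# The playbook dict mirrors the YAML structure shown above.

REQUIRED_KEYS = {"id", "severity", "triggers", "steps", "approvals"}


def validate_playbook(pb: dict) -> list[str]:
    """Return a list of problems; an empty list means the playbook is runnable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - pb.keys())]
    if not pb.get("steps"):
        problems.append("playbook has no steps")
    return problems


def can_run(pb: dict, approver_roles: set[str]) -> bool:
    """True only when every required role has signed off."""
    required = set(pb.get("approvals", {}).get("required_roles", []))
    return required.issubset(approver_roles)
```

Running this check at button-render time (not just at execution time) keeps stale or half-edited playbooks from ever appearing as clickable actions in Notion.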

6. Rollout & Training

  • Weekly ops syncs: Conduct weekly operational sync meetings directly within the war room. Encourage every discussion to reference the live dashboards and AI summaries, fostering a data-driven culture.
  • Establish guard rotation: Implement a clear guard rotation schedule, ensuring that a designated individual is responsible for monitoring alerts and responding to incidents 24/7.
  • Publish enablement materials: Create and distribute enablement materials, such as Loom video walkthroughs and mini-quizzes, to ensure every new operator understands how to interpret budgets, trigger playbooks, and log actions effectively.

7. Implementation Timeline

| Week | Milestone |
|------|-----------|
| 1 | Define KPIs/budgets, design dashboard layout, list playbooks |
| 2 | Build Snowflake/dbt layers, connect Metabase, set up anomaly feed |
| 3 | Prototype AI Copilot + action log in Supabase |
| 4 | Integrate n8n playbooks, run drills, onboard guard rotation |
| 5 | Roll out to broader org, create read-only views for marketing/finance |

Key Principles of an AI Ops War Room

  • Centralized visibility: Unify all critical operational metrics, alerts, and communication channels into a single, accessible dashboard to eliminate information silos.
  • AI-augmented decision making: Leverage AI copilots to summarize complex situations, identify anomalies, and suggest actionable playbooks, accelerating response times.
  • Automation of routine actions: Automate the execution of pre-defined playbooks and incident response workflows to reduce manual effort and ensure consistent execution.
  • Human-in-the-loop control: Maintain human oversight and confirmation for critical AI-suggested actions, balancing automation efficiency with human judgment and accountability.
  • Data quality and observability: Implement robust data pipelines with continuous testing and anomaly detection to ensure the accuracy and freshness of operational metrics.
  • Version-controlled playbooks: Treat operational playbooks as code, storing them in Git with version control to enable auditability, collaboration, and continuous improvement.
  • Continuous feedback and iteration: Establish regular feedback loops with the operations team to refine AI models, playbooks, and the overall war room functionality.

Common Failure Modes (and Fixes)

  1. Alert fatigue and ignored notifications:
    • Problem: Overly sensitive alerts or a high volume of non-critical notifications can lead to ops teams ignoring the war room, defeating its purpose.
    • Fix: Implement intelligent alert routing and severity levels. Tune anomaly detection models to minimize false positives. Use PagerDuty for critical, actionable alerts and Slack for informational updates. Establish clear escalation paths.
  2. Stale data and lack of trust:
    • Problem: If the data displayed in the war room is outdated, inaccurate, or inconsistent, the ops team will lose trust in the system and revert to manual checks.
    • Fix: Implement robust data quality tests (dbt tests) and freshness checks. Integrate data observability tools (Metaplane) to detect and alert on data pipeline issues. Clearly display data freshness timestamps on dashboards.
  3. AI "hallucinations" or irrelevant suggestions:
    • Problem: If the AI copilot provides inaccurate summaries or suggests irrelevant playbooks, it can erode trust and lead to wasted effort.
    • Fix: Continuously refine AI prompt templates with specific instructions and guardrails. Ensure the AI has access to the most relevant and up-to-date context. Implement a strong human-in-the-loop mechanism for critical suggestions and gather feedback on AI accuracy.
  4. Complex or undocumented playbooks:
    • Problem: Playbooks that are difficult to understand, lack clear steps, or are not easily accessible will hinder rapid response and adoption.
    • Fix: Store playbooks as version-controlled YAML in Git, making them easy to read and update. Supplement with Loom video walkthroughs and clear Notion documentation. Conduct regular playbook drills to ensure familiarity and identify areas for improvement.
  5. Resistance to change and adoption issues:
    • Problem: Ops teams accustomed to traditional workflows may resist adopting a new, AI-augmented system without proper buy-in and training.
    • Fix: Involve the ops team in the design and iteration process from the beginning. Highlight how the war room reduces manual toil and empowers faster decision-making. Provide comprehensive training and celebrate early wins to build confidence and foster adoption.
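The alert-routing fix above (fatigue, failure mode 1) boils down to one dispatch function. A sketch in which the channel names and severity cut-offs are illustrative:

```python
# Route an alert by severity so PagerDuty only fires for actionable
# incidents and informational noise lands in the async digest.

def route_alert(severity: int) -> str:
    """Severity 1-2 pages, 3 posts to the war room, everything else is digest-only."""
    if severity <= 2:
        return "pagerduty"
    if severity == 3:
        return "slack:#ops-war-room"
    return "daily-digest"
```

Tuning these cut-offs quarterly, based on how often each channel's alerts were actually acted on, is what keeps the war room from being ignored.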

Guard Rotation

To ensure continuous coverage and distribute responsibility, we implemented a structured guard rotation:

| Role | Responsibility |
|------|----------------|
| IC | Triage anomalies, drive decisions, log actions |
| Comms | Post updates in #ops-war-room, escalate to execs |
| Automation lead | Launch playbooks, monitor workflows |

The PagerDuty schedule rotates weekly. Drills run monthly to keep muscle memory sharp.

Guard Checklist

  • Check anomaly feed every hour.
  • Verify Copilot summaries (accuracy + recommended playbooks).
  • Ensure Action Log entries have owners and follow-up dates.
  • Host a 10-minute end-of-shift handoff Loom for the next guard.

Metrics & Impact

The implementation of the AI Ops War Room has yielded significant improvements across our operational efficiency and incident response capabilities:

  • Reduced meeting time: Daily stand-up time dropped from 45 minutes to 18 minutes because decisions were already prepared and context was readily available.
  • Faster alert handling: Alerts were handled in under 10 minutes (p90), a drastic improvement from previous response times.
  • Full accountability: 100% of actions were logged with owners, creating an auditable trail for every operational decision and outcome.
  • Increased efficiency: AI suggestions were adopted 62% of the time, saving approximately 4 hours per week in manual analysis and decision-making.
  • Improved budget adherence: Budget breaches were reduced from 6 per quarter to only 1, indicating better proactive management of operational KPIs.
  • Reduced incident resolution time: The Mean Time To Resolution (MTTR) for critical incidents decreased by 30%, thanks to faster detection and automated playbook execution.
  • Enhanced cross-functional collaboration: Post-incident retrospectives showed a 25% improvement in collaboration scores between engineering, product, and operations teams.

Sample Workflow

This example illustrates how the AI Ops War Room streamlines incident response:

  1. Detection: Metaplane detects a significant spike in the support backlog, exceeding predefined thresholds.
  2. Alert & AI Summary: An alert is pushed to Supabase, and the AI copilot generates a summary, posting it in the war room with a suggested playbook (“Activate support surge plan”).
  3. Human Action: The Incident Commander (IC) reviews the AI summary, assesses the situation, and clicks the “Run surge plan” button within the Notion UI.
  4. Automated Response: n8n automatically opens a PagerDuty incident, notifies the on-call Customer Success team, spins up a dedicated Slack channel for the incident, and adjusts Zendesk priorities to reflect the surge.
  5. Audit Trail: The Action Log records every step, and the Notion page updates automatically with real-time status.
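The audit trail behind steps 3-5 can be sketched as: record the action first, then hand off to automation. The dataclass mirrors the Action Log schema described in the next section; the helper name and the in-memory list standing in for Supabase are illustrative:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ActionLogEntry:
    triggered_by: str          # Slack handle of the approver
    playbook_id: str           # references the YAML definition
    inputs: dict               # payload sent to n8n
    status: str = "Pending"
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def run_playbook(log: list, triggered_by: str, playbook_id: str, inputs: dict) -> ActionLogEntry:
    """Write the log row before triggering automation, so even a failed
    n8n call leaves an auditable Pending entry."""
    entry = ActionLogEntry(triggered_by, playbook_id, inputs)
    log.append(entry)
    # The n8n webhook call would go here; on acceptance:
    entry.status = "Running"
    return entry
```

Logging before triggering is the design choice that matters: the audit trail can never be missing a row for an action that actually ran.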

Action Log Schema

The Action Log is a critical component for auditability and post-incident analysis:

| Column | Description |
|--------|-------------|
| action_id | UUID for traceability |
| triggered_by | Slack handle of human who approved/ran the action |
| playbook_id | References YAML definition |
| inputs | JSON payload sent to n8n |
| status | Pending / Running / Completed / Rolled back |
| postmortem_link | Optional link if action escalated to incident |

We expose this table directly in Notion so execs can audit decisions without pinging ops leads.
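A possible Postgres/Supabase definition of this table (the column types and the `created_at` column are assumptions beyond the schema above):

```sql
create table action_log (
  action_id        uuid primary key default gen_random_uuid(),
  triggered_by     text not null,             -- Slack handle
  playbook_id      text not null,             -- matches the YAML id
  inputs           jsonb not null,            -- payload sent to n8n
  status           text not null default 'Pending',
  postmortem_link  text,
  created_at       timestamptz not null default now()
);
```

With Supabase row-level security, execs get a read-only policy on this table while only the service role backing the Notion buttons can insert.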

Case Study: Q4 Infrastructure Incident

During a critical renewal week in Q4, our Snowflake warehouse experienced a stall, threatening our MRR forecast. The AI Ops War Room proved invaluable:

  • Detection: Anomaly hit the war room; Copilot summarized impact (“MRR forecast at risk, ingestion lag 45 min”) and suggested playbook “Switch to backup warehouse + pause heavy jobs”.
  • Response: The IC launched the playbook: n8n invoked Terraform to scale the standby warehouse, triggered PagerDuty for data engineering, and posted status updates with countdown timers.
  • Resolution: Within 14 minutes the backlog cleared, and finance received a single Slack thread documenting the fix. No frantic DMs, no conflicting dashboards. The incident was contained and resolved with minimal impact.

Communication Rhythm

Effective communication is paramount during operational incidents. The war room establishes a clear rhythm:

  • Morning brief: Copilot posts KPI summary + anomalies at 8am local time.
  • Midday update: guard shares decisions taken + blockers in #ops-war-room.
  • EOD digest: automated summary of actions, incidents, outstanding items; execs read it asynchronously.
  • Weekly exec note: condensed version emailed to leadership, linking to action log + upcoming risks.

Lessons Learned

  • The war room doesn’t replace human judgment—it accelerates it. AI provides context and suggestions, but the final decision and accountability remain with the human operator.
  • Playbooks must be editable from the dashboard; buried docs kill adoption. Ease of access and modification for playbooks is crucial for their utility and adoption.
  • Copilot needs context + guardrails; otherwise it hallucinates action plans. The AI's effectiveness is directly tied to the quality of its input data and the constraints placed upon its responses.
  • Logging is non-negotiable; exec trust comes from audits. A comprehensive action log provides transparency and builds confidence in the operational process.
  • Communication cadence is as important as tooling—quiet war rooms get ignored. Regular, structured communication ensures stakeholders are informed and engaged.
  • Document playbooks like code; YAML + Git reviews prevented people from running stale instructions. Version control for playbooks ensures they are always up-to-date and reliable.
  • Run postmortems even for “successful” actions—the guard rotation reviews action logs weekly and files small retro notes when something felt clunky. Those micro-retros keep the war room evolving instead of calcifying.

FAQ

How do you keep AI summaries accurate? We feed the model only curated metrics and context (budgets, deltas) and cap temperature at 0.2. Humans confirm before execution, so output stays grounded in reliable data.

What if someone triggers the wrong playbook? Each action requires a confirmation plus a stated reason, and the guard can roll back via n8n or trigger an "undo" playbook.

Can teams outside ops use the war room? Yes. Marketing has a read-only view, and the AI drafts weekly exec briefs automatically, which spreads awareness without granting operational control.

What about security? Every action is authenticated via Supabase, and OpenAI requests run through a proxy that strips sensitive fields. Only the IC can press automation buttons; everyone else can comment.

Do you version playbooks? Yes. GitHub hosts the YAML definitions, and pull requests include reviewers from ops, eng, and compliance before a button becomes available in the UI. Playbooks get the same rigor as code.

What I'm building next

I'm actively working on several enhancements to the AI Ops War Room. One is pushing anomaly detection closer to the edge with smaller, cheaper models (OpenAI GPT‑4o mini, with Llama Guard filtering inputs), which will reduce latency and improve real-time responsiveness. I'm also integrating budget burn-down charts directly into Slack, providing immediate visibility into KPI adherence.

Furthermore, I'm building an “Ops API” so other teams can query the same action log, subscribe to anomaly events, and embed mini-war-room widgets in their own dashboards. This will democratize operational insights and foster a more proactive, data-driven culture across the organization.

Cost Snapshot

The investment in the AI Ops War Room is minimal compared to the value it delivers in terms of efficiency, reduced downtime, and improved decision-making.

  • Metabase Pro + embedding: $150/month. This covers the core BI platform and the ability to embed dashboards securely within Notion.
  • OpenAI GPT‑4o mini usage: ~$120/month (for summaries + copilot). This cost scales with usage but remains highly efficient for the intelligence it provides.
  • n8n on Fly.io: $25/month. This covers the self-hosted automation engine, allowing for custom workflows and integrations.
  • PagerDuty: Part of our global plan, with an allocated cost of ~$80/month to ops. This ensures critical alerts are handled promptly.
  • Supabase: Our Pro plan costs $25/month, used for anomaly feeds, action logs, and data contract metadata.

The total incremental spend is about $400/month. This is a trivial cost compared to the potential financial impact of a single avoided outage, which can easily save a six-figure renewal. The tooling cost is easily justified by the clarity, speed, and reliability gained in our operational processes.


Want me to help you replicate this module? Drop me a note and we’ll build it together.
