Insight · Jan 22, 2025 · 7 min read

Observability for RevOps Analytics

Monitoring freshness, volume and logic of RevOps models so sales never finds out first.

How I implemented observability for the revenue pipeline using dbt, Metaplane and actionable alerts.

By Marina Álvarez
#Data #RevOps #Monitoring

I'd rather be the one telling sales about an incident, not the other way around.

Context

Revenue lives and dies by clean metrics. When pipeline dashboards go red because of a stale sync or a malformed event, executives lose trust. Two years ago, we relied on humans to notice anomalies, usually during Monday forecast calls. This reactive approach led to delayed incident detection, prolonged resolution times, and a constant erosion of confidence in our data. A single data discrepancy in our revenue operations (misreported pipeline, incorrect ARR figures, or broken lead routing) was costly, affecting sales strategy, financial forecasting, and overall business health.

I implemented observability across the entire RevOps stack—dbt models, reverse ETL processes, and CRM syncs—so anomalies show up minutes after they happen, not hours later when sales discovers them. Now, we treat our revenue models like production systems, complete with data contracts, Service Level Agreements (SLAs), and detailed runbooks. This proactive approach has drastically reduced incident response times, minimized data-related disruptions, and fundamentally transformed how our sales and leadership teams perceive the reliability of our revenue data.

Stack I leaned on

  • dbt Core + Elementary for freshness, volume, and schema tests: dbt (data build tool) is our primary tool for data transformation. We use dbt Core to define our data models and Elementary to implement comprehensive data quality tests, including freshness checks, volume anomalies, and schema validation. Elementary also provides a user-friendly interface for monitoring these tests.
  • Metaplane for machine learning-based anomaly detection on key metrics: Metaplane connects directly to our data warehouse and BI tools, providing machine learning-driven anomaly detection. It learns historical patterns in our key revenue metrics and alerts us to deviations that might indicate underlying data issues, often before dbt tests even fail.
  • Supabase storing data-contract metadata (owners, budgets, severity): Supabase serves as a lightweight, flexible database for storing metadata about our data contracts. This includes information like model owners, freshness budgets, data quality severity levels, and links to downstream dashboards, making it easy to enrich alerts with context.
  • n8n automation to open incidents and capture context: n8n is our workflow automation tool. It listens for webhooks from Elementary and Metaplane, and when an anomaly is detected, it orchestrates the incident creation process, opening tickets in Linear, enriching them with relevant context, and posting to Slack.
  • PagerDuty + Slack for alerting and on-call rotations: PagerDuty is integrated for critical alerts, ensuring that on-call engineers and RevOps analysts are immediately notified of high-severity data incidents. Slack serves as our primary communication channel for incident coordination and lower-priority alerts.
  • Notion for runbooks and incident postmortems: Notion is our central repository for documentation. We store detailed runbooks for each critical data model, outlining diagnostic steps, remediation actions, and communication protocols. Post-incident reviews are also documented here, fostering a culture of continuous learning.

Pain Points Before Observability

Before implementing a robust observability program, our RevOps analytics suffered from several critical pain points:

  1. Surprise outages: Sales would frequently ping us asking, “Is the ARR dashboard stuck?” before we even noticed that a critical ingestion job had failed. This meant sales was often the first to detect data issues, eroding their trust.
  2. Unknown ownership: While our dbt models had code owners, there was no clear incident owner for data quality issues. We wasted valuable time finding the right responders, delaying resolution.
  3. Alert fatigue: Naive data quality checks fired nightly due to expected load variances, leading to a high volume of false positives. Teams started ignoring alerts and became desensitized to real issues.
  4. Missing context: Alerts just said “freshness test failed,” forcing responders to manually dig through logs and dashboards to understand the scope and impact of the problem. This increased Mean Time To Resolution (MTTR).

The new observability program solved each of these with clear SLAs, automation, and shared dashboards, transforming our reactive approach into a proactive one.

Architecture Snapshot

The RevOps Analytics Observability architecture is designed for comprehensive monitoring and rapid incident response:

  1. Contracts layer: YAML files under data_contracts/ define tables, critical fields, freshness budgets, owners, and business impact. These contracts are version-controlled in Git and serve as the single source of truth for data expectations.
  2. Testing layer: dbt tests + Elementary run on every deployment and hourly for P0 (revenue-critical) models. These tests validate data quality, schema adherence, and freshness, providing immediate feedback on data health.
  3. Anomaly layer: Metaplane ingests query history from our warehouse and BI tools, compares metrics against expected behavior using machine learning, and suppresses noise. This identifies subtle deviations that rule-based tests might miss.
  4. Alerting layer: n8n listens to webhooks from test failures/anomalies (from Elementary/Metaplane), enriches alerts with context (owner, last good run, downstream dashboards), and triggers PagerDuty for critical incidents or Slack threads for lower-severity issues.
  5. Runbook layer: Notion template pre-populated with logs, queries, and recovery steps. Each critical data model has a dedicated runbook, ensuring that responders have immediate access to diagnostic and remediation procedures.
  6. Observability dashboards: Metabase shows pipeline health, open incidents, burn-down of freshness budgets, and reverse ETL lag. These dashboards provide real-time visibility to all stakeholders, from data engineers to sales leadership.
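
The testing and alerting layers both come down to one comparison: how old is a model relative to its freshness budget? Here is a minimal Python sketch of that check. The helper name `freshness_status` and the sample timestamps are assumptions for illustration; this is not how Elementary implements its freshness tests.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_loaded_at: datetime, now: datetime, budget_minutes: int) -> dict:
    """Compare a model's age against its freshness budget (hypothetical helper)."""
    age_minutes = int((now - last_loaded_at).total_seconds() // 60)
    return {
        "age_minutes": age_minutes,
        "budget_minutes": budget_minutes,
        "stale": age_minutes > budget_minutes,
    }


# Illustrative values: a model last loaded 60 minutes ago with a 30-minute budget.
now = datetime(2025, 1, 22, 11, 5, tzinfo=timezone.utc)
status = freshness_status(now - timedelta(minutes=60), now, budget_minutes=30)
print(status)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test in drills.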

Playbook

Implementing observability for RevOps analytics followed a structured playbook to ensure comprehensive coverage and effective incident management.

  1. Inventory critical models: Identify and prioritize all revenue-critical data models, such as pipeline, ARR, bookings, churn, and lead routing. Assign each an SLA (Service Level Agreement) for freshness and accuracy.
  2. Define contracts: For each critical model, define a data contract in YAML. This includes schema, accepted data ranges, dependencies, downstream dashboards, and business impact. These contracts live in Git and require PR review, treating data definitions like code.
  3. Add dbt tests: Implement comprehensive dbt tests. This includes freshness tests with budgets (e.g., 30 min for pipeline data), volume tests for duplicates or unexpected row counts, and schema tests for field types and nullability.
  4. Deploy Elementary: Use elementary.yml to configure severity levels for different test failures, route alerts by model owner, and capture historical trends of data quality metrics.
  5. Configure Metaplane: Connect Metaplane to our data warehouse and BI tools. Train its machine learning models on historical data to understand seasonality and normal behavior. Mark which signals deserve PagerDuty alerts versus Slack-only notifications.
  6. Automate incidents: Configure an n8n workflow to listen to webhooks from Elementary/Metaplane. When a test fails twice in a row or a high-severity anomaly occurs, the workflow automatically creates a PagerDuty incident, opens a Linear ticket, and posts to #revops-ops with enriched context.
  7. Runbooks & drills: Every critical model has a detailed runbook in Notion describing diagnostic steps (SQL queries, checks), remediation actions, and communication protocols. Quarterly drills simulate failures to keep the process sharp and build muscle memory.
  8. Share dashboards: Embed pipeline-health and SLA adherence dashboards in the RevOps wiki and Metabase. This ensures leadership and sales teams see data health proactively, fostering trust and transparency.
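
The escalation rule in step 6 (two consecutive failures, or a high-severity anomaly) can be sketched in Python. The real logic lives in an n8n workflow, so the function names, the "pass"/"fail" result format, and the destination labels below are assumptions made for illustration.

```python
from typing import List


def should_page(recent_results: List[str], anomaly_severity: str = "none") -> bool:
    """Return True when the last two test runs failed or the anomaly is high severity.

    recent_results is ordered oldest to newest; each entry is "pass" or "fail"
    (an assumed format, not an Elementary or Metaplane payload).
    """
    two_consecutive_failures = recent_results[-2:] == ["fail", "fail"]
    return two_consecutive_failures or anomaly_severity == "high"


def route(recent_results: List[str], anomaly_severity: str = "none") -> List[str]:
    """Escalations get a page, a ticket, and a Slack post; everything else is Slack only."""
    if should_page(recent_results, anomaly_severity):
        return ["pagerduty", "linear", "slack:#revops-ops"]
    return ["slack:#revops-ops"]


print(route(["pass", "fail", "fail"]))
print(route(["fail", "pass", "fail"]))
```

Requiring two failures in a row is one simple way to absorb a single flaky run without paging anyone.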

Key Principles of RevOps Analytics Observability

  • Proactive Detection: Identify data anomalies and quality issues before they impact business operations or are discovered by end-users.
  • Data as a Product: Treat revenue-critical data models as production systems, complete with contracts, SLAs, and dedicated ownership.
  • Context-Rich Alerting: Alerts should provide immediate, actionable context, including impacted dashboards, suggested remediation, and owner information.
  • Automated Incident Response: Leverage automation to streamline incident creation, enrichment, and communication, reducing manual effort and human error.
  • Blameless Culture: Focus post-incident reviews on system and process improvements rather than individual blame, fostering continuous learning.
  • Shared Ownership: Data quality and observability are a shared responsibility across data, engineering, and RevOps teams.
  • Continuous Improvement: Regularly review and refine data contracts, tests, runbooks, and alert thresholds based on incident learnings and evolving business needs.

Common Failure Modes (and Fixes)

  1. Alert Fatigue:
    • Problem: Too many non-actionable or false-positive alerts train teams to ignore them, so real issues get missed.
    • Fix: Tune anomaly detection thresholds carefully. Implement severity levels (PagerDuty for critical, Slack for informational). Use machine learning-based tools like Metaplane that learn normal behavior. Establish clear alert routing based on ownership.
  2. Lack of Context in Alerts:
    • Problem: Generic alerts like "test failed" force responders to manually investigate, increasing MTTR.
    • Fix: Enrich alerts with context: last successful run, impacted dashboards, suggested diagnostic queries, and links to runbooks. Use automation (n8n) to pull this information dynamically.
  3. Undefined Ownership:
    • Problem: Unclear ownership for data models or incidents leads to delays in finding the right person to address an issue.
    • Fix: Implement a clear ownership model (e.g., Supabase table mapping models to teams/individuals). Enforce this ownership in data contracts and alert routing. Establish on-call rotations.
  4. Stale Runbooks:
    • Problem: Runbooks that are outdated or incomplete are useless during an incident, leading to confusion and slower resolution.
    • Fix: Treat runbooks as living documents. Integrate runbook updates into post-incident reviews. Conduct regular fire drills to test and validate runbooks, making updates a mandatory outcome.
  5. Lack of Business Context:
    • Problem: Data teams might fix technical issues without understanding the business impact, leading to misprioritization or ineffective solutions.
    • Fix: Include business impact and downstream dashboards in data contracts. Share observability dashboards with sales and leadership. Foster cross-functional collaboration and communication during incidents.

Data Contract Example

model: fct_pipeline
freshness_budget_minutes: 30
owner:
  primary: "@marina"
  backup: "@daniel"
schema:
  - name: opportunity_id
    type: string
    constraints:
      - not_null
  - name: stage
    type: string
    allowed_values: [Prospecting, Discovery, Eval, Commit, Closed Won, Closed Lost]
pii: false
downstream:
  dashboards: [metabase://revops/pipeline, tableau://forecast/exec]
severity: P0

Contracts feed both dbt tests and alert routing; if severity is P0, PagerDuty fires immediately, ensuring rapid response for critical issues.
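
To make the "contracts feed tests and routing" idea concrete, here is a minimal Python sketch that checks rows against the contract and picks an alert channel from its severity. The contract is written as a plain dict so the example runs without a YAML parser. `validate_row` and `alert_channel` are hypothetical helpers, not part of dbt or Elementary.

```python
CONTRACT = {
    "model": "fct_pipeline",
    "freshness_budget_minutes": 30,
    "owner": {"primary": "@marina", "backup": "@daniel"},
    "schema": [
        {"name": "opportunity_id", "type": "string", "constraints": ["not_null"]},
        {
            "name": "stage",
            "type": "string",
            "allowed_values": [
                "Prospecting", "Discovery", "Eval", "Commit", "Closed Won", "Closed Lost",
            ],
        },
    ],
    "severity": "P0",
}


def validate_row(row: dict, contract: dict) -> list:
    """Return human-readable violations for one row; an empty list means the row is valid."""
    violations = []
    for column in contract["schema"]:
        name = column["name"]
        value = row.get(name)
        if "not_null" in column.get("constraints", []) and value is None:
            violations.append(f"{name} is null")
        allowed = column.get("allowed_values")
        if allowed is not None and value is not None and value not in allowed:
            violations.append(f"{name} has unexpected value {value!r}")
    return violations


def alert_channel(contract: dict) -> str:
    """P0 contracts page immediately; other severities go to Slack."""
    return "pagerduty" if contract["severity"] == "P0" else "slack"


print(validate_row({"opportunity_id": None, "stage": "Negotiation"}, CONTRACT))
print(alert_channel(CONTRACT))
```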

Alert Enrichment

Raw alerts are often useless without context. We enrich them to be immediately actionable:

  • Last successful run timestamp + warehouse job ID.
  • Query to reproduce (dbt run command) or inspect the issue.
  • Impacted dashboards and reverse ETL jobs, showing the business impact.
  • Suggested remediation steps from the runbook, guiding the responder.

This transforms a generic “Freshness failure” into a highly informative alert: “fct_pipeline stale (60 min > 30 min budget). Last dbt run: 10:05 UTC (job 1234). Downstream: Forecast dashboard, AE scorecard. Run dbt run -s fct_pipeline after verifying ingestion int_salesforce_opps. Owner: @marina (backup @daniel).”
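
A rough Python sketch of the enrichment step: merge the raw alert with contract metadata and render the message. The dict shapes and the `upstream` key are assumptions made for illustration; in practice this step runs in n8n against metadata stored in Supabase.

```python
def enrich_alert(alert: dict, contract: dict) -> str:
    """Render a context-rich alert message from a raw alert plus contract metadata."""
    owner = contract["owner"]
    return (
        f"{alert['model']} stale ({alert['age_minutes']} min > "
        f"{contract['freshness_budget_minutes']} min budget). "
        f"Last dbt run: {alert['last_run']} (job {alert['job_id']}). "
        f"Downstream: {', '.join(contract['downstream'])}. "
        f"Run dbt run -s {alert['model']} after verifying ingestion {contract['upstream']}. "
        f"Owner: {owner['primary']} (backup {owner['backup']})."
    )


# Hypothetical payloads; field names are not taken from any real webhook schema.
raw_alert = {"model": "fct_pipeline", "age_minutes": 60, "last_run": "10:05 UTC", "job_id": 1234}
contract_meta = {
    "freshness_budget_minutes": 30,
    "downstream": ["Forecast dashboard", "AE scorecard"],
    "upstream": "int_salesforce_opps",
    "owner": {"primary": "@marina", "backup": "@daniel"},
}
message = enrich_alert(raw_alert, contract_meta)
print(message)
```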

Metrics & Telemetry

The success of our RevOps Analytics Observability program is quantified through several key metrics:

  • Detection lead over the business: we catch incidents an average of 35 minutes before sales or leadership notices, demonstrating the effectiveness of proactive monitoring.
  • Blocked launches from bad metrics: 0 in Q1 (previous quarter had 3), indicating improved data quality and reliability.
  • Model ownership coverage: 3 engineers can maintain all tables thanks to documentation and clear ownership, improving team efficiency.
  • Freshness SLA adherence: 98% for P0 models, ensuring critical data is always up-to-date.
  • Alert-to-acknowledge time: 5 minutes median, reflecting rapid response by on-call teams.
  • False-positive rate: <6% after tuning Metaplane thresholds, reducing alert fatigue.
  • Mean time to recovery (MTTR): 28 minutes (down from 76), showcasing faster incident resolution.
  • % of incidents with documented postmortems: 100%, fostering a culture of continuous learning and improvement.
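
For readers who want to compute similar numbers, here is a small Python sketch of how metrics like these could be derived from an incident log. The sample records are made up for illustration and are not our incident data. The MTTR reduction at the end uses only the 76 and 28 minute figures quoted above.

```python
from statistics import mean, median


def summarize(incidents: list) -> dict:
    """Compute median ack time, MTTR over real incidents, and the false-positive rate."""
    real = [i for i in incidents if not i["false_positive"]]
    return {
        "median_ack_minutes": median(i["ack_minutes"] for i in incidents),
        "mttr_minutes": mean(i["recovery_minutes"] for i in real),
        "false_positive_rate": (len(incidents) - len(real)) / len(incidents),
    }


# Made-up sample records (illustrative only).
sample = [
    {"ack_minutes": 3, "recovery_minutes": 20, "false_positive": False},
    {"ack_minutes": 5, "recovery_minutes": 31, "false_positive": False},
    {"ack_minutes": 8, "recovery_minutes": 33, "false_positive": False},
    {"ack_minutes": 5, "recovery_minutes": 0, "false_positive": True},
]
summary = summarize(sample)
print(summary)

# MTTR improvement from the figures above: 76 minutes down to 28.
mttr_reduction_pct = round((76 - 28) / 76 * 100, 1)
print(f"MTTR reduction: {mttr_reduction_pct}%")
```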

Communication Flow

A standardized communication flow ensures all stakeholders are informed during an incident:

  1. Alert hits PagerDuty → on-call acknowledges within the defined SLA.
  2. Slack bot posts to #revops-ops with enriched context + /join incident button, initiating incident coordination.
  3. IC opens Notion incident doc (auto-filled with metadata) and starts the timeline, creating a live record of the incident.
  4. Comms lead posts updates every 15 minutes in #exec-briefing until resolved, managing executive expectations.
  5. Once fixed, root cause + prevention steps go into a Linear ticket linked to the runbook, ensuring follow-up actions are tracked.

Runbook Template (Notion)

Our runbook template ensures consistency and provides immediate guidance during an incident:

  1. Summary (What failed, severity, impacted metrics).
  2. Timeline (Timestamps + actions taken).
  3. Detection (Alert source, queries used for diagnosis).
  4. Diagnosis (Steps taken, logs inspected, root cause identified).
  5. Fix (Commands run, code changes, rollback procedures).
  6. Prevention (Tests/monitoring to add to prevent recurrence).
  7. Next Steps (Owners + due dates for follow-up actions).

We keep runbooks lightweight; responders copy/paste the template at incident start, ensuring they have a structured approach from the outset.

Case Study: Ingestion Lag During SKO

  • Scenario: During Sales Kickoff (SKO), we experienced a surge in demo traffic, causing Fivetran to hit API limits. This led to our pipeline model freshness exceeding its 30-minute budget, reaching 90 minutes.
  • Detection: Elementary's freshness test burned its budget, triggering an alert. Metaplane simultaneously flagged an unusual ingestion lag, indicating a broader data pipeline issue. n8n orchestrated the creation of a PagerDuty incident.
  • Response: The on-call team followed the fct_pipeline runbook: checking Fivetran logs, switching to a backup ingestion job, and running catch-up dbt models.
  • Outcome: Dashboards refreshed within 25 minutes, and executives received a proactive update from the Comms Lead. We subsequently added an auto-scaling rule for Fivetran and tuned alert thresholds for SKO season.

Governance & Reviews

Robust governance and regular reviews are essential for maintaining a healthy observability program:

  • Weekly health review: A quick sync to review open incidents, SLO burn, and upcoming risky releases, ensuring proactive management of data health.
  • Monthly SLA scorecard: Shared with the CRO + Finance to demonstrate data reliability metrics and secure continued buy-in from leadership.
  • Quarterly contract audit: Confirm owners are still correct, deprecate unused tables, and update severity labels, keeping our data contracts current.
  • Postmortems: Blameless, published in Notion alongside the runbooks, with action items tracked in Linear, fostering a culture of continuous learning and improvement.

Cost Snapshot

The investment in RevOps Analytics Observability is a fraction of the cost of data-related incidents and misinformed business decisions.

  • Metaplane: ~$600/month (covers data volumes + seats for ML-based anomaly detection).
  • Elementary: Open-source + ~$50/month for hosting/warehouse compute (for dbt data quality tests).
  • PagerDuty: Part of broader org plan (~$150/month allocated to RevOps for critical alerting).
  • n8n on Fly.io: ~$15/month (for incident automation workflows).
  • Supabase: ~$25/month Pro plan for metadata storage and webhook triggers.
  • Notion (existing subscription): No incremental cost for runbooks and postmortems.
  • Engineering/Data Team Time: Approximately 0.5-1 day per week for maintenance, contract refinement, and incident response.

The total incremental tooling cost is less than $900/month. This is a minimal investment compared to the potential losses from an eight-figure pipeline being driven by untrustworthy data. The ROI is clear in reduced incident frequency, faster resolution, and increased confidence in our revenue metrics.
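
As a quick check of the total, the line items above can be summed in a few lines of Python. The amounts are the approximate monthly figures listed; team time is excluded because it is not a tooling cost.

```python
# Approximate monthly tooling costs in USD, taken from the list above.
monthly_costs_usd = {
    "Metaplane": 600,
    "Elementary hosting/compute": 50,
    "PagerDuty (RevOps allocation)": 150,
    "n8n on Fly.io": 15,
    "Supabase Pro": 25,
    "Notion (existing subscription)": 0,
}

total = sum(monthly_costs_usd.values())
print(f"Total incremental tooling: ~${total}/month")
print(f"Under $900/month: {total < 900}")
```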

FAQ

Q: How do you prevent alert fatigue with so many monitoring tools?
A: We use a tiered alerting system. Metaplane's ML-based detection is tuned to minimize false positives, and Elementary's alerts are configured with severity levels. Only P0 incidents trigger PagerDuty; lower-severity issues go to Slack. We also continuously tune thresholds and review alerts in our weekly health reviews.

Q: What's the process for adding a new critical data model to the observability program?
A: When a new critical data model is introduced, it must first have a defined data contract in YAML, including owners, freshness budgets, and severity. Then, dbt tests are implemented, and Metaplane is configured to monitor its key metrics. Finally, an n8n workflow is set up to automate incident creation and alerting.

Q: How do you ensure runbooks stay up-to-date?
A: Runbooks are treated as living documents. Every incident postmortem includes a mandatory step to review and update the relevant runbook. We also conduct quarterly fire drills where runbooks are actively tested and refined based on the drill's outcomes.

Q: Can sales or marketing teams access these observability dashboards?
A: Yes, our Metabase dashboards for pipeline health and SLA adherence are embedded in the RevOps wiki and accessible to sales and marketing leadership. This transparency fosters trust and allows them to proactively monitor data health without needing to ask.

Q: What happens if an incident occurs outside of business hours?
A: Our PagerDuty integration ensures 24/7 coverage. On-call engineers and RevOps analysts are part of a rotating schedule, ensuring that critical incidents are acknowledged and addressed promptly, regardless of the time of day.

What stuck with me

  • Alerts need context; I include queries and next steps every time. Generic alerts are useless. Providing immediate, actionable information drastically reduces MTTR.
  • Document budgets so teams stop arguing about severity. Clear SLAs and error budgets provide objective criteria for prioritizing reliability work and managing expectations.
  • Observability is a team sport—RevOps, data, and engineering co-own the pipeline. Breaking down silos and fostering shared responsibility is crucial for a healthy data ecosystem.
  • Drills matter: simulate failures so the process is second nature. Practicing incident response in a controlled environment builds muscle memory and confidence, leading to calmer, faster resolutions during real incidents.

What I'm building next

I'm exploring GitHub hooks that block PRs when contracts break (e.g., removing a column without updating YAML), enforcing data quality at the code level. I'm also prototyping a “data status” widget so sales sees pipeline health next to their forecast, providing real-time transparency directly within their workflow. Want to beta it? Reach out.
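
One way such a PR check might work, sketched in Python: compare the columns declared in the contract with the columns the model actually produces, and fail when they drift apart. This is an exploratory sketch, not the hook itself. `contract_drift` and `pr_should_fail` are hypothetical names, and how the two column lists would be extracted from YAML and dbt artifacts is left out.

```python
from typing import Dict, Iterable, List


def contract_drift(contract_columns: Iterable[str], model_columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report columns that exist on only one side of the contract/model pair."""
    contract, model = set(contract_columns), set(model_columns)
    return {
        "removed_from_model": sorted(contract - model),
        "missing_from_contract": sorted(model - contract),
    }


def pr_should_fail(drift: Dict[str, List[str]]) -> bool:
    """Block the PR when the model and its contract disagree in either direction."""
    return bool(drift["removed_from_model"] or drift["missing_from_contract"])


# Example: a PR drops `stage` from the model without touching the contract.
drift = contract_drift(["opportunity_id", "stage"], ["opportunity_id"])
print(drift, pr_should_fail(drift))
```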


Want me to help you replicate this module? Drop me a note and we’ll build it together.
