Marsala

Guide · Dec 6, 2024 · 7 min read

Data Quality Fire Drills for Growth Teams

I run data fire drills so nobody panics when a real incident hits.

Guide for executing data-quality fire drills: scripts, roles and learnings.

By Marina Álvarez
#Data #Ops


Data crises are easier when you already rehearsed them.

Context

Whenever a dashboard broke, Slack turned into chaos: growth blamed data, data blamed engineering, and leadership demanded hourly updates. We realized our “incident process” was a Google Doc nobody opened. This reactive approach led to prolonged outages, misinformed decisions, and a significant erosion of trust in our data. The cost of a single data quality incident—whether it was misattributed revenue, incorrect marketing spend, or a broken customer journey—was far too high, both financially and in terms of team morale.

I borrowed from SRE playbooks and instituted monthly data quality fire drills so growth, analytics, and engineering teams could practice together. These drills simulate real-world data corruption, dashboard drops, and schema changes. The goal is to build muscle memory, ensuring that when production data breaks, we default to a calm, coordinated response instead of panic. This proactive approach has transformed our data incident response, making our teams more resilient and our data more trustworthy.

Stack I leaned on

  • Supabase clones for staging: Supabase provides an isolated, disposable environment for staging our data. This allows us to inject corrupt data or simulate failures without impacting production systems, making it safe to run realistic drills.
  • dbt seeds with corrupt data: We use dbt (data build tool) to manage our data transformations. For drills, we create dbt seeds containing intentionally corrupt data (e.g., duplicate leads, null values in critical fields) to simulate various data quality issues.
  • PagerDuty for simulations: PagerDuty is used to simulate real-world alerting. We configure a "drill" service that triggers alerts to our on-call team, complete with simulated notifications and escalation policies, mirroring actual incident response.
  • Notion as the incident logbook: Notion serves as our central repository for incident documentation. During drills, it acts as a live logbook where the Scribe records events, decisions, and follow-up tasks, ensuring a comprehensive audit trail for post-drill retrospectives.
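To make the dbt-seed idea concrete, here is a minimal sketch of the corruption step as a standalone script. The function name and parameters (`corrupt_seed`, `dup_rate`, `null_field`) are hypothetical; in practice this logic lives in dbt seeds and macros, but the shape of the injected damage is the same:

```python
import random

def corrupt_seed(rows, dup_rate=0.2, null_field="email", null_rate=0.1, seed=42):
    """Return a copy of seed rows with injected duplicates and blanked fields."""
    rng = random.Random(seed)  # fixed seed so a drill scenario is reproducible
    corrupted = []
    for row in rows:
        corrupted.append(dict(row))
        if rng.random() < dup_rate:      # duplicate some leads outright
            corrupted.append(dict(row))
    for row in corrupted:
        if rng.random() < null_rate:     # blank out a critical field
            row[null_field] = ""
    return corrupted

leads = [{"id": str(i), "email": f"user{i}@example.com"} for i in range(100)]
bad = corrupt_seed(leads)
print(len(leads), len(bad))
```

Versioning the seed value alongside the scenario means the same corruption can be replayed in a later drill to measure improvement.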

Symptoms We Target

Our data quality fire drills are designed to address specific, common symptoms of data degradation that directly impact growth teams:

  1. Freshness gaps: Dashboards stuck at T-24 hours because ingestion paused or a critical transformation failed. This leads to outdated insights and delayed decision-making.
  2. Silent schema drift: Product teams add or modify a column, dbt tests fail, but alerts never fire, leading to broken downstream dashboards or reports.
  3. Duplicate leads: Marketing automation double-counts conversions, skewing Customer Acquisition Cost (CAC) metrics and leading to inefficient spend.
  4. Incorrect filters: Looker explores or other BI tools referencing deprecated fields, leading to inaccurate segmentation or reporting.

Each drill focuses on one of these symptoms, allowing the team to practice detection, communication, and resolution for a specific type of data incident.
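The freshness-gap symptom, for example, reduces to a simple SLA check on a table's last load timestamp. A sketch, with an assumed 24-hour SLA and illustrative timestamps:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)  # assumed SLA for this sketch

def freshness_gap(last_loaded_at, now):
    """Return how far past the SLA a table is, or None if it is fresh."""
    age = now - last_loaded_at
    return age - FRESHNESS_SLA if age > FRESHNESS_SLA else None

now = datetime(2024, 12, 6, 12, 0, tzinfo=timezone.utc)
stale = freshness_gap(datetime(2024, 12, 5, 10, 0, tzinfo=timezone.utc), now)  # loaded 26h ago
fresh = freshness_gap(datetime(2024, 12, 6, 9, 0, tzinfo=timezone.utc), now)   # loaded 3h ago
print(stale, fresh)  # 2:00:00 None
```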

Roles & RACI

A clear RACI (Responsible, Accountable, Consulted, Informed) matrix is crucial for effective incident response during drills and real incidents:

| Role | Responsibilities |
|------|------------------|
| Incident Commander (IC) | Accountable for the overall incident. Owns the timeline, decides when to escalate, manages the Zoom room, and keeps the team focused. |
| Data Detective | Responsible for technical investigation. Digs into logs, dbt runs, and Supabase clones to pinpoint the root cause of the data issue. |
| Communications Lead | Responsible for internal and external updates. Posts to #growth-ops and the leadership channel every 15 minutes, managing stakeholder expectations. |
| Scribe | Responsible for documentation. Logs events, decisions, and follow-ups in the Notion incident doc, creating a comprehensive audit trail. |
| Shadow | A rotating trainee who observes and learns the playbook. Vital for cross-training and building team resilience. |

Everyone rotates through these roles, ensuring that the process survives vacations, team changes, and builds a shared understanding of incident response.

Scenario Design

Effective fire drills begin with well-designed scenarios that accurately mimic real-world challenges:

  1. Write a press release: Describe what leadership sees (e.g., “ARR dashboard shows zero for EU”) and what assumptions they’ll likely make. This helps the team understand the business impact and urgency.
  2. Outline injected failure: Detail the specific data corruption or system failure to be simulated (e.g., run a dbt seed that multiplies revenue by 0.1, or delay a Fivetran sync for a critical table).
  3. Define success criteria: Establish clear, measurable objectives for the drill (e.g., detect within X minutes, fix within Y minutes, produce a comprehensive Root Cause Analysis (RCA) summary).
  4. Prepare artifacts: Create realistic supporting materials such as a corrupted CSV file, a fake PagerDuty alert, or simulated Slack messages from a "sales leader" asking for status updates.

We store these scenarios in Notion with difficulty ratings and reuse them annually, adapting them as our data stack evolves.
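The four design steps above map naturally onto a small record per scenario. A hypothetical sketch of that record in Python (field names and the example values are illustrative, not our actual Notion schema):

```python
from dataclasses import dataclass, field

@dataclass
class DrillScenario:
    name: str
    press_release: str        # what leadership sees
    injected_failure: str     # the automation to run
    detect_within_min: int    # success criteria
    fix_within_min: int
    artifacts: list = field(default_factory=list)
    difficulty: int = 1       # 1 (routine) to 5 (black swan)

dup_leads = DrillScenario(
    name="duplicate-leads",
    press_release="CAC dashboard spikes; marketing suspects double-counted conversions.",
    injected_failure="dbt seed duplicates 20% of contacts in the Supabase clone",
    detect_within_min=45,
    fix_within_min=90,
    artifacts=["corrupted_contacts.csv", "fake PagerDuty alert"],
    difficulty=2,
)
print(dup_leads.name, dup_leads.detect_within_min)
```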

Drill Execution

The execution of a data quality fire drill is a structured process designed to maximize learning and minimize disruption:

  1. Kickoff: Send a calendar hold and context 24 hours prior, ensuring participants block time. Crucially, do not reveal the drill details beforehand – surprise matters for realistic simulation.
  2. Inject issue: Run automation that corrupts staging data or pauses an Airbyte sync. PagerDuty fires a simulated alert to the on-call team.
  3. Run response: The IC opens a dedicated Zoom room, assigns roles, and starts the incident timeline. The Comms Lead posts standardized updates (e.g., “T+5: investigating ingestion job xyz”).
  4. Escalate intentionally: Halfway through, we introduce a curveball (e.g., legal requests an immediate impact analysis) to test communication load and decision-making under pressure.
  5. Resolve + rollback: Once the root cause is identified, the team applies the fix to the staging environment, prepares production steps, and confirms metrics have recovered.
  6. Retro: An immediate debrief (15 min) captures what went well, what broke, and which runbooks or tests need to be updated. This feedback loop is critical for continuous improvement.
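The incident timeline the IC starts in step 3 follows a simple T+minutes convention. A minimal Scribe helper, with illustrative timestamps:

```python
from datetime import datetime, timezone

class IncidentTimeline:
    """Minimal Scribe helper: records events as T+minutes entries."""

    def __init__(self, started_at):
        self.started_at = started_at
        self.entries = []

    def log(self, at, message):
        minutes = int((at - self.started_at).total_seconds() // 60)
        entry = f"T+{minutes}: {message}"
        self.entries.append(entry)
        return entry

t0 = datetime(2024, 12, 6, 10, 0, tzinfo=timezone.utc)
timeline = IncidentTimeline(t0)
timeline.log(datetime(2024, 12, 6, 10, 5, tzinfo=timezone.utc), "investigating ingestion job xyz")
print(timeline.entries[0])  # T+5: investigating ingestion job xyz
```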

Tooling Details

Our tooling stack is designed to support realistic simulations and efficient incident response:

  • Supabase clones let us mess with data without touching production. Each drill uses a fresh clone with labeled timestamps, ensuring a clean slate for every simulation.
  • dbt seeds + macros generate corruption (duplicate rows, nulls, schema drift). We version these scenarios so we can reproduce them later and track improvements.
  • PagerDuty has a "drill" service with a fake on-call schedule. Alerts include runbook links, so responders practice using them in a simulated environment.
  • Notion incident template collects the timeline, metrics impact, screenshots, Slack links, and follow-up tasks. The Scribe fills it live, creating a comprehensive record.
  • Metabase dashboard tracks detection time, resolution time, and participants per drill to measure improvement over time and identify areas for further training.
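Firing the simulated alert comes down to a PagerDuty Events API v2 `trigger` event. The sketch below only builds the payload; the routing key and runbook URL are placeholders, and the actual send is a POST to `https://events.pagerduty.com/v2/enqueue`:

```python
def build_drill_alert(summary, runbook_url, routing_key="DRILL_SERVICE_KEY"):
    """Build a PagerDuty Events API v2 trigger payload for the drill service.

    routing_key is a placeholder for the drill service's integration key.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"[DRILL] {summary}",  # prefix so nobody mistakes it for prod
            "source": "fire-drill-automation",
            "severity": "warning",
        },
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }

alert = build_drill_alert(
    "revenue_daily freshness > 24h",
    "https://example.com/runbooks/freshness",  # placeholder runbook link
)
print(alert["payload"]["summary"])  # [DRILL] revenue_daily freshness > 24h
```

The `[DRILL]` prefix and the dedicated service are what keep simulated pages visually distinct from real ones.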

Communication Templates

Standardized communication templates are vital for clear, concise, and timely updates during incidents:

Channel update (Comms lead)
T+10 — Investigating dashboards/growth ARR showing 0 for EU.
Suspect: freshness delay in `revenue_daily` table (last load T-26h).
Next update in 15 min. IC: @marina, Detective: @daniel.
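The update above is predictable enough to template. A sketch of a formatter the Comms Lead (or a bot) could use; the function name and handles are illustrative:

```python
def comms_update(minutes, status, suspect, ic, detective, next_update_min=15):
    """Render a standardized Comms Lead update for #growth-ops."""
    return (
        f"T+{minutes} — {status}\n"
        f"Suspect: {suspect}\n"
        f"Next update in {next_update_min} min. IC: @{ic}, Detective: @{detective}."
    )

msg = comms_update(
    10,
    "Investigating dashboards/growth ARR showing 0 for EU.",
    "freshness delay in `revenue_daily` table (last load T-26h)",
    "marina",
    "daniel",
)
print(msg)
```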

Leadership knows to expect updates every 15 minutes during drills and real incidents, fostering trust and reducing anxiety.

Automation for Drills

Automation streamlines the setup and execution of fire drills, ensuring consistency and reducing manual overhead:

  • n8n playbook selects a scenario, spins up a Supabase clone, runs a dbt corruption macro, and fires a PagerDuty alert. This automates the "injection" phase of the drill.
  • Slack bot posts direct messages to participants: “Drill in progress. Join Zoom link. Roles: ...” This ensures everyone is quickly informed and assigned their roles.
  • GitHub action opens a dummy Pull Request (PR) representing the fix. Reviewers practice code review under time pressure, simulating a real-world hotfix scenario.
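The first step of that n8n playbook, picking a scenario, can be sketched as follows. The catalog and difficulty ratings are illustrative stand-ins for what we keep in Notion:

```python
import random

# Scenario catalog with difficulty ratings, as stored in Notion (illustrative)
SCENARIOS = {
    "freshness-gap": 1,
    "schema-drift": 2,
    "duplicate-leads": 2,
    "vendor-outage": 4,
}

def pick_scenario(max_difficulty, rng=None):
    """Pick a random eligible scenario at or below the requested difficulty."""
    rng = rng or random.Random()
    eligible = sorted(n for n, d in SCENARIOS.items() if d <= max_difficulty)
    return rng.choice(eligible)

choice = pick_scenario(2, random.Random(7))
print(choice)
```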

Automation keeps setup consistent and makes it easy to run drills even when I’m on vacation, ensuring the program's continuity.

Measuring Progress

We rigorously measure the effectiveness of our data quality fire drills to demonstrate their value and drive continuous improvement:

  • Time to detect real incidents: Reduced by 40% (measured using Metabase + PagerDuty timestamps), indicating improved vigilance and alert tuning.
  • Mean time to communicate: Went from 18 min → 6 min for the first leadership update, reflecting better communication protocols.
  • Runbooks updated: 100% of drills result in at least one improved document or test, ensuring our playbooks are always current.
  • Confidence surveys: Participants rate their readiness afterward; the average jumped from 6/10 to 9/10, indicating increased team confidence.

We share these metrics in our monthly ops review to prove the drills pay off and secure continued buy-in from leadership.
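The detection and communication metrics fall out of simple timestamp arithmetic on the PagerDuty and Metabase records. A sketch with made-up drill data (the real numbers come from our Metabase dashboard):

```python
from datetime import datetime, timezone
from statistics import mean

def minutes_between(start, end):
    return (end - start).total_seconds() / 60

# Two illustrative drill records: injection, detection, first leadership update
drills = [
    {"injected": datetime(2024, 10, 1, 10, 0, tzinfo=timezone.utc),
     "detected": datetime(2024, 10, 1, 10, 30, tzinfo=timezone.utc),
     "first_update": datetime(2024, 10, 1, 10, 18, tzinfo=timezone.utc)},
    {"injected": datetime(2024, 11, 1, 10, 0, tzinfo=timezone.utc),
     "detected": datetime(2024, 11, 1, 10, 18, tzinfo=timezone.utc),
     "first_update": datetime(2024, 11, 1, 10, 6, tzinfo=timezone.utc)},
]

time_to_detect = [minutes_between(d["injected"], d["detected"]) for d in drills]
time_to_comms = [minutes_between(d["injected"], d["first_update"]) for d in drills]
print(mean(time_to_detect), time_to_comms)  # 24.0 [18.0, 6.0]
```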

Sample Scenario: Duplicate Leads

This scenario highlights how a drill can address a specific, common data quality issue:

  1. Injection: Inject duplicate contacts into a Supabase clone via a dbt seed.
  2. Impact: A fake HubSpot workflow sends two welcome emails to the same "lead"; marketing sees a spike in unsubscribes and a skewed CAC.
  3. Objective: Detect the duplication, identify the root cause (e.g., a misconfigured enrichment job), and patch the workflow.
  4. Success criteria: Incident opened <45 min, fix <90 min, follow-up PR to add a uniqueness test + monitoring.
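Detection in this scenario reduces to finding repeated contact keys after normalization. A minimal sketch (field names are illustrative):

```python
from collections import Counter

def find_duplicate_leads(contacts, key="email"):
    """Return normalized values of `key` that appear more than once."""
    counts = Counter(c[key].strip().lower() for c in contacts if c.get(key))
    return {value for value, n in counts.items() if n > 1}

contacts = [
    {"id": "1", "email": "ana@example.com"},
    {"id": "2", "email": "Ana@Example.com "},  # same lead, different casing
    {"id": "3", "email": "bo@example.com"},
]
print(find_duplicate_leads(contacts))  # {'ana@example.com'}
```

The follow-up uniqueness test in the success criteria is essentially this check made permanent as a dbt test.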

This scenario helped marketing, data, and RevOps practice collaborating instead of pointing fingers, fostering a culture of shared responsibility.

Retrospective Template

Our post-drill retrospectives are blameless and action-oriented, using a structured template to capture key learnings:

We answer five prompts:

  1. What signals alerted us? (Which metrics/alerts fired, were they clear and actionable?)
  2. Where did we lose time? (Was it tooling, access, decision paralysis, or unclear roles?)
  3. Which runbooks/tests need updates? (Identify specific documentation or automated tests to improve.)
  4. What would leadership/customers have felt? (Empathize with stakeholders to understand the real-world impact.)
  5. What experiment or automation will we ship before next drill? (Commit to concrete improvements.)

Answers turn into Jira tickets with owners and due dates, ensuring that learnings are translated into tangible actions.

Cultural Buy-In

Achieving cultural buy-in for data quality fire drills requires proactive communication and a focus on team empowerment:

  • Exec briefing: Before starting the program, we informed leadership to expect simulated alerts so they wouldn’t escalate to the board during a drill. This managed expectations and built trust.
  • Gamification: We keep a leaderboard showing “fastest detection” and “best comms” to keep morale high and encourage healthy competition.
  • Inclusion: We rotate in PMMs, engineers, and even finance team members so everyone learns how data incidents ripple across the organization.

After three drills, people started volunteering because they saw how calm real incidents became, transforming a chore into a valuable team activity.

Scheduling Cadence & Variations

To maintain readiness and adapt to evolving risks, we employ a varied scheduling cadence:

  • Monthly core drill: Always covers ingestion/data warehouse stack, focusing on fundamental data quality issues.
  • Quarterly “black swan”: Simulate a cloud outage, vendor lockout, or permissions issue – forcing us to test backup pipelines and data export plans for extreme scenarios.
  • Shadow pager: Once per month, a new teammate carries the on-call phone for 24 hours (with supervision) to build confidence and practical experience.
  • Async tabletop: A remote-friendly version where we walk through a scenario via Notion timelines instead of live data injection, suitable for distributed teams.

We alternate office-hours-style drills with surprise ones so nobody can script their response. The key is consistency: skip a month and the muscle memory fades fast.

What stuck with me

  • Warn leadership ahead of time to avoid unnecessary escalations. Proactive communication is key to managing expectations and building trust, especially when simulating incidents.
  • Document immediately or the drill evaporates from memory. The Scribe role and Notion logbook are critical for capturing learnings in real-time, ensuring they are not lost.
  • Incidents are team sports—run drills cross-functionally, not just within data. Data quality impacts everyone, so involving diverse teams fosters empathy and a shared sense of responsibility.

Cost Snapshot

The investment in data quality fire drills is minimal compared to the cost of a single, unmanaged data incident.

  • Supabase (Pro Plan): ~$25/month (for disposable staging clones).
  • dbt Cloud (Developer License): ~$100/month (for managing data transformations and injecting corrupt seeds).
  • PagerDuty (Essentials Plan): ~$200/month (for simulated alerts and on-call rotations).
  • Notion (Team Plan): ~$10/month (for incident logbooks and scenario documentation).
  • n8n (self-hosted on Fly.io): ~$20/month (for automating drill setup and injection).
  • Engineering/Data Team Time: Approximately 0.5-1 day per month for scenario design, drill execution, and retro.

The total incremental tooling cost is roughly $355/month, comfortably under $400. This is a small price to pay for preventing costly data outages, improving decision-making, and building a resilient data culture. One avoided misinformed marketing campaign or sales forecast easily covers the annual cost.

FAQ

Q: How do you ensure the drills are realistic without causing actual damage? A: We use isolated Supabase clones for staging environments, ensuring that any data corruption or system failures are contained and do not impact production. All alerts are routed through a dedicated "drill" service in PagerDuty, clearly marked as simulations.

Q: What if a team member is new and unfamiliar with the process? A: New team members participate as "Shadows" first, observing and learning from experienced responders. We also provide SRE 101 workshops and encourage shadowing shifts before anyone carries the pager independently.

Q: How do you prevent alert fatigue from the simulated alerts? A: Drill alerts are clearly distinguished from production alerts (e.g., different PagerDuty service, specific Slack channel). We also ensure drills are scheduled and communicated in advance (without revealing the scenario) to manage expectations.

Q: What happens if a drill uncovers a real production issue? A: If a drill accidentally uncovers a real production issue, the drill is immediately paused, and the team transitions into a real incident response. This is rare but highlights the value of continuous vigilance.

Q: How do you get buy-in from busy executives for these drills? A: We present the drills as a proactive measure to reduce the financial and reputational risks of data incidents. We share metrics on reduced incident response times and improved data quality, demonstrating a clear ROI. Briefing them beforehand about simulated alerts also helps manage expectations.

What I'm building next

I'm publishing a library of ready-made scenarios (CSV corruption, reverse ETL outage, GA4 quota issues) plus the n8n scripts that inject them. This will allow other teams to easily adopt and run their own data quality fire drills. I'm also exploring integrating AI-powered anomaly detection directly into our dbt tests, providing even earlier warnings of potential data issues. Want to contribute or get early access to these resources? DM me.


Want me to help you replicate this module? Drop me a note and we’ll build it together.

Marsala OS

Ready to turn this insight into a live system?

We build brand, web, CRM, AI, and automation modules that plug into your stack.

Talk to our team