Tutorial · Jan 20, 2025 · 9 min read

Turn PostHog into a Mini CDP

I use PostHog, Kafka and Resend to activate audiences without a six-figure CDP.

Tutorial for structuring events, streaming into Kafka and activating channels directly from PostHog.

By Marina Álvarez
#Analytics #Growth #Automation

I hacked PostHog until it behaved like a mini CDP and saved us $60k a year.

Context

We needed real-time audiences for lifecycle emails, ad exclusions, and product nudges, but commercial Customer Data Platform (CDP) pricing (Segment Personas, mParticle, Braze) didn’t fit our budget or our agile development philosophy. We were already collecting events in PostHog, but we had no activation layer to segment users and trigger personalized experiences across channels. Marketing and product teams worked from fragmented data and manual audience exports, so campaigns launched late and ad spend was wasted.

To close that gap, I extended PostHog into a "mini CDP." The core idea was to keep PostHog's event collection and cohorting features and add a scalable streaming layer (Kafka), a low-latency audience store (Firestore), and a set of activation microservices. The goal was ambitious: unify tracking and user traits, build dynamic cohorts, score people based on their behavior, and activate these audiences across email, paid media, and in-product messaging, all without the six-figure cost of a traditional CDP. The result saves us roughly $60k a year and gives us direct control over how customer data is activated.

Stack I leaned on

  • PostHog Cloud (events, cohorts, feature flags): PostHog serves as the foundation of our mini CDP, collecting all raw events from our web and mobile applications. Its built-in cohorting and feature flag capabilities are leveraged for initial segmentation and in-product personalization.
  • Kafka (Confluent Cloud) streaming destination: Kafka acts as the central nervous system for our event data. PostHog streams all events into Kafka topics, providing a highly scalable, fault-tolerant, and real-time data pipeline that decouples event ingestion from processing.
  • Cloud Run consumers (TypeScript) validating and transforming payloads: Google Cloud Run hosts our TypeScript microservices that consume events from Kafka. These consumers perform critical tasks such as schema validation, enrichment with CRM traits (e.g., from Attio), PII masking, and normalization before storing the data in our audience stores.
  • Firestore for low-latency audience storage: Firestore is chosen for its real-time capabilities and low-latency queries, making it ideal for storing dynamic audience segments and user traits that need to be accessed quickly by activation services.
  • Supabase for long-term profiles and consent states: Supabase provides a robust PostgreSQL database with built-in authentication and Row Level Security (RLS). It stores comprehensive, long-term user profiles, historical event data, and crucial consent states, serving as our auditable source of truth.
  • Resend + Meta Marketing API for activation (email + paid): Resend is integrated for sending highly personalized lifecycle emails, leveraging our React Email component library. The Meta Marketing API is used to ingest hashed audiences for targeted advertising campaigns on Facebook and Instagram.
  • n8n orchestrating QA + refresh jobs: n8n is our workflow automation tool, used for orchestrating various operational tasks within the mini CDP. This includes scheduling weekly QA scripts to validate audience accuracy and triggering daily refreshes of audience segments.

Architecture Overview

The PostHog Mini CDP architecture is designed for real-time event processing, flexible audience segmentation, and multi-channel activation.

  1. Tracking plan → PostHog collects canonical events/properties from all product surfaces (web, mobile, backend). This ensures a standardized and consistent data input.
  2. Kafka destination → PostHog streams all collected events into Kafka topics (one per product surface, e.g., web_events, app_events). Kafka provides a durable, scalable buffer for our event stream.
  3. Consumers → Cloud Run functions (TypeScript microservices) consume events from Kafka. They validate schemas, enrich events with CRM traits (e.g., from Attio), mask PII, and push normalized events to Firestore (for real-time access) and Supabase (for historical storage and consent management).
  4. Audience service → A dedicated Cloud Run service builds dynamic segments based on SQL-like filters defined in YAML. It queries Firestore for real-time data and exposes REST endpoints for activation services.
  5. Activation → Resend uses Firestore queries to fetch email audiences and send personalized sequences. The Meta Marketing API ingests hashed audiences for targeted advertising. Our application uses PostHog cohorts and feature flags to drive in-product messaging and experiments.
  6. Observability → Grafana monitors Kafka lag, consumer health, and pipeline uptime. Notion runbooks capture schemas and troubleshooting steps. QA scripts run weekly to ensure cohorts behave as expected.
PostHog  →  Kafka topics  →  Cloud Run consumer  →  Firestore (realtime)
                                         ↘ Supabase (historical + consent)
Firestore → Audience API → Resend / Meta / App flags

This architecture ensures a continuous flow of data from event capture to multi-channel activation, all while maintaining data integrity and privacy.
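
To make the consumer step concrete, here is a minimal sketch of the validate, enrich, mask, and fan-out sequence. The type names, the `processEvent` function, and the in-memory sinks are all hypothetical stand-ins: the real consumer would read from Kafka and write to Firestore and Supabase asynchronously, which this synchronous sketch leaves out.

```typescript
// Hypothetical shapes; the article does not publish its real types.
type Props = Record<string, unknown>;
interface RawEvent { event: string; user_id: string; properties: Props }
interface NormalizedEvent extends RawEvent { traits: Props }

// A sink stands in for a Firestore or Supabase writer.
interface Sink { name: string; write(event: NormalizedEvent): void }

interface ConsumerDeps {
  validate(event: RawEvent): boolean;
  enrich(userId: string): Props;     // e.g. CRM traits
  maskPii(properties: Props): Props; // e.g. drop or hash sensitive fields
  sinks: Sink[];
}

// Runs one event through validate -> enrich -> mask -> fan-out.
// Returns false for rejected events so the caller can decide what to do with them.
function processEvent(raw: RawEvent, deps: ConsumerDeps): boolean {
  if (!deps.validate(raw)) return false;
  const normalized: NormalizedEvent = {
    event: raw.event,
    user_id: raw.user_id,
    properties: deps.maskPii(raw.properties),
    traits: deps.enrich(raw.user_id),
  };
  for (const sink of deps.sinks) sink.write(normalized);
  return true;
}

// Usage with in-memory stand-ins for the two stores.
const realtimeStore: NormalizedEvent[] = [];
const historyStore: NormalizedEvent[] = [];
const deps: ConsumerDeps = {
  validate: (e) => e.event.length > 0 && e.user_id.length > 0,
  enrich: () => ({ plan: "Pro" }),
  maskPii: (props) => {
    const masked: Props = {};
    for (const key of Object.keys(props)) {
      if (key !== "email") masked[key] = props[key]; // drop the raw email field
    }
    return masked;
  },
  sinks: [
    { name: "firestore", write: (e) => { realtimeStore.push(e); } },
    { name: "supabase", write: (e) => { historyStore.push(e); } },
  ],
};

const accepted = processEvent(
  { event: "Feature Used", user_id: "u_1", properties: { email: "a@b.co", feature: "ai" } },
  deps
);
const rejected = processEvent({ event: "", user_id: "u_2", properties: {} }, deps);
```

Passing the steps in as dependencies keeps each one testable on its own and makes it cheap to swap a store later.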

Event + Trait Design

A well-defined event and trait design is fundamental to the effectiveness of any CDP.

  • 12 canonical events: We standardized on 12 core events that represent key user actions and lifecycle stages (e.g., Signup Started, Signup Completed, Trial Activated, Feature Used, Payment Failed). This ensures consistency across all tracking.
  • Stable IDs and context: Each event contains stable identifiers (user_id, account_id), contextual information (e.g., plan, channel), and PII tokens (hashed versions of sensitive data).
  • Traits synced via PostHog Identify + backend enrichment: User traits (e.g., ARR, health_score, CSM_owner) are synced via PostHog's identify calls and enriched from our CRM.
  • Traits stored in Supabase with RLS: Comprehensive user traits are stored in Supabase tables with Row Level Security (RLS) for granular access control. PostHog receives hashed versions of PII for privacy.
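
As an illustration of that envelope, the sketch below types an event and checks it by hand. Only the five event names quoted above are included, since the other seven are not listed; the `context` and `pii_tokens` field layout and the `validateEvent` helper are assumptions, not the production schema.

```typescript
// Five of the twelve canonical events; the remaining seven are not named in the article.
const CANONICAL_EVENTS = [
  "Signup Started",
  "Signup Completed",
  "Trial Activated",
  "Feature Used",
  "Payment Failed",
] as const;
type CanonicalEventName = typeof CANONICAL_EVENTS[number];

// Assumed envelope: stable IDs, context, and hashed PII tokens.
interface TrackedEvent {
  event: CanonicalEventName;
  user_id: string;
  account_id: string;
  context: { plan: string; channel: string };
  pii_tokens?: { email_hash?: string };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the event name, IDs, and context look valid.
function validateEvent(input: unknown): string[] {
  if (!isRecord(input)) return ["event must be an object"];
  const errors: string[] = [];
  const allowedNames: readonly string[] = CANONICAL_EVENTS;
  const eventName = input.event;
  if (typeof eventName !== "string" || allowedNames.indexOf(eventName) === -1) {
    errors.push("event is not a canonical event name");
  }
  for (const key of ["user_id", "account_id"]) {
    const value = input[key];
    if (typeof value !== "string" || value.length === 0) {
      errors.push(key + " must be a non-empty string");
    }
  }
  const context = input.context;
  if (!isRecord(context) || typeof context.plan !== "string" || typeof context.channel !== "string") {
    errors.push("context.plan and context.channel must be strings");
  }
  return errors;
}

const sampleEvent: TrackedEvent = {
  event: "Feature Used",
  user_id: "u_1",
  account_id: "acc_1",
  context: { plan: "Pro", channel: "web" },
};
const sampleErrors = validateEvent(sampleEvent);
```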

Playbook

Building and maintaining the PostHog Mini CDP involved a systematic playbook:

  1. Define tracking plan with owners, schema, and property validations (dbt + scripts). This ensures data quality at the source and clear accountability.
  2. Enable Kafka destination in PostHog; configure topics per product surface (web, app, billing, support). This establishes the real-time event stream.
  3. Build consumer (Cloud Run microservice) that:
    • Validates payloads against JSON schema, ensuring data integrity.
    • Enriches with CRM data (Attio API) for a unified customer view.
    • Masks PII (hash emails, drop sensitive fields) for privacy compliance.
    • Writes normalized events to Firestore (for real-time queries) and Supabase (for historical analytics and consent).
  4. Sync CRM attributes daily via PostHog’s /api/person/ to keep flags and cohorts aware of sales state, ensuring up-to-date segmentation.
  5. Audience builder microservice reads Firestore, applies filters (SQL-like DSL defined in YAML), and caches results for low-latency access.
  6. Activation connectors:
    • Resend: daily job fetches activation_ready == true audiences from Firestore and sends personalized email sequences.
    • Meta Marketing API: hashed audiences exported via Cloud Run, with TTL (Time To Live) to comply with platform policies.
    • PostHog cohorts: automatically update using computed properties; reused in feature flags and experiments for in-product personalization.
  7. QA + monitoring:
    • Weekly script generates synthetic users to ensure flows update the right traits/cohorts, validating audience accuracy.
    • Grafana monitors Kafka lag, consumer failures, and audience sizes versus expected, providing real-time pipeline health.
    • Alerts route to #cdp-ops Slack for immediate notification of issues.
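
The weekly QA step can be sketched as a comparison between expected and actual audience membership for synthetic users. Everything here is hypothetical: the `SyntheticCase` shape, the `runAudienceQa` name, and the in-memory lookup that stands in for a call to the audience service.

```typescript
interface SyntheticCase { userId: string; audience: string; expectedMember: boolean }
type MembershipLookup = (userId: string, audience: string) => boolean;
interface QaReport { total: number; failures: SyntheticCase[]; passRate: number }

// Compares expected membership with what the lookup reports for each synthetic user.
function runAudienceQa(cases: SyntheticCase[], isMember: MembershipLookup): QaReport {
  const failures = cases.filter((c) => isMember(c.userId, c.audience) !== c.expectedMember);
  const passRate = cases.length === 0 ? 1 : (cases.length - failures.length) / cases.length;
  return { total: cases.length, failures, passRate };
}

// Usage: a fake membership table stands in for the real audience lookup.
const fakeMembership: Record<string, string[]> = { upsell_ready: ["synthetic_1"] };
const lookup: MembershipLookup = (userId, audience) =>
  (fakeMembership[audience] || []).indexOf(userId) !== -1;

const report = runAudienceQa(
  [
    { userId: "synthetic_1", audience: "upsell_ready", expectedMember: true },
    { userId: "synthetic_2", audience: "upsell_ready", expectedMember: false },
    { userId: "synthetic_3", audience: "churn_risk", expectedMember: true }, // fails on purpose
  ],
  lookup
);
```

Returning the failing cases rather than a bare pass/fail makes the Slack alert actionable: it can name the synthetic user and the audience that drifted.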

Key Principles of a Mini CDP

  • Event-driven architecture: All customer interactions are captured as events and streamed through a central, scalable pipeline.
  • Single source of truth for customer data: Unify event data and user traits into a consistent, accessible profile.
  • Real-time audience segmentation: Build dynamic cohorts based on behavior and traits, available for immediate activation.
  • Privacy and security by design: Implement PII tokenization, granular access controls, and consent management from the outset.
  • Modular and extensible: Design the system with interchangeable components, allowing for easy integration with new activation channels or data sources.
  • Cost-effective activation: Leverage existing tools and open-source components to achieve CDP capabilities without prohibitive licensing fees.
  • Automated QA and observability: Implement continuous monitoring and testing to ensure data quality, pipeline health, and audience accuracy.

Common Failure Modes (and Fixes)

  1. Kafka Schema Drift:
    • Problem: Engineering teams add new event fields or modify existing ones without updating the Kafka schema or consumer logic, leading to data parsing errors and pipeline failures.
    • Fix: Implement schema registry (e.g., Confluent Schema Registry) and enforce schema validation in CI/CD. Block PRs that introduce schema changes without corresponding consumer updates. Use automated tests to validate event payloads against expected schemas.
  2. Audience Freshness Spikes:
    • Problem: During peak traffic or large data backfills, Firestore caches can become stale, leading to delayed audience activation or incorrect segmentation.
    • Fix: Implement change streams from Kafka to Firestore for real-time updates. Add TTL (Time To Live) to Firestore documents to ensure regular refreshes. Monitor audience freshness metrics in Grafana and set alerts for deviations.
  3. PII Compliance Issues:
    • Problem: Mishandling of Personally Identifiable Information (PII) can lead to privacy breaches, regulatory fines, and loss of customer trust.
    • Fix: Implement PII tokenization before data enters Kafka (e.g., hash emails with SHA256 + salt). Use Supabase Row Level Security (RLS) to restrict access to raw PII. Integrate consent management (e.g., OneTrust) to enforce channel opt-outs automatically.
  4. Team Adoption and Usability:
    • Problem: Marketing and growth teams may find the technical stack (Kafka, Firestore, YAML DSL) too complex, leading to low adoption or reliance on engineering for every audience change.
    • Fix: Build a user-friendly interface (e.g., a Notion front-end that writes to Git via GitHub Actions) for defining audiences. Provide clear documentation and training. Emphasize the benefits of self-service audience building.
  5. Pipeline Uptime and Reliability:
    • Problem: The distributed nature of the mini CDP (PostHog, Kafka, Cloud Run, Firestore) can lead to complex debugging and potential downtime if not properly monitored.
    • Fix: Implement end-to-end observability with Grafana dashboards monitoring Kafka lag, consumer errors, and microservice health. Set up PagerDuty alerts for critical failures. Develop clear runbooks for common incident types.
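
For the schema drift case, a CI check can be as simple as diffing a sample payload against the fields the consumer knows about. This is a hand-rolled sketch, not the Confluent Schema Registry API; `diffAgainstSchema` and the field lists are illustrative assumptions.

```typescript
interface SchemaDiff { missing: string[]; unexpected: string[] }

// Reports required fields the payload lacks and fields the schema does not know about.
function diffAgainstSchema(
  requiredFields: string[],
  allowedFields: string[],
  payload: Record<string, unknown>
): SchemaDiff {
  const present = Object.keys(payload);
  return {
    missing: requiredFields.filter((field) => present.indexOf(field) === -1),
    unexpected: present.filter((field) => allowedFields.indexOf(field) === -1),
  };
}

// Usage: a payload that dropped account_id and added an undeclared referrer field.
const required = ["event", "user_id", "account_id"];
const allowed = required.concat(["plan", "channel"]);
const drift = diffAgainstSchema(required, allowed, {
  event: "Feature Used",
  user_id: "u_1",
  plan: "Pro",
  referrer: "newsletter",
});
const shouldBlockMerge = drift.missing.length > 0 || drift.unexpected.length > 0;
```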

Audience DSL

Our Audience Domain Specific Language (DSL) is defined in YAML, making it human-readable and version-controlled.

Example audience definition stored in Git:

audience: upsell_ready
description: "Customers nearing plan limits who engaged with AI module"
filters:
  - type: property
    field: usage_pct
    operator: ">="
    value: 0.8
  - type: event_count
    event: ai_module_used
    window_days: 14
    operator: ">="
    value: 3
  - type: property
    field: plan
    operator: "not_in"
    value: ["Enterprise"]
activation:
  - channel: resend
    template: upsell-ai
  - channel: meta
    campaign: AI-Upsell-Lookalike
expiry_days: 30
owner: "@marina"

The microservice interprets this YAML, builds Firestore queries, and registers the audience in PostHog for reuse, ensuring consistency and automation.
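
To show how those filters could be interpreted, here is a small in-memory evaluator. It is a sketch under assumptions: the real service builds Firestore queries rather than looping over profiles, and the `Profile` shape, the operator set, and the function names are hypothetical. The definition is written as an already-parsed object, so no YAML parser is involved.

```typescript
type Operator = ">=" | ">" | "==" | "in" | "not_in";
type FilterValue = number | string | string[];

interface PropertyFilter { type: "property"; field: string; operator: Operator; value: FilterValue }
interface EventCountFilter { type: "event_count"; event: string; window_days: number; operator: Operator; value: number }
type AudienceFilter = PropertyFilter | EventCountFilter;

// Assumed profile shape: flat traits plus recent events with their age in days.
interface Profile {
  traits: Record<string, number | string>;
  events: { name: string; daysAgo: number }[];
}

function compare(actual: number | string | undefined, operator: Operator, expected: FilterValue): boolean {
  if (actual === undefined) return false; // a missing trait never matches
  switch (operator) {
    case ">=":
      return typeof actual === "number" && typeof expected === "number" && actual >= expected;
    case ">":
      return typeof actual === "number" && typeof expected === "number" && actual > expected;
    case "==":
      return actual === expected;
    case "in":
      return typeof actual === "string" && Array.isArray(expected) && expected.indexOf(actual) !== -1;
    case "not_in":
      return typeof actual === "string" && Array.isArray(expected) && expected.indexOf(actual) === -1;
    default:
      return false;
  }
}

// A profile belongs to the audience only if every filter matches (filters are ANDed).
function matchesAudience(profile: Profile, filters: AudienceFilter[]): boolean {
  return filters.every((filter) => {
    if (filter.type === "property") {
      return compare(profile.traits[filter.field], filter.operator, filter.value);
    }
    const eventName = filter.event;
    const windowDays = filter.window_days;
    const count = profile.events.filter((e) => e.name === eventName && e.daysAgo <= windowDays).length;
    return compare(count, filter.operator, filter.value);
  });
}

// The upsell_ready filters from the YAML above, as a parsed object.
const upsellReady: AudienceFilter[] = [
  { type: "property", field: "usage_pct", operator: ">=", value: 0.8 },
  { type: "event_count", event: "ai_module_used", window_days: 14, operator: ">=", value: 3 },
  { type: "property", field: "plan", operator: "not_in", value: ["Enterprise"] },
];

const candidate: Profile = {
  traits: { usage_pct: 0.85, plan: "Pro" },
  events: [
    { name: "ai_module_used", daysAgo: 1 },
    { name: "ai_module_used", daysAgo: 5 },
    { name: "ai_module_used", daysAgo: 13 },
    { name: "ai_module_used", daysAgo: 40 }, // outside the 14-day window
  ],
};
const isUpsellReady = matchesAudience(candidate, upsellReady);
```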

Audience Examples

The flexibility of our Audience DSL allows for the creation of diverse and highly targeted segments:

| Audience | Logic | Destination |
|----------|-------|-------------|
| Trial Engaged | event:feature_used >= 3 AND days_since_signup < 10 | Product nudges + CS follow-up |
| Churn Risk | no_login_14d AND plan = Pro AND support_tickets >= 2 | CSM Slack alert + email |
| Upsell Ready | usage_pct > 80% OR new_feature_flag = true | Meta lookalike, account-based ads |
| Do Not Target | arr >= 50k AND deal_stage = Commit | Ads exclusion, email suppression |

All logic lives in Git (YAML DSL) so we can code review changes, ensuring accuracy and preventing unintended segmentation.

Security & Privacy

Security and privacy are paramount when dealing with customer data.

  • PII tokenization before Kafka: Emails and other sensitive PII are hashed with SHA256 + salt before entering the Kafka stream. This ensures raw PII never resides in the streaming pipeline.
  • Supabase row-level security: Supabase RLS ensures only specific, authorized services can access raw data, and only for the data they are permitted to see.
  • Consent states synced: Consent states from our CMP (e.g., OneTrust) are synced to Supabase and then to Firestore to automatically enforce channel opt-outs, ensuring compliance with privacy regulations.
  • Audit logs: Comprehensive audit logs are stored for every activation job (audience name, destination, timestamp, payload count), providing a clear trail for compliance and debugging.
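
A minimal sketch of the first and third points, using Node's built-in `crypto` module. The article does not say how the salt is combined with the email, so the prefix-and-colon scheme here is an assumption, as are the `Channel` values and the `ConsentState` shape.

```typescript
import { createHash } from "node:crypto";

// Normalizes the email, then hashes it with a salt (the salt placement is an assumption).
function tokenizeEmail(email: string, salt: string): string {
  const normalized = email.trim().toLowerCase();
  return createHash("sha256").update(salt + ":" + normalized).digest("hex");
}

type Channel = "email" | "ads" | "in_app";
interface ConsentState { userId: string; optedOut: Channel[] }

// Removes users who opted out of the given channel.
// Users without a consent record pass through; a strict opt-in policy would invert this.
function filterByConsent(userIds: string[], consents: ConsentState[], channel: Channel): string[] {
  const blocked = consents
    .filter((c) => c.optedOut.indexOf(channel) !== -1)
    .map((c) => c.userId);
  return userIds.filter((id) => blocked.indexOf(id) === -1);
}

// Usage with made-up values.
const token = tokenizeEmail("  Ana@Example.com ", "demo-salt");
const emailAudience = filterByConsent(
  ["u_1", "u_2", "u_3"],
  [{ userId: "u_2", optedOut: ["email"] }],
  "email"
);
```

Normalizing before hashing matters: without it, the same address typed with different casing or stray whitespace would produce different tokens and split one person into several profiles.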

Metrics & Telemetry

The PostHog Mini CDP has delivered significant improvements across several key metrics:

  • Contextual email CTR: +46%, demonstrating the power of personalized, behavior-driven email campaigns.
  • Paid CAC by excluding active users: -28%, achieved by efficiently suppressing active users from retargeting campaigns, optimizing ad spend.
  • Feature adoption on flagged cohorts: +19%, indicating that in-product nudges and experiments driven by the mini CDP are effectively guiding users.
  • Audience freshness: 5 minutes p95 (from event to Firestore), ensuring that our audience segments are nearly real-time.
  • Pipeline uptime: 99.4% (tracked via Grafana), reflecting the reliability of our distributed architecture.
  • QA pass rate: 100% for the last nine weekly runs, validating the accuracy of our audience definitions and activation logic.
  • Cost savings: ~$60k annually compared to commercial CDPs.
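
For reference, a p95 figure like the freshness metric can be computed from event-to-store latencies with the nearest-rank method. This is a generic sketch rather than the Grafana query behind the number above, and the latency samples are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up latencies in seconds from event timestamp to Firestore write.
const sampleLatenciesSec = [40, 55, 62, 70, 85, 90, 120, 150, 210, 290];
const p95 = percentile(sampleLatenciesSec, 95);
const withinFiveMinutes = p95 <= 300;
```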

Incident Response

A clear incident response plan is crucial for maintaining the reliability of our mini CDP:

  • If Kafka lag > 5 minutes or consumer errors spike, PagerDuty alerts the on-call data engineer.
  • The runbook says to pause activation jobs, replay Kafka offsets, and backfill Firestore from the Supabase history table.
  • After resolution, we document the root cause (usually schema mismatch) and open PRs to update validation rules or consumer logic.

Change Management

Effective change management ensures the mini CDP evolves with our business needs and maintains high adoption:

  • Weekly audience review: Marketing and growth teams triage new audience requests, review performance, and retire unused ones, ensuring the audience library remains relevant.
  • Monthly privacy check: Legal reviews hashed exports and consent logs to ensure ongoing compliance with privacy regulations.
  • Versioned releases: Each audience update merges via PR; a changelog posts to #cdp-ops automatically, providing transparency and auditability.

Cost Breakdown

The PostHog Mini CDP offers significant cost savings compared to commercial CDPs while providing comparable functionality.

  • PostHog Cloud Scale + additional events: ~$400/month (covers event ingestion, cohorting, and feature flags).
  • Kafka (Confluent Cloud) with modest throughput: ~$120/month (for scalable event streaming).
  • Cloud Run + Firestore + Supabase: ~$180/month combined (for microservices, real-time audience storage, and historical profiles).
  • Resend + Meta Marketing API usage: Variable, ~$90/month average (for email and paid media activation).
  • n8n (self-hosted on Fly.io): ~$20/month (for workflow orchestration and QA jobs).

Total incremental cost is roughly $810/month (about $9,700 a year), a saving of roughly $60,000 annually compared to commercial CDP solutions. This cost-effectiveness allows us to invest more in other growth initiatives.
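
The arithmetic behind the total, using the approximate monthly figures from the list above (the variable names are just for illustration):

```typescript
// Monthly line items as quoted in the cost breakdown (USD, approximate).
const monthlyCosts = [
  { item: "PostHog Cloud", usd: 400 },
  { item: "Kafka (Confluent Cloud)", usd: 120 },
  { item: "Cloud Run + Firestore + Supabase", usd: 180 },
  { item: "Resend + Meta Marketing API", usd: 90 },
  { item: "n8n on Fly.io", usd: 20 },
];

const monthlyTotal = monthlyCosts.reduce((sum, line) => sum + line.usd, 0);
const annualTotal = monthlyTotal * 12;

console.log(monthlyTotal, annualTotal); // 810 9720
```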

FAQ

Q: How does this mini CDP compare to a full-fledged commercial CDP?
A: Our mini CDP provides core CDP functionalities like unified customer profiles, real-time segmentation, and multi-channel activation at a fraction of the cost. While it might lack some advanced features of enterprise CDPs (e.g., complex identity resolution across many disparate systems), it's highly customized to our specific needs and offers greater flexibility.

Q: What if we need to integrate a new activation channel?
A: The modular architecture makes it straightforward to add new activation channels. We would develop a new Cloud Run microservice that consumes from our Audience API and integrates with the new channel's API (e.g., a new SMS provider or push notification service).

Q: How do you ensure data quality and schema consistency across the pipeline?
A: We enforce a strict tracking plan with schema validation at the PostHog ingestion layer. Our Kafka consumers perform additional schema validation against JSON schemas. Automated QA scripts run weekly to verify audience accuracy, and Grafana monitors pipeline health for any anomalies.

Q: Is it difficult for non-technical users (e.g., marketers) to build audiences?
A: While the underlying audience DSL is YAML-based, we've built a Notion front-end that writes to Git via GitHub Actions, so marketers can define audiences without writing code. This lets them self-serve their segmentation needs.

Q: How do you handle data privacy regulations like GDPR or CCPA?
A: Privacy is built-in. We tokenize PII before it enters Kafka, use Supabase RLS for granular access control, and sync consent states from our CMP to automatically enforce opt-outs across all activation channels. All activation jobs are auditable.

Roadblocks & Fixes

  • Kafka schema drift: Initial deployments failed when engineering added new event fields without updating the Kafka schema or consumer logic. Fixed by checking PRs against JSON schema and blocking merges when validation fails.
  • Audience freshness spikes: Firestore caches grew stale during big launches. Added change streams from Kafka to Firestore + TTL to automatically refresh when new events arrive.
  • PII compliance: Hashed contact IDs with rotating salts stored in Vault; nightly job rehashes exports for ad platforms to ensure compliance.
  • Team adoption: Marketers feared YAML. Built a Notion front-end that writes to the repo via GitHub Actions so they can request audiences without touching code.

Lessons Learned

  • Without governance PostHog becomes a junk drawer. A clear tracking plan and schema validation are crucial for data quality.
  • Tokenize PII before sending it anywhere; privacy isn’t optional. Proactive privacy measures are essential for compliance and trust.
  • Keep audiences declarative and versioned; marketing loves Git history. Treating audience definitions as code enables collaboration and auditability.
  • Monitoring is mandatory; treat the pipeline like production infra. Comprehensive observability ensures reliability and rapid incident response.

What I'm building next

I'll open-source part of the pipeline (Kafka consumer, audience DSL, QA scripts). This will allow other teams to build their own cost-effective mini CDPs. I'm also exploring integrating AI-powered anomaly detection directly into the Kafka consumers to identify data quality issues even earlier in the pipeline. Want the repo link when it's ready? Let me know.


Want me to help you replicate this module? Drop me a note and we’ll build it together.
