Change Data Capture for Self-Hosted Supabase: Build Real-Time Data Pipelines

Learn how to set up CDC pipelines for self-hosted Supabase using logical replication, Supabase ETL, and Debezium.


When you self-host Supabase, you gain full control over your data, but that control comes with responsibility. One critical capability that teams often overlook until they need it is Change Data Capture (CDC): the ability to stream database changes in real time to external systems like analytics warehouses, search engines, or event buses. Unlike Supabase Cloud, where replication features are managed for you, self-hosted deployments require manual configuration to enable CDC pipelines.

This guide walks you through setting up CDC for your self-hosted Supabase instance, covering PostgreSQL logical replication fundamentals, Supabase ETL, and integration with tools like Debezium and Kafka.

Why CDC Matters for Self-Hosted Supabase

Traditional data synchronization relies on periodic batch exports or API polling—approaches that introduce latency and place unnecessary load on your production database. CDC takes a fundamentally different approach by reading directly from PostgreSQL's Write-Ahead Log (WAL), capturing every INSERT, UPDATE, and DELETE as it happens.

For self-hosted Supabase teams, CDC enables several important workflows:

  • Analytics synchronization: Stream operational data to BigQuery, Snowflake, or data lakes without impacting production performance
  • Search index updates: Keep Elasticsearch or Meilisearch indices synchronized in real time
  • Event-driven architectures: Trigger downstream microservices when specific database changes occur
  • Audit logging: Capture complete change history for compliance requirements
  • Cross-region replication: Maintain read replicas or disaster recovery instances

The challenge? Self-hosted Supabase doesn't include managed CDC out of the box. You need to configure it yourself.

Understanding PostgreSQL Logical Replication

Before diving into implementation, it's worth understanding how PostgreSQL CDC actually works. Every change to your database is first written to the WAL—a sequential log that ensures durability. Logical replication decodes these WAL entries into a format that external systems can consume.

Three components make this work:

  1. Publications: Define which tables should have their changes captured
  2. Replication slots: Track which WAL entries have been consumed, preventing PostgreSQL from deleting unread segments
  3. Output plugins: Convert binary WAL data into readable formats (pgoutput, wal2json)
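
To make this concrete, here's a minimal Python sketch of what a consumer receives when using the wal2json output plugin: each decoded WAL message is a JSON document with a "change" array. The payload shape follows wal2json's output format, but the sample message itself is fabricated for illustration.

```python
import json

# Illustrative wal2json-style message (not captured from a real server)
SAMPLE = json.dumps({
    "change": [
        {
            "kind": "insert",
            "schema": "public",
            "table": "users",
            "columnnames": ["id", "email"],
            "columnvalues": [1, "ada@example.com"],
        }
    ]
})

def parse_changes(payload: str):
    """Flatten a wal2json message into (operation, table, row) tuples."""
    events = []
    for change in json.loads(payload).get("change", []):
        # Column names and values arrive as parallel arrays; zip them into a row dict
        row = dict(zip(change.get("columnnames", []),
                       change.get("columnvalues", [])))
        table = f'{change["schema"]}.{change["table"]}'
        events.append((change["kind"], table, row))
    return events
```

In a real pipeline this parsing step sits between the replication connection and whatever destination you write to.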

Here's how to enable logical replication on your self-hosted Supabase PostgreSQL instance:

-- Check current wal_level (must be 'logical' for CDC)
SHOW wal_level;

-- If not 'logical', change it and restart PostgreSQL
ALTER SYSTEM SET wal_level = 'logical';

-- Create a publication for specific tables
CREATE PUBLICATION my_cdc_publication 
FOR TABLE users, orders, products;

-- Or capture all tables
CREATE PUBLICATION all_changes FOR ALL TABLES;

After changing wal_level, you'll need to restart PostgreSQL. For Docker-based deployments, this means restarting the supabase-db container.

Setting Up Supabase ETL for BigQuery

Supabase released their ETL framework—a Rust-based CDC pipeline that streams changes to analytical destinations. It's particularly well-suited for self-hosted deployments because it's lightweight and designed specifically for the Supabase architecture.

Prerequisites

Before configuring Supabase ETL, ensure your self-hosted instance meets these requirements:

  • PostgreSQL 14 or newer (15+ recommended for advanced features)
  • wal_level set to logical
  • Sufficient disk space for WAL retention
  • Network connectivity to your destination (BigQuery, S3, etc.)
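
One way to codify this checklist is a small preflight helper. This is a hypothetical sketch: you would fetch the actual values yourself (e.g. via SHOW server_version and SHOW wal_level) and pass them in; the 20 GB disk threshold is an arbitrary placeholder.

```python
def check_cdc_prereqs(server_version: str, wal_level: str,
                      free_disk_gb: float, min_disk_gb: float = 20.0):
    """Return a list of problems; an empty list means the instance looks CDC-ready."""
    problems = []
    # server_version looks like "15.4"; the major version is the part before the dot
    major = int(server_version.split(".")[0])
    if major < 14:
        problems.append(f"PostgreSQL {major} is too old; 14+ required")
    if wal_level != "logical":
        problems.append(f"wal_level is '{wal_level}', must be 'logical'")
    if free_disk_gb < min_disk_gb:
        problems.append(f"only {free_disk_gb} GB free; WAL retention needs headroom")
    return problems
```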

Configuration Steps

First, add the ETL service to your deployment. If you're using Docker Compose for production, add the ETL container:

etl:
  image: supabase/etl:latest
  environment:
    - DATABASE_URL=postgres://postgres:${POSTGRES_PASSWORD}@db:5432/postgres
    - DESTINATION_TYPE=bigquery
    - BIGQUERY_PROJECT_ID=your-project-id
    - BIGQUERY_DATASET=supabase_replica
    - BIGQUERY_CREDENTIALS_PATH=/secrets/gcp-key.json
  volumes:
    - ./secrets/gcp-key.json:/secrets/gcp-key.json:ro
  depends_on:
    - db

Then create the replication slot and publication:

-- Create a dedicated replication slot for ETL
SELECT pg_create_logical_replication_slot(
  'supabase_etl_slot', 
  'pgoutput'
);

-- Create publication for tables you want to replicate
CREATE PUBLICATION etl_publication 
FOR TABLE public.users, public.orders
WITH (publish = 'insert, update, delete, truncate');

Monitoring Replication Lag

One critical aspect of CDC that catches teams off guard: replication slots can cause disk bloat if consumers fall behind. PostgreSQL retains WAL segments until the slot confirms they've been processed. Monitor this carefully:

-- Check replication slot status
SELECT 
  slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag,
  active
FROM pg_replication_slots;

-- Set a maximum WAL retention size (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

If your ETL consumer goes offline for extended periods, this setting prevents runaway disk usage—though it means you might miss some changes if the slot's WAL is truncated.
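
If you'd rather alert on lag from application code, the arithmetic behind pg_wal_lsn_diff is straightforward: a pg_lsn value like 16/B374D848 encodes a 64-bit byte position, with the hex digits before the slash forming the high 32 bits. A small sketch:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the slot is holding back, like pg_wal_lsn_diff(current, restart)."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
```

You could feed this values from pg_current_wal_lsn() and pg_replication_slots.restart_lsn and page an operator when the result crosses a threshold.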

Alternative: Debezium for Kafka Integration

While Supabase ETL works well for direct warehouse integration, many teams prefer Debezium for event-driven architectures. Debezium reads PostgreSQL's WAL and publishes changes to Apache Kafka, enabling complex downstream processing.

Debezium Configuration for Self-Hosted Supabase

Add Debezium to your stack with Kafka Connect:

connect:
  image: debezium/connect:2.5
  ports:
    - "8083:8083"
  environment:
    - BOOTSTRAP_SERVERS=kafka:9092
    - GROUP_ID=1
    - CONFIG_STORAGE_TOPIC=connect_configs
    - OFFSET_STORAGE_TOPIC=connect_offsets
    - STATUS_STORAGE_TOPIC=connect_statuses
  depends_on:
    - kafka
    - db

Configure the PostgreSQL connector:

{
  "name": "supabase-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "${POSTGRES_PASSWORD}",
    "database.dbname": "postgres",
    "plugin.name": "pgoutput",
    "publication.name": "debezium_publication",
    "slot.name": "debezium_slot",
    "topic.prefix": "supabase",
    "table.include.list": "public.users,public.orders",
    "heartbeat.interval.ms": "10000"
  }
}

The heartbeat.interval.ms setting is particularly important for self-hosted deployments: it prompts Debezium to emit periodic heartbeat events and commit its position, which helps the replication slot advance even during quiet periods, preventing unnecessary WAL retention. On databases with very little write traffic, pairing it with heartbeat.action.query (a small periodic write) is often recommended as well, since the slot can only advance past WAL the connector actually sees.
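
To load a connector definition like the one above into Kafka Connect, you POST it to the Connect REST API (exposed on port 8083 in the Compose snippet above). A minimal standard-library sketch; the endpoint URL is an assumption to adjust for your network:

```python
import json
from urllib import request

CONNECT_URL = "http://localhost:8083"  # Kafka Connect REST endpoint (assumed)

def build_connector_request(name: str, config: dict) -> bytes:
    """Serialize a connector definition in the shape POST /connectors expects."""
    return json.dumps({"name": name, "config": config}).encode()

def register_connector(name: str, config: dict):
    """POST the connector to Kafka Connect; call only with Connect running."""
    req = request.Request(
        f"{CONNECT_URL}/connectors",
        data=build_connector_request(name, config),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return request.urlopen(req)
```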

Handling Schema Changes

One limitation of CDC that requires careful planning: schema changes (DDL) aren't automatically replicated. When you add columns or modify tables, your downstream systems need to adapt.

For Supabase ETL, DDL support is currently in development. In the meantime, implement a migration workflow:

  1. Apply schema changes to your Supabase database
  2. Update downstream schemas (BigQuery, etc.) to match
  3. Verify that the ETL pipeline continues without interruption

For Debezium, you can capture DDL events by enabling include.schema.changes:

{
  "include.schema.changes": "true",
  "schema.history.internal.kafka.topic": "schema-changes"
}

This publishes schema changes to a dedicated Kafka topic, allowing downstream consumers to react programmatically.
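
Downstream, a consumer can pull the DDL statement out of each schema-change event and decide how to react. The sketch below assumes the top-level ddl field that Debezium's schema-change messages carry; the sample payload itself is fabricated:

```python
import json

def extract_ddl(message_value: str):
    """Return the DDL statement from a schema-change event, or None if absent."""
    event = json.loads(message_value)
    return event.get("ddl")

# Fabricated example of a schema-change event value
sample = json.dumps({
    "databaseName": "postgres",
    "ddl": "ALTER TABLE public.users ADD COLUMN last_login timestamptz",
    "tableChanges": [],
})
```

A real consumer would read message_value from the schema-changes Kafka topic and, for example, replay the statement against a staging copy of the downstream schema.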

Performance Considerations

CDC adds minimal overhead to your Supabase database—it reads from WAL rather than querying tables directly. However, there are several configuration decisions that impact performance:

Batch sizing: Both Supabase ETL and Debezium support batching changes before sending to destinations. Larger batches improve throughput but increase latency:

# Supabase ETL
BATCH_WAIT_MS=1000
BATCH_SIZE_ROWS=10000
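
The interplay of these two knobs can be sketched as a batcher that flushes on whichever limit is hit first. This is a hypothetical illustration of the trade-off, not Supabase ETL's actual implementation:

```python
import time

class Batcher:
    """Flush accumulated changes when either the row limit or the wait
    window is reached, mirroring BATCH_SIZE_ROWS / BATCH_WAIT_MS."""

    def __init__(self, max_rows=10_000, max_wait_ms=1_000, clock=time.monotonic):
        self.max_rows = max_rows
        self.max_wait = max_wait_ms / 1000.0
        self.clock = clock  # injectable for testing
        self.rows = []
        self.started = None

    def add(self, row):
        """Buffer a row; return the flushed batch if a limit was hit, else None."""
        if self.started is None:
            self.started = self.clock()
        self.rows.append(row)
        if len(self.rows) >= self.max_rows or \
           self.clock() - self.started >= self.max_wait:
            return self.flush()
        return None

    def flush(self):
        batch, self.rows, self.started = self.rows, [], None
        return batch
```

Larger max_rows amortizes per-request overhead at the destination; a shorter wait window bounds how stale the destination can get.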

Filtering: Don't replicate tables you don't need. Every replicated table adds to WAL decoding overhead:

-- Only publish specific tables
CREATE PUBLICATION analytics_pub 
FOR TABLE orders, products, revenue_events;

-- Filter specific columns (PostgreSQL 15+)
CREATE PUBLICATION user_pub 
FOR TABLE users (id, email, created_at);

-- Filter specific rows (PostgreSQL 15+)
CREATE PUBLICATION active_users_pub 
FOR TABLE users WHERE (status = 'active');

Connection pooling: CDC consumers maintain dedicated connections for replication. Ensure your self-hosted Supabase has sufficient max_replication_slots and max_wal_senders:

-- Check current settings
SHOW max_replication_slots;
SHOW max_wal_senders;

-- Increase if needed (requires restart)
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;

Simplifying CDC with Supascale

Setting up CDC pipelines manually requires deep PostgreSQL knowledge and ongoing maintenance. Supascale simplifies self-hosted Supabase management by providing a unified control plane for your instances.

While CDC configuration still requires PostgreSQL expertise, Supascale's monitoring capabilities help you track replication health across all your self-hosted projects. Combined with automated backups, you get a complete operational picture of your Supabase deployment.

For teams that need analytics integration but want to minimize operational complexity, consider whether your use case truly requires real-time CDC or if scheduled pg_dump exports to your data warehouse might suffice. The true cost of self-hosting includes not just infrastructure but the engineering time to maintain these pipelines.

Conclusion

Change Data Capture transforms your self-hosted Supabase database from an isolated data store into a real-time event source. Whether you're building analytics pipelines with Supabase ETL, event-driven architectures with Debezium, or custom integrations using raw logical replication, the foundation remains the same: PostgreSQL's battle-tested WAL-based replication.

The key decisions for your implementation:

  • Supabase ETL for direct BigQuery/data lake integration with minimal setup
  • Debezium + Kafka for complex event routing and multiple consumers
  • Raw logical replication for custom consumers or simpler use cases

Start with a single table, monitor replication lag carefully, and scale up as you validate the pipeline's reliability. CDC is powerful, but it's also another system to maintain—make sure the value justifies the operational investment.
