Handling Refresh Failures with PL/pgSQL Triggers

When a continuous aggregate background job fails, the fix is to capture the failure into a tracking table and let a BEFORE INSERT OR UPDATE PL/pgSQL trigger drive a deterministic retry state machine with exponential backoff — so stale rollups self-heal instead of silently accumulating lag.

TimescaleDB materializes continuous aggregates through scheduled background jobs, but those jobs are not immune to transient failures. Network partitions, lock contention with an ingestion path, out-of-memory conditions, or a malformed materialization window can all interrupt an incremental refresh. When one fails, downstream IoT dashboards, alerting pipelines, and retention sweeps operate on incomplete data. TimescaleDB’s scheduler will retry a failed job, but its default policy is coarse: it does not classify errors, cap attempts per materialization window, or coordinate with your data retention automation. This guide builds a custom failure-handling layer that does. It is part of the broader error handling and retry mechanisms for the continuous aggregate lifecycle, and it assumes you already run a continuous aggregate refresh policy in production.

Failure-record state machine enforced by the BEFORE INSERT OR UPDATE trigger and the retry worker.

Input Profiling: What to Gather Before You Build

The retry policy is only as good as the signals you feed it. Before writing a single trigger, profile the failure surface of your refresh jobs:

Baseline job identity — the job_id and hypertable_name of every continuous aggregate policy, from timescaledb_information.jobs.
Historical failure rate — how often each job has failed, and with which SQLSTATE, from timescaledb_information.job_errors. A job that fails on lock timeout needs a different backoff curve than one failing on OOM.
Materialization window size — the start_offset and end_offset of each policy. These bound how much raw data a single refresh touches and therefore how long a retry runs.
Retention horizon — the drop_after interval of any retention policy on the same hypertable. Retries must complete before raw chunks are dropped, or the aggregate can never reconcile the missing window.
Acceptable staleness SLA — the maximum refresh lag the business tolerates. This sets your total retry budget: attempts multiplied by backoff must fit inside the SLA.

Capture these into a short table before you size the backoff. The retention horizon and the SLA are the two hard constraints; everything else is tuning.

Environment prerequisites

TimescaleDB 2.10+ (improved job scheduler semantics and watermark tracking)
PostgreSQL 14+
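CREATE and USAGE on the target schema, plus EXECUTE on the functions you call (PostgreSQL has no ALTER privilege; altering an object requires owning it)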
Read access to timescaledb_information.jobs, job_stats, and job_errors
An active refresh policy created with add_continuous_aggregate_policy()

The Backoff Calculation

The core of the retry mechanism is a deterministic backoff schedule. Given a base interval $b$ and the current attempt count $n$ (zero-indexed), the delay before the next retry is:

$$t_{next} = t_{failed} + b \cdot 2^{n}$$

With a base of 5 minutes, the first four delays are 5, 10, 20, and 40 minutes, each measured from the failure that precedes it. The total waiting time consumed by $N$ attempts is the geometric sum:

$$T_{budget} = \sum_{n=0}^{N-1} b \cdot 2^{n} = b \cdot (2^{N} - 1)$$

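So four attempts at a 5-minute base span $5 \cdot (2^{4}-1) = 75$ minutes of backoff before the failure is escalated. This counts only the waiting time: the worker resets failed_at each time it re-arms a row, so the run time of every failed refresh adds to the elapsed total. Size $b$ and max_retries so that $T_{budget}$ plus that run time stays comfortably inside your staleness SLA and well inside the retention horizon.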
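
The two formulas are easy to check numerically. The following is a minimal Python sketch; the names `retry_delays` and `retry_budget` are illustrative helpers, not part of TimescaleDB or of the trigger defined later in this guide.

```python
from datetime import timedelta


def retry_delays(base: timedelta, attempts: int) -> list[timedelta]:
    """Delay before each retry: base * 2**n for n = 0 .. attempts - 1."""
    return [base * (2 ** n) for n in range(attempts)]


def retry_budget(base: timedelta, attempts: int) -> timedelta:
    """Total waiting time across all attempts: base * (2**attempts - 1)."""
    return base * (2 ** attempts - 1)


base = timedelta(minutes=5)
delays = retry_delays(base, 4)
print([int(d.total_seconds() // 60) for d in delays])    # [5, 10, 20, 40]
print(int(retry_budget(base, 4).total_seconds() // 60))  # 75
```

Summing the individual delays gives the same value as the closed form, which is a quick sanity check when you change the base or the attempt count.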

TimescaleDB exposes no native AFTER REFRESH FAILURE trigger, so the production-standard pattern is to record failures into a tracking table and attach a PL/pgSQL trigger that computes $t_{next}$ on every insert or re-arm:

sql

-- Idempotent schema and tracking table creation
CREATE SCHEMA IF NOT EXISTS ca_monitor;

CREATE TABLE IF NOT EXISTS ca_monitor.refresh_failures (
    failure_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_id INTEGER NOT NULL,
    hypertable_name TEXT NOT NULL,
    continuous_agg_name TEXT NOT NULL,
    failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    error_message TEXT,
    retry_count INTEGER NOT NULL DEFAULT 0,
    max_retries INTEGER NOT NULL DEFAULT 3,
    next_retry_at TIMESTAMPTZ,
    status TEXT NOT NULL DEFAULT 'PENDING',
    CONSTRAINT chk_status CHECK (status IN ('PENDING', 'SCHEDULED_RETRY', 'RETRYING', 'MAX_RETRIES_EXCEEDED', 'RESOLVED'))
);

A lightweight scheduled job polls timescaledb_information.job_errors for recent failures and inserts a row per failed run. The BEFORE INSERT OR UPDATE trigger then intercepts each row and applies the backoff formula before the transaction commits:

sql

CREATE OR REPLACE FUNCTION ca_monitor.handle_refresh_failure()
RETURNS TRIGGER AS $$
DECLARE
    backoff_minutes INTEGER;
    backoff_interval INTERVAL;
BEGIN
    -- Only process rows in the PENDING state (fresh failure or re-armed retry)
    IF NEW.status = 'PENDING' THEN
        IF NEW.retry_count < NEW.max_retries THEN
            -- Exponential backoff: base 5 minutes * 2^retry_count
            backoff_minutes := POWER(2, NEW.retry_count)::INTEGER * 5;
            backoff_interval := (backoff_minutes || ' minutes')::INTERVAL;

            NEW.next_retry_at := NEW.failed_at + backoff_interval;
            NEW.retry_count := NEW.retry_count + 1;
            NEW.status := 'SCHEDULED_RETRY';
        ELSE
            NEW.status := 'MAX_RETRIES_EXCEEDED';
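            -- Emit an async notification for external alerting systems.
            -- The payload is kept compact: pg_notify rejects payloads of 8000
            -- bytes or more, and error_message can be arbitrarily long.
            PERFORM pg_notify(
                'ca_refresh_failure_alert',
                json_build_object(
                    'failure_id', NEW.failure_id,
                    'job_id', NEW.job_id,
                    'continuous_agg_name', NEW.continuous_agg_name,
                    'retry_count', NEW.retry_count,
                    'status', NEW.status
                )::text
            );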
        END IF;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire on INSERT (initial capture) and UPDATE (re-arming after a failed retry),
-- so each transition back to 'PENDING' advances the backoff state machine.
CREATE OR REPLACE TRIGGER trg_ca_refresh_failure
BEFORE INSERT OR UPDATE ON ca_monitor.refresh_failures
FOR EACH ROW EXECUTE FUNCTION ca_monitor.handle_refresh_failure();
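
To reason about the schedule without a database, the trigger logic can be mirrored in a few lines of Python. This is only a sketch for checking the transitions: `FailureRow`, `apply_trigger`, and `rearm` are hypothetical names, and the simulation assumes every retry fails the instant it is due.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

BASE_MINUTES = 5  # same base as handle_refresh_failure()


@dataclass
class FailureRow:
    failed_at: datetime
    retry_count: int = 0
    max_retries: int = 3
    next_retry_at: Optional[datetime] = None
    status: str = "PENDING"


def apply_trigger(row: FailureRow) -> FailureRow:
    """Mirror of the trigger body: only PENDING rows are modified."""
    if row.status != "PENDING":
        return row
    if row.retry_count < row.max_retries:
        delay = timedelta(minutes=BASE_MINUTES * 2 ** row.retry_count)
        row.next_retry_at = row.failed_at + delay
        row.retry_count += 1
        row.status = "SCHEDULED_RETRY"
    else:
        row.status = "MAX_RETRIES_EXCEEDED"
    return row


def rearm(row: FailureRow, now: datetime) -> FailureRow:
    """A failed retry: reset to PENDING with a fresh failed_at, then the trigger runs."""
    row.status = "PENDING"
    row.failed_at = now
    return apply_trigger(row)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
row = apply_trigger(FailureRow(failed_at=t0))
history = [(row.status, row.retry_count)]
while row.status == "SCHEDULED_RETRY":
    row = rearm(row, now=row.next_retry_at)  # assumption: the retry fails immediately
    history.append((row.status, row.retry_count))

print(history[-1])         # ('MAX_RETRIES_EXCEEDED', 3)
print(row.failed_at - t0)  # 0:35:00
```

With the default max_retries of 3, the row passes through SCHEDULED_RETRY three times and is escalated on the fourth failure, 35 minutes after the first one under these assumptions.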

The retry executor is a Python worker that polls due rows, calls refresh_continuous_aggregate, and re-arms the row on failure by flipping its status back to PENDING — which re-fires the trigger and advances the backoff step:

python

import psycopg
from psycopg.rows import dict_row

def process_pending_retries(dsn: str) -> None:
    # refresh_continuous_aggregate() cannot run inside a transaction block,
    # so the connection runs in autocommit mode.
    with psycopg.connect(dsn, row_factory=dict_row, autocommit=True) as conn:
        with conn.cursor() as cur:
            # Fetch failures whose backoff window has elapsed.
            cur.execute("""
                SELECT failure_id, continuous_agg_name, retry_count
                FROM ca_monitor.refresh_failures
                WHERE status = 'SCHEDULED_RETRY'
                  AND next_retry_at <= NOW()
                ORDER BY next_retry_at ASC
                LIMIT 5;
            """)
            pending = cur.fetchall()

            for job in pending:
                try:
                    # Rematerialize the whole range; the view name is bound safely.
                    cur.execute(
                        "CALL refresh_continuous_aggregate(%s, NULL, NULL);",
                        (job["continuous_agg_name"],),
                    )
                    cur.execute("""
                        UPDATE ca_monitor.refresh_failures
                        SET status = 'RESOLVED'
                        WHERE failure_id = %s;
                    """, (job["failure_id"],))

                except psycopg.Error as e:
                    # Catch database errors only, so programming mistakes still surface.
                    # Re-arm: status -> PENDING fires the trigger, which increments
                    # retry_count and reschedules (eventually MAX_RETRIES_EXCEEDED).
                    cur.execute("""
                        UPDATE ca_monitor.refresh_failures
                        SET status = 'PENDING', error_message = %s, failed_at = NOW()
                        WHERE failure_id = %s;
                    """, (str(e), job["failure_id"]))

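The worker above selects due rows without marking them, so two overlapping runs could refresh the same aggregate. One possible guard is to claim rows in a single statement that moves them to RETRYING. The sketch below assumes `conn` is a psycopg 3 connection in autocommit mode using `dict_row`; `claim_due_retries` and `CLAIM_SQL` are hypothetical names.

```python
from typing import Any

# One statement: lock due rows, skip any locked by another worker,
# and flip the claimed rows to RETRYING before returning them.
CLAIM_SQL = """
WITH due AS (
    SELECT failure_id
    FROM ca_monitor.refresh_failures
    WHERE status = 'SCHEDULED_RETRY'
      AND next_retry_at <= NOW()
    ORDER BY next_retry_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
UPDATE ca_monitor.refresh_failures AS f
SET status = 'RETRYING'
FROM due
WHERE f.failure_id = due.failure_id
RETURNING f.failure_id, f.continuous_agg_name, f.retry_count;
"""


def claim_due_retries(conn: Any, limit: int = 5) -> list:
    """Atomically move up to `limit` due rows to RETRYING and return them."""
    return conn.execute(CLAIM_SQL, (limit,)).fetchall()
```

The trigger ignores the RETRYING transition because it only acts on PENDING rows. A row left in RETRYING after a worker crash would need a separate sweep, which this sketch does not cover.
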
Worked Example: A 5,000-Device Telemetry Fleet

Consider a fleet of 5,000 industrial sensors writing one row per second into iot_metrics_hypertable, feeding an iot_metrics_hourly continuous aggregate. The refresh policy uses a start_offset of 48 hours and an end_offset of 1 hour, and the business tolerates at most 2 hours of dashboard staleness.

Plug the constraints into the budget formula with a 5-minute base:

max_retries = 3 → retries are scheduled 5, 10, and 20 minutes after the first, second, and third failure.
$T_{budget} = 5 \cdot (2^{3}-1) = 35$ minutes of backoff, well inside the 2-hour SLA.
The fourth failure escalates through pg_notify for a P1 page with roughly 85 minutes of SLA headroom (120 minus 35, less the time the refresh attempts themselves consumed) for a human to intervene.
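
The same arithmetic can be run in reverse to find the largest attempt count a given SLA allows. This is a rough sizing sketch with hypothetical helper names; it counts backoff delays only and ignores how long each refresh runs.

```python
def retry_budget_minutes(base: int, attempts: int) -> int:
    """Total backoff in minutes: base * (2**attempts - 1)."""
    return base * (2 ** attempts - 1)


def max_attempts_within(base: int, limit: int) -> int:
    """Largest N such that base * (2**N - 1) <= limit (both in minutes)."""
    if base <= 0:
        raise ValueError("base must be positive")
    n = 0
    while retry_budget_minutes(base, n + 1) <= limit:
        n += 1
    return n


SLA_MINUTES = 120
BASE_MINUTES = 5

budget = retry_budget_minutes(BASE_MINUTES, 3)
print(budget)                                          # 35
print(SLA_MINUTES - budget)                            # 85
print(max_attempts_within(BASE_MINUTES, SLA_MINUTES))  # 4
```

Four attempts (75 minutes) would still fit inside the 2-hour SLA, while a fifth would push the backoff to 155 minutes. Choosing 3 trades one extra automatic retry for more time for a human to respond.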

Because the aggregate reads a 48-hour window, raw chunks must survive far longer than the retry budget. Pair the policy with a retention window that leaves ample reconciliation time before drop_chunks runs:

sql

-- Retain raw data for 90 days: far longer than the 48-hour start_offset,
-- giving the retry loop ample room to reconcile before chunks disappear.
SELECT add_retention_policy(
    'iot_metrics_hypertable',
    drop_after => INTERVAL '90 days',
    if_not_exists => TRUE
);

At 5,000 rows/second the 48-hour window covers roughly 864 million rows, so an unbounded retry that keeps calling a full refresh_continuous_aggregate would be self-defeating under sustained lock contention. This is exactly why the state machine caps attempts and escalates rather than looping forever — the tradeoffs between rematerializing the full range versus a bounded window are covered in incremental versus full refresh strategies.

Edge Cases and When to Deviate

Correlated, non-transient failures. If every attempt fails on the same SQLSTATE (e.g. a permissions error or a broken window), exponential backoff just delays the inevitable. Branch the trigger on error_message and fast-fail non-retryable classes straight to MAX_RETRIES_EXCEEDED.
Retry budget exceeds the retention horizon. If $T_{budget}$ plus the materialization lag approaches drop_after, raw chunks can vanish mid-retry. Cap max_retries or widen the retention window — see TTL policy mapping and enforcement.
Thundering herd after an outage. When many jobs fail simultaneously (a failover, a full disk), synchronized backoff makes them all retry at once. Add jitter — a random 0–30% offset on next_retry_at — to spread the load.
Long-running refreshes. The worker shown earlier never moves a row out of SCHEDULED_RETRY while its refresh is running, so if a refresh outlasts the worker's polling interval, a second pass can pick up the same row and start a duplicate refresh. Guard with a RETRYING status transition (the state is already allowed by chk_status) and a lock_timeout, and consider routing heavy jobs through the asynchronous execution queue.
Real-time aggregates masking staleness. With real-time aggregation enabled, queries stitch live raw data over the stale materialized range, so a failing refresh may go unnoticed until the retention policy drops the raw chunks behind it. Treat MAX_RETRIES_EXCEEDED as an incident regardless of what dashboards show.
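
Two of these deviations, fast-failing non-retryable errors and adding jitter, can be sketched in Python on the worker side. The set of retryable SQLSTATE codes below is an assumed policy, not something TimescaleDB prescribes, and the tracking table as defined earlier stores only error_message, so the `sqlstate` argument is assumed to come from the sqlerrcode column of timescaledb_information.job_errors or from the driver's exception.

```python
import random
from datetime import timedelta
from typing import Optional

# Assumed policy: codes that usually clear on their own.
RETRYABLE_SQLSTATES = {
    "40001",  # serialization_failure
    "40P01",  # deadlock_detected
    "55P03",  # lock_not_available (lock_timeout)
    "57014",  # query_canceled (statement_timeout)
}
RETRYABLE_CLASSES = {
    "08",  # connection exceptions
    "53",  # insufficient resources (out of memory, disk full)
}


def is_retryable(sqlstate: Optional[str]) -> bool:
    """Return False for error classes that backoff cannot fix."""
    if not sqlstate:
        return True  # unknown error: keep retrying, the attempt cap still applies
    return sqlstate in RETRYABLE_SQLSTATES or sqlstate[:2] in RETRYABLE_CLASSES


def with_jitter(
    delay: timedelta,
    max_fraction: float = 0.30,
    rng: Optional[random.Random] = None,
) -> timedelta:
    """Stretch a backoff delay by a random 0..max_fraction of its length."""
    rng = rng or random.Random()
    return delay + delay * rng.uniform(0.0, max_fraction)


print(is_retryable("55P03"))  # True
print(is_retryable("42501"))  # False (insufficient_privilege)
```

A worker could call `is_retryable` before re-arming a row and write MAX_RETRIES_EXCEEDED directly when it returns False, and apply `with_jitter` to the computed delay before storing next_retry_at.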

Production Hardening

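Idempotency guarantees. The BEFORE INSERT OR UPDATE trigger runs in the same transaction as the write that fires it, so each state transition commits or rolls back as a unit. Deduplicate concurrent failure inserts with ON CONFLICT, which requires a unique constraint that the table above does not define (for example on job_id and failed_at), or with an application-level job lock.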
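Observability integration. pg_notify is not durable: only sessions listening on ca_refresh_failure_alert when the transaction commits receive the message. Consume the channel with a small LISTEN client that feeds a Prometheus exporter or alerting pipeline, and also alert on rows with status = 'MAX_RETRIES_EXCEEDED' in the table itself. Map that status to P1 incident routing.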
Scheduler alignment. Never edit internal catalog tables directly. Use the public alter_job() function to widen refresh intervals when failure frequency rises.
Resource guardrails. Set statement_timeout and lock_timeout in the refresh wrapper to prevent runaway transactions during ingestion spikes. See the PostgreSQL manual on PL/pgSQL exception handling for structured error trapping.
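
As an illustration of the observability point, a listener for the alert channel might look like the sketch below. It assumes psycopg 3 and a JSON payload carrying failure_id, continuous_agg_name, and status; `to_alert` and `listen_forever` are hypothetical names, and the severity labels are placeholders for whatever your paging system expects.

```python
import json
from typing import Any, Callable

CHANNEL = "ca_refresh_failure_alert"


def to_alert(payload: str) -> dict:
    """Turn the trigger's JSON payload into a small alert record."""
    row = json.loads(payload)
    severity = "P1" if row.get("status") == "MAX_RETRIES_EXCEEDED" else "info"
    return {
        "severity": severity,
        "summary": (
            f"Refresh of {row.get('continuous_agg_name')} failed "
            f"(failure_id={row.get('failure_id')})"
        ),
    }


def listen_forever(dsn: str, handle: Callable[[dict], Any]) -> None:
    """Block on the channel and pass each alert to `handle`."""
    import psycopg  # imported here so to_alert() works without the driver installed

    with psycopg.connect(dsn, autocommit=True) as conn:
        conn.execute(f"LISTEN {CHANNEL}")
        for notify in conn.notifies():
            handle(to_alert(notify.payload))


sample = '{"failure_id": 7, "continuous_agg_name": "iot_metrics_hourly", "status": "MAX_RETRIES_EXCEEDED"}'
print(to_alert(sample)["severity"])  # P1
```

Because notifications are lost while no listener is connected, a process like this complements, rather than replaces, a periodic query against the tracking table.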

Verification

Confirm the state machine is behaving as designed by inspecting the tracking table alongside TimescaleDB’s own job views. First, check that no failure is stuck past its scheduled retry — a row in SCHEDULED_RETRY with next_retry_at in the past means the worker is not running:

sql

SELECT failure_id, continuous_agg_name, status, retry_count,
       next_retry_at, NOW() - next_retry_at AS overdue_by
FROM ca_monitor.refresh_failures
WHERE status = 'SCHEDULED_RETRY'
  AND next_retry_at < NOW()
ORDER BY overdue_by DESC;

Then reconcile your tracking table against the source of truth, so a job that is quietly failing in TimescaleDB but missing from ca_monitor surfaces immediately:

sql

SELECT j.job_id, j.hypertable_name,
       js.last_run_status, js.last_successful_finish,
       js.total_failures
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
  AND js.last_run_status = 'Failed'
ORDER BY js.total_failures DESC;

If total_failures climbs while ca_monitor.refresh_failures stays empty, your failure-capture poll is not keeping up — tighten its schedule before the staleness SLA is breached.

Frequently Asked Questions

Why not just rely on TimescaleDB’s built-in job retry?

The built-in scheduler retries failed jobs, but it does not classify error types, cap attempts per materialization window, emit alerts, or coordinate with retention. The custom trigger turns an opaque retry into an observable, bounded state machine you can page on.

Does the trigger add write overhead to my hypertable?

No. The trigger fires only on ca_monitor.refresh_failures, which receives one row per failed refresh — not on the telemetry hypertable itself. Ingestion is untouched.

Can I run the retry worker as a background job instead of external Python?

Yes. You can schedule the retry logic as a TimescaleDB user-defined action via add_job, but refresh_continuous_aggregate cannot run inside a transaction block, so a psycopg worker in autocommit mode remains the simplest reliable executor.
