Asynchronous Execution and Queue Management for Continuous Aggregates

TimescaleDB refreshes continuous aggregates asynchronously through a bounded pool of background workers, and when due refresh jobs outnumber the available worker slots the queue saturates and aggregate freshness silently falls behind ingestion. This guide shows time-series data engineers, IoT platform developers, and Python automation builders how the internal job scheduler dispatches refresh jobs, how to size the worker pool so jobs never starve, and how to detect and clear backpressure before it reaches a dashboard. It sits inside the broader continuous aggregate creation and refresh management lifecycle and focuses narrowly on the execution layer that turns a registered policy into materialized rollups.

The core problem is contention for a shared, fixed resource. Every refresh policy, every compression job, and every retention job on the instance draws from the same timescaledb.max_background_workers budget. Synchronous aggregation would block application connections and produce unpredictable query spikes; asynchronous execution moves that compute off the foreground path, but it introduces eventual consistency and a queue that must be sized, monitored, and tuned deliberately.

The scheduler orders due jobs by next_start and hands each to a free worker; the pool size is capped by timescaledb.max_background_workers, so surplus due jobs wait until a slot frees.

Prerequisites

Asynchronous refresh depends on the TimescaleDB job scheduler being able to launch a background worker at the moment a job’s next_start comes due. That in turn depends on PostgreSQL worker-process limits, on the extension being loaded into shared memory, and on at least one continuous aggregate already carrying a refresh policy. Confirm each of the following before tuning the queue:

TimescaleDB 2.10 or later with CREATE EXTENSION timescaledb run in the target database, and the extension present in shared_preload_libraries so the scheduler daemon starts when the PostgreSQL server boots.
PostgreSQL 14 or later, so that the planner rewrites which union materialized rollups with the raw, not-yet-materialized tail behave as documented.
max_worker_processes set high enough to cover timescaledb.max_background_workers plus PostgreSQL’s own parallel-query and logical-replication workers (autovacuum workers are budgeted separately through autovacuum_max_workers). The scheduler cannot launch a job if the background-worker slots are exhausted.
timescaledb.max_background_workers sized for the total job population (refresh + compression + retention), not just one aggregate. Both this and max_worker_processes require a full PostgreSQL restart; pg_reload_conf() does not apply them.
At least one continuous aggregate exists with a registered refresh policy, created through the materialized view architecture and syntax and scheduled through a refresh policy design and scheduling definition.
The connecting role owns the aggregate views (required to register, alter, or manually run their jobs) and has SELECT on timescaledb_information views for monitoring.
Python automation runs on Python 3.11+ with psycopg v3 installed if you intend to drive the health checks in this guide.

A useful sizing heuristic for the worker budget is to reserve one slot per concurrently firing background job plus headroom for the scheduler itself:

$$\texttt{max\_background\_workers} \ge n_{\text{cagg}} + n_{\text{compression}} + n_{\text{retention}} + 2$$

The trailing +2 covers the scheduler daemon and a spare slot, so a single long-running refresh cannot starve the rest of the instance of worker capacity.
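The heuristic above is plain arithmetic, and a small helper makes it easy to keep in a deployment script. The following is a minimal sketch; the function name and the example job counts are hypothetical, not part of TimescaleDB.

```python
def recommended_background_workers(
    n_cagg: int, n_compression: int, n_retention: int, headroom: int = 2
) -> int:
    """Lower bound for timescaledb.max_background_workers per the heuristic:
    one slot per concurrently firing job plus a fixed headroom."""
    if min(n_cagg, n_compression, n_retention, headroom) < 0:
        raise ValueError("job counts and headroom must be non-negative")
    return n_cagg + n_compression + n_retention + headroom


# Hypothetical instance: 6 refresh policies, 3 compression jobs, 1 retention job.
budget = recommended_background_workers(n_cagg=6, n_compression=3, n_retention=1)
print(budget)  # 12
```

Treat the result as a floor rather than a target: if several jobs rarely fire together, a smaller pool may be enough, and only observation of the running queue can confirm that.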

Step-by-step: from policy to materialized rollup

The nodes in the diagram above — the jobs catalog, the scheduler, the worker pool, and the materialization step — map directly onto the operational steps below.

1. Inspect the current worker budget

Before changing anything, read what the instance actually has. max_worker_processes is the PostgreSQL-wide ceiling; timescaledb.max_background_workers is the slice TimescaleDB may consume from it.

sql

SHOW max_worker_processes;
SHOW timescaledb.max_background_workers;

2. Size and apply the worker pool

Apply the sizing formula and commit the change. Because both parameters are read only at process start, this is a restart operation — plan it into a maintenance window rather than expecting pg_reload_conf() to pick it up.

sql

-- Both settings are read at startup only; a PostgreSQL restart is required.
ALTER SYSTEM SET max_worker_processes = 24;
ALTER SYSTEM SET timescaledb.max_background_workers = 12;

-- After the restart, confirm the new ceilings took effect:
-- SHOW timescaledb.max_background_workers;

3. Register the refresh job on the queue

A continuous aggregate does not enter the queue until a policy is attached. add_continuous_aggregate_policy inserts a row into the jobs catalog with a proc_name of policy_refresh_continuous_aggregate; the scheduler picks it up on the next tick. Keep registration idempotent so CI/CD can re-run it safely.

sql

-- Idempotent registration. Re-running this in a deploy pipeline will not
-- create duplicate jobs for the same aggregate.
SELECT add_continuous_aggregate_policy(
    'sensor_metrics_1h',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes',
    if_not_exists     => true
);
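To see how much data one run of this policy considers, the offsets can be turned into a concrete time range. This is a simplified sketch: it assumes the window is simply now minus each offset and ignores the bucket alignment that the real refresh applies. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def refresh_window(
    now: datetime, start_offset: timedelta, end_offset: timedelta
) -> tuple[datetime, datetime]:
    """Approximate range a policy run considers: [now - start_offset, now - end_offset)."""
    if start_offset <= end_offset:
        raise ValueError("start_offset must be larger than end_offset")
    return now - start_offset, now - end_offset


# Same offsets as the policy above, evaluated at a fixed illustrative instant.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = refresh_window(now, timedelta(hours=3), timedelta(hours=1))
print(start.isoformat(), end.isoformat())
# 2024-01-01T09:00:00+00:00 2024-01-01T11:00:00+00:00
```

With a 30-minute schedule and a two-hour window, consecutive runs overlap heavily, which is harmless because a refresh only materializes ranges that were invalidated since the previous run.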

4. Observe the scheduler dispatching the job

Each due job — ordered by its next_start timestamp — is dispatched to a free worker. A refresh only materializes its own invalidated ranges, so well-spaced policies avoid redundant I/O and excessive WAL. Watch the job move through the catalog and its statistics update after it runs.

sql

SELECT job_id, application_name, schedule_interval, next_start
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate'
ORDER BY next_start;

5. Force a run to validate the path end to end

run_job executes a registered job synchronously in the current session, bypassing the scheduler. Use it to prove the materialization path works and to drain a specific job during an incident — but never call it for a job that is already running, or you will contend on the same materialization hypertable.

sql

-- Run job 1000 now, in this session, instead of waiting for next_start.
CALL run_job(1000);

Five jobs become due at staggered next_start times, but the two-slot pool runs only two at once: A and B execute while C, D and E form a backlog, each starting as a running job frees its slot.
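The backlog behaviour described above can be reproduced with a toy simulation. This is only a sketch of first-come, first-served dispatch over a fixed number of slots; it is not the TimescaleDB scheduler, and the job names, due times, and durations are invented for illustration.

```python
import heapq


def simulate_pool(jobs: list[tuple[str, float, float]], slots: int) -> dict[str, float]:
    """Return each job's start time when at most `slots` jobs run at once.

    jobs: (name, next_start, duration) tuples, in any order.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    free_at = [0.0] * slots  # min-heap of the times at which each slot frees up
    heapq.heapify(free_at)
    starts: dict[str, float] = {}
    for name, due, duration in sorted(jobs, key=lambda job: job[1]):
        slot_free = heapq.heappop(free_at)  # earliest available slot
        start = max(due, slot_free)         # wait for the later of the two
        starts[name] = start
        heapq.heappush(free_at, start + duration)
    return starts


# Five jobs due one minute apart; A and B are long, C to E are short.
jobs = [("A", 0, 10), ("B", 1, 10), ("C", 2, 5), ("D", 3, 5), ("E", 4, 5)]
starts = simulate_pool(jobs, slots=2)
lag = {name: starts[name] - due for name, due, _ in jobs}
print(starts)
print(lag)
```

In this run A and B start on time, while C, D, and E each wait for a slot, so their lag is purely queueing delay. Rerunning with `slots=5` removes the lag entirely, which is the effect the sizing heuristic aims for.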

Configuration parameters reference

These are the levers that govern queue throughput and worker behaviour for asynchronous aggregate refresh.

Parameter	Type	Recommended value	Effect
`timescaledb.max_background_workers`	integer (restart)	`n_cagg + n_compression + n_retention + 2`	Ceiling on TimescaleDB-managed workers. Too low and due jobs queue behind occupied slots.
`max_worker_processes`	integer (restart)	`max_background_workers` + parallel query + logical replication + other extensions' workers	PostgreSQL-wide background-worker ceiling. If exhausted, the scheduler cannot launch any job.
`schedule_interval`	interval (per policy)	`30 min`–`1 h` for hourly rollups	How often the job becomes due. Shorter intervals raise queue pressure.
`initial_start`	timestamptz (per policy)	staggered per aggregate	Anchors the phase of the schedule so multiple aggregates do not all fire at the top of the hour.
`max_runtime`	interval (per policy)	`0` (unlimited) or a bound above p99 duration	Caps a single run; on breach the job is stopped and retried, freeing the slot.
`retry_period`	interval (per policy)	`5 min`	Backoff base after a failed run before the scheduler re-queues the job.
`work_mem`	memory (per session)	large enough to avoid disk spills on the refresh sort/hash	Undersized `work_mem` spills merges to disk and inflates refresh duration, holding the slot longer.

Stagger initial_start across your policies so their next_start times do not collide. If ten aggregates all fire at :00, they contend for worker slots simultaneously even when the instance is idle the rest of the hour.

sql

-- Shift one aggregate's schedule phase so it does not fire on the same tick
-- as its siblings. alter_job updates the catalog row in place.
SELECT alter_job(
    (SELECT job_id FROM timescaledb_information.jobs
     WHERE proc_name = 'policy_refresh_continuous_aggregate'
       AND hypertable_name = '_materialized_hypertable_5'),
    next_start => now() + INTERVAL '7 minutes'
);
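When many aggregates need staggering, the phase offsets can be computed rather than chosen by hand. The sketch below spreads first runs evenly across one schedule_interval; the function name and aggregate names are hypothetical, and even spacing is just one reasonable choice.

```python
from datetime import datetime, timedelta, timezone


def staggered_starts(
    anchor: datetime, names: list[str], schedule_interval: timedelta
) -> dict[str, datetime]:
    """Spread first runs evenly across one schedule_interval after `anchor`."""
    if not names:
        return {}
    step = schedule_interval / len(names)
    return {name: anchor + i * step for i, name in enumerate(names)}


anchor = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
plan = staggered_starts(
    anchor,
    ["sensor_metrics_1h", "sensor_metrics_1d", "device_health_1h"],
    timedelta(minutes=30),
)
for name, first_run in plan.items():
    print(name, first_run.isoformat())
```

Each timestamp can then be passed as initial_start when registering a policy, or as next_start in an alter_job call like the one above.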

Integration with adjacent features

The queue does not exist in isolation — it competes with, and depends on, the rest of the lifecycle. The schedule_interval, start_offset, and end_offset set by your refresh policy design and scheduling determine how often jobs enter the queue and how much data each one touches. A tight five-minute interval with a short end offset queues many small jobs; a longer interval queues fewer, heavier jobs. Neither is universally correct — it is a throughput-versus-freshness trade shaped by your ingestion rate.
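One rough way to compare the two shapes of workload is to estimate how much of the pool each keeps busy. The sketch below assumes steady average run durations and treats utilisation as total busy time divided by available slot time; the function name and the sample figures are illustrative, not measured.

```python
from datetime import timedelta


def pool_utilisation(jobs: list[tuple[timedelta, timedelta]], slots: int) -> float:
    """Fraction of worker capacity consumed.

    jobs: (average_duration, schedule_interval) per job.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    demand = sum(duration / interval for duration, interval in jobs)
    return demand / slots


# Twelve small jobs every 5 minutes versus four heavy jobs every hour, 2 slots.
many_small = [(timedelta(seconds=20), timedelta(minutes=5))] * 12
few_heavy = [(timedelta(minutes=10), timedelta(hours=1))] * 4
print(round(pool_utilisation(many_small, slots=2), 2))
print(round(pool_utilisation(few_heavy, slots=2), 2))
```

A value approaching 1.0 leaves no slack for a slow run, and an average says nothing about collisions: jobs that all fire on the same tick can still queue even when average utilisation is low.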

Whether a job does incremental work or a full re-materialization is decided by your choice of incremental versus full refresh strategies. Full refreshes hold a worker slot far longer and are the most common cause of a single job blocking the queue; reserve them for controlled maintenance windows. When a run does fail, the scheduler’s automatic retry is only the first line of defence — durable recovery belongs to error handling and retry mechanisms, which cover PL/pgSQL guards and dead-letter patterns.

The same worker budget is shared with the data retention, compression, and lifecycle automation jobs. A chunk compression scheduling automation policy that fires at the same instant as a heavy refresh will fight it for slots, so include compression and retention jobs in your worker-count math. Finally, refresh cost tracks the physical layout of the source hypertable: chunk sizing decided under your core hypertable architecture and partitioning strategy sets how many chunks each invalidated window must scan, and columnar compression models for high-frequency telemetry change the decompression overhead a refresh pays when its window overlaps compressed chunks. For terabyte-scale tuning of the refresh itself, the incremental refresh performance tuning for large datasets guide goes deep on index pruning, WAL reduction, and merge-memory allocation.

Performance validation

Confirm the queue is healthy by querying the TimescaleDB system views directly. Start with per-job statistics: job_stats reports the outcome, duration, and failure count of every scheduled job.

sql

-- Refresh jobs, worst offenders first. A last_run_status other than 'Success'
-- or a job_status of 'Scheduled' with a next_start far in the past signals a
-- stalled or starved queue.
SELECT s.job_id,
       s.last_run_status,
       s.job_status,
       s.last_run_duration,
       s.total_runs,
       s.total_failures,
       s.next_start
FROM timescaledb_information.job_stats s
JOIN timescaledb_information.jobs j USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
ORDER BY s.total_failures DESC, s.last_run_duration DESC NULLS LAST;
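The rule in the comment above can be applied to fetched rows in application code. This is a minimal sketch that assumes rows arrive as dictionaries keyed by the job_stats column names; the 15-minute lag tolerance is an arbitrary placeholder, and a job that has never run (no last_run_status yet) is deliberately not treated as failed.

```python
from datetime import datetime, timedelta, timezone


def is_suspect(
    row: dict, now: datetime, max_lag: timedelta = timedelta(minutes=15)
) -> bool:
    """Flag a job_stats row that looks failed or starved."""
    if row["last_run_status"] not in (None, "Success"):
        return True  # the last run did not succeed
    next_start = row["next_start"]
    overdue = next_start is not None and now - next_start > max_lag
    return row["job_status"] == "Scheduled" and overdue


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"job_id": 1000, "last_run_status": "Success", "job_status": "Scheduled",
     "next_start": now + timedelta(minutes=5)},
    {"job_id": 1001, "last_run_status": "Success", "job_status": "Scheduled",
     "next_start": now - timedelta(hours=1)},
    {"job_id": 1002, "last_run_status": "Failed", "job_status": "Scheduled",
     "next_start": now + timedelta(minutes=5)},
]
print([row["job_id"] for row in rows if is_suspect(row, now)])  # [1001, 1002]
```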

To confirm workers are actually starving rather than idle, compare the number of due jobs against the configured ceiling. If due jobs consistently exceed available slots, the queue is saturated and refresh lag will grow.

sql

-- Count jobs already due but not yet finished this cycle versus the ceiling.
SELECT
  (SELECT count(*) FROM timescaledb_information.jobs
   WHERE next_start <= now()) AS jobs_due_now,
  current_setting('timescaledb.max_background_workers')::int AS worker_ceiling;

You can also watch live execution in pg_stat_activity, where running refresh jobs appear with a recognisable application name and query text.

sql

SELECT pid, application_name, state, wait_event_type, wait_event, query_start
FROM pg_stat_activity
WHERE application_name LIKE 'Refresh Continuous Aggregate Policy%'
ORDER BY query_start;

Automated queue health check in Python

For continuous monitoring, poll job_stats on a schedule and alert when failure counts cross a threshold, optionally draining a specific stalled job with run_job. This uses psycopg v3’s async API.

python

import asyncio
import logging

import psycopg
from psycopg.rows import dict_row

async def check_aggregate_queue(dsn: str, failure_threshold: int = 3) -> list[dict]:
    """Return refresh jobs that have exceeded the failure threshold, and
    attempt a single synchronous drain of each via run_job()."""
    # autocommit=True keeps CALL run_job() out of an explicit transaction block;
    # a procedure that commits internally cannot run inside one.
    async with await psycopg.AsyncConnection.connect(
        dsn, autocommit=True, row_factory=dict_row
    ) as conn:
        async with conn.cursor() as cur:
            # proc_name lives on jobs; run stats live on job_stats. Join on job_id.
            await cur.execute(
                """
                SELECT s.job_id, s.last_run_status, s.job_status,
                       s.total_failures, s.next_start
                FROM timescaledb_information.job_stats s
                JOIN timescaledb_information.jobs j USING (job_id)
                WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
                  AND s.last_run_status <> 'Success'
                  AND s.total_failures > %s
                """,
                (failure_threshold,),
            )
            stalled = await cur.fetchall()

            if stalled:
                logging.warning("%d refresh job(s) over the failure threshold", len(stalled))
                for row in stalled:
                    # Only drain jobs that are not currently executing.
                    if row["job_status"] != "Running":
                        await cur.execute("CALL run_job(%s)", (row["job_id"],))
                        logging.info("Drained refresh job %s", row["job_id"])
            return stalled

# asyncio.run(check_aggregate_queue("postgresql://user:pass@host:5432/telemetry"))

Troubleshooting

Jobs are Scheduled but their next_start keeps slipping into the past. The worker pool is fully occupied. Every slot in timescaledb.max_background_workers is held by a running job, so newly due jobs wait. Raise the worker ceiling per the sizing formula, and check for a single long-running full refresh hogging a slot. Confirm with the jobs_due_now versus worker_ceiling query above.

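The scheduler logs “out of background worker slots”, sometimes alongside FATAL: sorry, too many clients already. The first message means max_worker_processes is exhausted before TimescaleDB gets its share: parallel query, logical replication, and other extensions’ workers all draw from the same pool. Increase max_worker_processes above the sum of every worker class and restart. FATAL: sorry, too many clients already is PostgreSQL’s max_connections error, so if it appears, check client connection counts as well.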
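A quick arithmetic check shows whether max_worker_processes can actually cover every worker class. The sketch below is an assumption-laden estimate: the function name, the `other_workers` bucket, and the default `reserve` of 3 are illustrative choices, not values mandated by PostgreSQL or TimescaleDB.

```python
def worker_process_shortfall(
    max_worker_processes: int,
    timescaledb_workers: int,
    max_parallel_workers: int,
    other_workers: int = 0,
    reserve: int = 3,
) -> int:
    """Return how many more worker processes are needed (0 if the ceiling suffices)."""
    required = timescaledb_workers + max_parallel_workers + other_workers + reserve
    return max(0, required - max_worker_processes)


# Values from step 2 (24 and 12), plus a hypothetical 8 parallel workers and
# 2 workers for logical replication or other extensions.
print(worker_process_shortfall(24, timescaledb_workers=12,
                               max_parallel_workers=8, other_workers=2))  # 1
```

A non-zero result suggests the ceiling is too tight for the worst case in which every class is busy at once. Whether that worst case ever occurs depends on the workload.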

A refresh job errors with deadlock detected or canceling statement due to lock timeout. Two jobs are materializing overlapping ranges, or a manual run_job collided with the scheduled run. Stagger initial_start so schedules do not coincide, and never call run_job on a job whose job_status is Running. Inspect the exact error in timescaledb_information.job_errors.

Aggregate is stale even though the job reports Success. The policy’s start_offset/end_offset window excludes the data you expected, or the job is materializing a different range than you think. This is a scheduling-window issue rather than a queue issue; see refresh policy design and scheduling for offset alignment, and query job_errors to rule out silent partial failures.

Setting timescaledb.max_background_workers had no effect. The parameter is read only at process start. ALTER SYSTEM SET ... followed by pg_reload_conf() updates the file but not the running value; a full PostgreSQL restart is required. Verify with SHOW timescaledb.max_background_workers; after the restart.

Frequently Asked Questions

Why is my continuous aggregate falling behind even though every job reports success?

Success only means the job completed the range it was told to refresh. If jobs are due more often than the pool can run them, each run succeeds but the backlog grows — the queue is saturated. Compare jobs_due_now to your worker ceiling and either raise timescaledb.max_background_workers, lengthen schedule_interval, or stagger initial_start.

Does timescaledb.max_background_workers take effect without a restart?

No. Both it and max_worker_processes are read at process start. ALTER SYSTEM SET plus pg_reload_conf() writes the value to postgresql.auto.conf but the running scheduler keeps the old ceiling until a full restart.

Is it safe to call run_job on a policy that is currently scheduled?

Calling run_job on a job that is not currently executing is safe and runs it synchronously in your session. Calling it on a job whose job_status is Running risks lock contention on the same materialization hypertable, so gate manual drains on the job not already running.

How many workers does one continuous aggregate refresh consume?

One background-worker slot per firing job. A refresh may additionally spawn PostgreSQL parallel workers for its internal sort or hash. Those are capped by max_parallel_workers and taken from max_worker_processes, not from the TimescaleDB background-worker budget, so size both.

Should compression and retention jobs count against the refresh worker budget?

Yes. Compression, retention, and refresh jobs all share the same timescaledb.max_background_workers pool. Size the ceiling for the total scheduled job population, or a compression run firing on the same tick as a heavy refresh will starve the queue.

Refresh Policy Design & Scheduling — set the schedule_interval and offsets that govern when jobs enter this queue.
Materialized View Architecture & Syntax — how the rollups a worker materializes are stored and queried.
Incremental vs Full Refresh Strategies — decide how much work each queued job performs.
Error Handling & Retry Mechanisms — durable recovery when a queued refresh fails.
Incremental Refresh Performance Tuning for Large Datasets — terabyte-scale tuning of the refresh each worker runs.
