I needed to build a system that fires outbound HTTP callbacks (webhooks/postbacks) to hundreds of different endpoints, tracks delivery success and failure, measures per-endpoint latency percentiles, retries intelligently, and quarantines endpoints that are consistently slow or broken.
This is the kind of infrastructure that looks like a weekend project and becomes a three-month war. Every webhook delivery system I’ve seen in production has the same failure mode: it works fine at low volume, then degrades silently at scale because nobody instrumented the delivery pipeline itself.
This article walks through the full implementation — queue architecture, retry logic with exponential backoff, per-endpoint latency tracking with P50/P95/P99 percentiles, circuit breaker pattern for failing endpoints, and a dead letter queue for permanent failures. All code is Python with PostgreSQL. No external message broker required (though I’ll note where you’d swap in Redis or RabbitMQ for higher throughput).
The Data Model
Everything starts with two tables: one for the delivery queue and one for the latency measurements.
sql
-- Pending and in-flight webhook deliveries
CREATE TABLE webhook_queue (
id SERIAL PRIMARY KEY,
endpoint_id INTEGER NOT NULL,
endpoint_url TEXT NOT NULL,
payload JSONB NOT NULL,
-- Delivery state
status VARCHAR(20) NOT NULL DEFAULT 'pending',
-- pending, in_flight, delivered, failed, dead_letter
attempt_count INTEGER NOT NULL DEFAULT 0,
max_attempts INTEGER NOT NULL DEFAULT 5,
-- Timing
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
next_attempt_at TIMESTAMP NOT NULL DEFAULT NOW(),
delivered_at TIMESTAMP,
last_error TEXT,
-- Response tracking
last_status_code INTEGER,
last_response_ms INTEGER, -- response time in milliseconds
INDEX idx_queue_next (status, next_attempt_at)
WHERE status IN ('pending', 'failed')
);
-- Per-endpoint latency measurements (ring buffer)
CREATE TABLE endpoint_latency (
id SERIAL PRIMARY KEY,
endpoint_id INTEGER NOT NULL,
response_ms INTEGER NOT NULL,
status_code INTEGER,
measured_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_latency_endpoint (endpoint_id, measured_at)
);
-- Endpoint health state (circuit breaker)
CREATE TABLE endpoint_health (
endpoint_id INTEGER PRIMARY KEY,
endpoint_url TEXT NOT NULL,
state VARCHAR(20) NOT NULL DEFAULT 'closed',
-- closed (healthy), open (broken), half_open (testing)
consecutive_failures INTEGER NOT NULL DEFAULT 0,
failure_threshold INTEGER NOT NULL DEFAULT 10,
last_failure_at TIMESTAMP,
last_success_at TIMESTAMP,
opened_at TIMESTAMP, -- when circuit opened
cooldown_seconds INTEGER NOT NULL DEFAULT 300, -- 5 min before half_open
-- Latency stats (updated periodically)
p50_ms INTEGER,
p95_ms INTEGER,
p99_ms INTEGER,
sample_count INTEGER NOT NULL DEFAULT 0
);
Three things to note about this schema.
First, the webhook_queue table uses a next_attempt_at column instead of a separate scheduling mechanism. The worker polls for rows where status IN ('pending', 'failed') AND next_attempt_at <= NOW(). This is a poor man’s delay queue and it works fine up to about 10,000 deliveries per minute. Beyond that, swap in a proper message broker.
Second, the endpoint_latency table acts as a ring buffer. I periodically purge rows older than 24 hours. The latency percentiles in endpoint_health are calculated from this rolling window — they represent recent behavior, not historical averages.
Third, the endpoint_health table implements the circuit breaker state machine. More on this below.
The Delivery Worker
The core worker loop is deliberately simple. Complexity belongs in the retry logic and circuit breaker, not in the delivery path itself.
python
import requests
import time
import psycopg2
from psycopg2.extras import RealDictCursor
from datetime import datetime, timedelta
DB_DSN = "postgresql://user:pass@localhost/webhooks"
def get_connection():
return psycopg2.connect(DB_DSN)
def deliver_webhooks(batch_size=50):
"""
Fetch pending webhooks and attempt delivery.
Uses SELECT FOR UPDATE SKIP LOCKED for safe concurrent workers.
"""
conn = get_connection()
cur = conn.cursor(cursor_factory=RealDictCursor)
try:
cur.execute("""
SELECT id, endpoint_id, endpoint_url, payload,
attempt_count, max_attempts
FROM webhook_queue
WHERE status IN ('pending', 'failed')
AND next_attempt_at <= NOW()
ORDER BY next_attempt_at ASC
LIMIT %s
FOR UPDATE SKIP LOCKED
""", (batch_size,))
rows = cur.fetchall()
for row in rows:
# Check circuit breaker before attempting
if is_circuit_open(cur, row['endpoint_id']):
# Don't attempt delivery — reschedule for after cooldown
reschedule_for_cooldown(cur, row['id'], row['endpoint_id'])
continue
# Attempt delivery and measure latency
result = attempt_delivery(
row['endpoint_url'],
row['payload']
)
# Record latency measurement regardless of success/failure
record_latency(
cur,
row['endpoint_id'],
result['response_ms'],
result['status_code']
)
if result['success']:
mark_delivered(cur, row['id'], result)
record_success(cur, row['endpoint_id'])
else:
handle_failure(
cur, row['id'], row['endpoint_id'],
row['attempt_count'], row['max_attempts'],
result
)
conn.commit()
except Exception as e:
conn.rollback()
raise
finally:
cur.close()
conn.close()
def attempt_delivery(url, payload):
"""
Fire the webhook and measure response time.
Returns dict with success, status_code, response_ms, error.
"""
start = time.monotonic()
try:
response = requests.post(
url,
json=payload,
timeout=15, # 15 second hard timeout
headers={
'Content-Type': 'application/json',
'User-Agent': 'WebhookDelivery/1.0',
'X-Delivery-Timestamp': str(int(time.time()))
}
)
elapsed_ms = int((time.monotonic() - start) * 1000)
return {
'success': 200 <= response.status_code < 300,
'status_code': response.status_code,
'response_ms': elapsed_ms,
'error': None if response.ok else f"HTTP {response.status_code}"
}
except requests.Timeout:
elapsed_ms = int((time.monotonic() - start) * 1000)
return {
'success': False,
'status_code': None,
'response_ms': elapsed_ms,
'error': 'timeout_15s'
}
except requests.ConnectionError as e:
elapsed_ms = int((time.monotonic() - start) * 1000)
return {
'success': False,
'status_code': None,
'response_ms': elapsed_ms,
'error': f'connection_error: {str(e)[:200]}'
}
The SELECT FOR UPDATE SKIP LOCKED clause is critical for running multiple worker instances. Without SKIP LOCKED, two workers would block on the same row. With it, each worker grabs a different batch of pending webhooks. This gives you horizontal scaling by simply starting more worker processes.
The time.monotonic() call instead of time.time() is deliberate. time.time() can jump backward during NTP adjustments. time.monotonic() never goes backward, which matters when you’re measuring sub-second latency.
Retry Logic with Exponential Backoff and Jitter
When a delivery fails, the retry timing determines whether your system recovers gracefully or creates a thundering herd that hammers a struggling endpoint.
python
import random
def calculate_next_attempt(attempt_count, base_delay=30, max_delay=3600):
"""
Exponential backoff with full jitter.
Attempt 1: 0-30s
Attempt 2: 0-60s
Attempt 3: 0-120s
Attempt 4: 0-240s
Attempt 5: 0-480s (capped at max_delay)
Full jitter prevents thundering herd when an endpoint
recovers and hundreds of retries fire simultaneously.
"""
exponential_delay = base_delay * (2 ** attempt_count)
capped_delay = min(exponential_delay, max_delay)
jittered_delay = random.uniform(0, capped_delay)
return datetime.utcnow() + timedelta(seconds=jittered_delay)
def handle_failure(cur, webhook_id, endpoint_id,
attempt_count, max_attempts, result):
"""
Handle a failed delivery attempt.
Either retry with backoff or move to dead letter queue.
"""
new_attempt_count = attempt_count + 1
if new_attempt_count >= max_attempts:
# Exhausted retries — dead letter
cur.execute("""
UPDATE webhook_queue
SET status = 'dead_letter',
attempt_count = %s,
last_error = %s,
last_status_code = %s,
last_response_ms = %s
WHERE id = %s
""", (
new_attempt_count, result['error'],
result['status_code'], result['response_ms'],
webhook_id
))
else:
# Schedule retry with backoff
next_attempt = calculate_next_attempt(new_attempt_count)
cur.execute("""
UPDATE webhook_queue
SET status = 'failed',
attempt_count = %s,
next_attempt_at = %s,
last_error = %s,
last_status_code = %s,
last_response_ms = %s
WHERE id = %s
""", (
new_attempt_count, next_attempt,
result['error'], result['status_code'],
result['response_ms'], webhook_id
))
# Update circuit breaker
record_failure(cur, endpoint_id)
Why full jitter instead of decorrelated jitter or equal jitter? AWS published the definitive analysis on this. Full jitter (randomizing between 0 and the exponential cap) produces the lowest total completion time across all clients. Equal jitter (randomizing between half the cap and the full cap) is more conservative but slower to drain the retry backlog. For webhook delivery where you have many independent endpoints, full jitter is the right choice because each endpoint’s retries are independent — you’re not coordinating between them.
Circuit Breaker: Stop Hammering Broken Endpoints
The circuit breaker pattern prevents your system from wasting resources on endpoints that are consistently failing. Without it, a dead endpoint accumulates hundreds of pending retries that all timeout at 15 seconds each — burning your worker capacity on deliveries that will never succeed.
python
def is_circuit_open(cur, endpoint_id):
"""
Check if the circuit breaker is open (endpoint is broken).
If open and cooldown has passed, transition to half_open.
"""
cur.execute("""
SELECT state, opened_at, cooldown_seconds
FROM endpoint_health
WHERE endpoint_id = %s
""", (endpoint_id,))
row = cur.fetchone()
if not row:
return False # no health record = assume healthy
if row['state'] == 'closed':
return False
if row['state'] == 'open':
# Check if cooldown period has elapsed
if row['opened_at'] and row['cooldown_seconds']:
elapsed = (datetime.utcnow() - row['opened_at']).total_seconds()
if elapsed >= row['cooldown_seconds']:
# Transition to half_open — allow one probe
cur.execute("""
UPDATE endpoint_health
SET state = 'half_open'
WHERE endpoint_id = %s
""", (endpoint_id,))
return False # allow the probe delivery
return True # still in cooldown
if row['state'] == 'half_open':
return False # allow probe delivery
return False
def record_failure(cur, endpoint_id):
"""
Record a delivery failure. Open circuit if threshold reached.
"""
cur.execute("""
UPDATE endpoint_health
SET consecutive_failures = consecutive_failures + 1,
last_failure_at = NOW()
WHERE endpoint_id = %s
RETURNING consecutive_failures, failure_threshold, state
""", (endpoint_id,))
row = cur.fetchone()
if not row:
# Create health record on first failure
cur.execute("""
INSERT INTO endpoint_health (endpoint_id, endpoint_url,
consecutive_failures, last_failure_at)
VALUES (%s, '', 1, NOW())
ON CONFLICT (endpoint_id) DO UPDATE
SET consecutive_failures = endpoint_health.consecutive_failures + 1,
last_failure_at = NOW()
""", (endpoint_id,))
return
if row['state'] == 'half_open':
# Probe failed — re-open circuit with longer cooldown
cur.execute("""
UPDATE endpoint_health
SET state = 'open',
opened_at = NOW(),
cooldown_seconds = LEAST(cooldown_seconds * 2, 3600)
WHERE endpoint_id = %s
""", (endpoint_id,))
elif (row['state'] == 'closed' and
row['consecutive_failures'] >= row['failure_threshold']):
# Threshold reached — open circuit
cur.execute("""
UPDATE endpoint_health
SET state = 'open',
opened_at = NOW(),
cooldown_seconds = 300 -- reset to 5 minutes
WHERE endpoint_id = %s
""", (endpoint_id,))
def record_success(cur, endpoint_id):
"""
Record a delivery success. Close circuit if half_open.
"""
cur.execute("""
UPDATE endpoint_health
SET consecutive_failures = 0,
last_success_at = NOW(),
state = 'closed'
WHERE endpoint_id = %s
""", (endpoint_id,))
The half_open → open escalation with doubled cooldown is the detail most implementations miss. If an endpoint fails during the probe (half_open state), you don’t want to retry in another 5 minutes. The endpoint is still broken. Double the cooldown to 10 minutes, then 20, capping at 1 hour. This prevents the circuit breaker from becoming a periodic hammering mechanism.
Latency Percentile Tracking
Averages lie. An endpoint with an average response time of 200ms might respond in 50ms 95% of the time and 3,000ms the other 5%. The average looks fine. The P95 reveals a problem that affects 1 in 20 deliveries.
python
def record_latency(cur, endpoint_id, response_ms, status_code):
"""
Record a latency measurement and update percentile stats.
"""
cur.execute("""
INSERT INTO endpoint_latency
(endpoint_id, response_ms, status_code)
VALUES (%s, %s, %s)
""", (endpoint_id, response_ms, status_code))
def update_latency_percentiles(cur, endpoint_id, window_hours=24):
"""
Calculate P50, P95, P99 from the rolling window.
Uses PostgreSQL's percentile_cont for exact percentiles.
"""
cur.execute("""
SELECT
COUNT(*) as sample_count,
percentile_cont(0.50) WITHIN GROUP
(ORDER BY response_ms) AS p50,
percentile_cont(0.95) WITHIN GROUP
(ORDER BY response_ms) AS p95,
percentile_cont(0.99) WITHIN GROUP
(ORDER BY response_ms) AS p99
FROM endpoint_latency
WHERE endpoint_id = %s
AND measured_at >= NOW() - INTERVAL '%s hours'
""", (endpoint_id, window_hours))
row = cur.fetchone()
if row and row['sample_count'] > 0:
cur.execute("""
UPDATE endpoint_health
SET p50_ms = %s,
p95_ms = %s,
p99_ms = %s,
sample_count = %s
WHERE endpoint_id = %s
""", (
int(row['p50']), int(row['p95']),
int(row['p99']), row['sample_count'],
endpoint_id
))
return row
def get_slow_endpoints(cur, p95_threshold_ms=2000):
"""
Find endpoints whose P95 latency exceeds the threshold.
These are candidates for investigation or circuit opening.
"""
cur.execute("""
SELECT endpoint_id, endpoint_url,
p50_ms, p95_ms, p99_ms, sample_count,
state, consecutive_failures
FROM endpoint_health
WHERE p95_ms > %s
AND sample_count >= 20 -- need sufficient samples
ORDER BY p95_ms DESC
""", (p95_threshold_ms,))
return cur.fetchall()
PostgreSQL’s percentile_cont is an ordered-set aggregate function that computes exact percentiles. For large datasets, you’d switch to percentile_disc (which returns an actual observed value rather than interpolating) or use a t-digest approximation. For webhook delivery with a 24-hour window, exact percentiles on the raw data are fast enough up to about 100,000 measurements per endpoint.
The get_slow_endpoints function is what I run as a scheduled check every 15 minutes. Endpoints with P95 above 2 seconds get flagged for investigation. Endpoints with P95 above 5 seconds get their circuit breaker threshold reduced — they’re allowed fewer consecutive failures before the circuit opens, because each failed delivery ties up a worker thread for the full timeout duration.
Monitoring Webhook Delivery Health
Here’s the monitoring query I run every five minutes. It produces a single-row health summary of the entire delivery pipeline:
sql
SELECT
-- Queue depth
COUNT(*) FILTER (WHERE status = 'pending') AS pending,
COUNT(*) FILTER (WHERE status = 'failed') AS awaiting_retry,
COUNT(*) FILTER (WHERE status = 'in_flight') AS in_flight,
COUNT(*) FILTER (WHERE status = 'dead_letter') AS dead_letter,
-- Delivery rate (last hour)
COUNT(*) FILTER (
WHERE status = 'delivered'
AND delivered_at >= NOW() - INTERVAL '1 hour'
) AS delivered_last_hour,
-- Failure rate (last hour)
COUNT(*) FILTER (
WHERE status IN ('failed', 'dead_letter')
AND created_at >= NOW() - INTERVAL '1 hour'
) AS failed_last_hour,
-- Oldest undelivered
MIN(created_at) FILTER (
WHERE status IN ('pending', 'failed')
) AS oldest_pending,
-- Average delivery latency (last hour, successful only)
AVG(last_response_ms) FILTER (
WHERE status = 'delivered'
AND delivered_at >= NOW() - INTERVAL '1 hour'
) AS avg_delivery_ms_last_hour
FROM webhook_queue;
The oldest_pending value is the most important metric in this query. If it’s older than your maximum retry window (sum of all backoff delays), something is structurally wrong — either the worker is stuck, the endpoint is blackholed, or the queue is growing faster than you can drain it.
I alert on three conditions: dead letter count increasing (endpoints are permanently failing and nobody’s investigating), pending queue age exceeding 30 minutes (delivery is falling behind), and per-endpoint P95 latency exceeding thresholds that indicate degraded delivery reliability. The third one is the early warning signal — latency increases before failures do. An endpoint that was responding in 200ms and starts responding in 3 seconds is about to start timing out.
The Dead Letter Queue Isn’t Just Storage
Most teams implement a dead letter queue as a table where failed webhooks go to die. They check it occasionally during incident response. This is a waste.
The dead letter queue is your most valuable debugging dataset. Every row represents a delivery that your system tried multiple times and gave up on. The pattern of dead letters tells you things the success metrics never will.
python
def analyze_dead_letters(cur, hours=24):
"""
Analyze recent dead letter entries for patterns.
Returns per-endpoint failure analysis.
"""
cur.execute("""
SELECT
endpoint_id,
endpoint_url,
COUNT(*) AS dead_count,
-- Most common error
MODE() WITHIN GROUP (ORDER BY last_error) AS primary_error,
-- Most common status code
MODE() WITHIN GROUP (ORDER BY last_status_code)
AS primary_status_code,
-- Timing
MIN(created_at) AS first_dead,
MAX(created_at) AS last_dead,
-- Average attempts before giving up
AVG(attempt_count)::INTEGER AS avg_attempts
FROM webhook_queue
WHERE status = 'dead_letter'
AND created_at >= NOW() - INTERVAL '%s hours'
GROUP BY endpoint_id, endpoint_url
ORDER BY dead_count DESC
LIMIT 20
""", (hours,))
return cur.fetchall()
When I review dead letters, I’m looking for three patterns.
Cluster failures: 50 dead letters for the same endpoint in the same hour means the endpoint went down and didn’t recover within the retry window. Action: extend retry window or implement manual re-queue.
Status code patterns: A spike in 401/403 dead letters means the endpoint rotated credentials and nobody updated the webhook configuration. A spike in 429 (Too Many Requests) means you’re exceeding their rate limit and need to throttle.
Gradual accumulation: 2-3 dead letters per day for a single endpoint, spread evenly. This is the sneakiest pattern — the endpoint is mostly working but has intermittent failures that exhaust retries over time. The fix is usually increasing max_attempts for that specific endpoint or reducing the timeout.
Running the Worker
The main loop that ties everything together:
python
import signal
import sys
running = True
def shutdown_handler(signum, frame):
global running
running = False
print(f"Received signal {signum}, shutting down gracefully...")
signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)
def main():
print("Webhook delivery worker starting...")
while running:
try:
deliver_webhooks(batch_size=50)
except Exception as e:
print(f"Worker error: {e}")
time.sleep(5) # back off on errors
continue
# Update latency stats every 100 iterations
# (cheap operation, doesn't need to run every loop)
if int(time.time()) % 100 == 0:
conn = get_connection()
cur = conn.cursor(cursor_factory=RealDictCursor)
try:
cur.execute(
"SELECT DISTINCT endpoint_id FROM endpoint_health"
)
for row in cur.fetchall():
update_latency_percentiles(cur, row['endpoint_id'])
conn.commit()
finally:
cur.close()
conn.close()
# Poll interval — 500ms keeps latency low without
# hammering the database
time.sleep(0.5)
print("Worker shut down cleanly.")
if __name__ == '__main__':
main()
The SIGTERM handler is essential for clean shutdowns in containerized environments. When Kubernetes sends SIGTERM, the worker finishes its current batch, commits the transaction, and exits. Without this, you get rows stuck in in_flight status with no worker processing them.
What This System Doesn’t Do (And When You Need More)
This implementation handles up to about 10,000 deliveries per minute on a single PostgreSQL instance with 2-3 worker processes. Beyond that, three changes are needed.
First, replace the PostgreSQL queue with Redis Streams or RabbitMQ. The SELECT FOR UPDATE SKIP LOCKED pattern creates write contention on the queue table at high throughput. A dedicated message broker eliminates this.
Second, add per-endpoint rate limiting. Some receiving endpoints have rate limits (100 requests per minute, 1,000 per hour). Without client-side rate limiting, you’ll blow through their quota and get 429’d. Implement a token bucket per endpoint.
Third, add request signing. HMAC-SHA256 signatures on the payload let the receiving endpoint verify that the webhook came from your system and wasn’t tampered with in transit. This is table stakes for any webhook system that sends financial data.
The system in this article is the foundation. It handles the hard problems — retry logic, circuit breaking, latency measurement, dead letter analysis — that every webhook delivery system needs regardless of scale. The specific components you bolt on (message broker, rate limiter, request signing) depend on your throughput and security requirements.
The part that matters most is the part most teams skip: measuring the delivery system itself. If you can’t answer “what is the P95 delivery latency to Endpoint X over the last 24 hours,” you’re operating blind. Build the instrumentation first. Everything else follows.