Retry Mechanism

Appmixer 6.4 introduces a redesigned retry mechanism that provides intelligent error handling, flow fairness, and graceful recovery from system downtime. The new system automatically determines which errors should be retried, prevents any single flow from monopolizing retry resources, and safely handles large retry backlogs without crashing the system.

The retry mechanism is fully backward compatible and enabled by default with production-ready settings. Advanced users can customize the behavior through environment variables.

Error Classification System

The error classification system decides whether an error should be retried based on its error code and type. This avoids spending resources on retries that can never succeed (for example, 404 Not Found or 401 Unauthorized).

RETRY_ERROR_CLASSIFICATION_ENABLED

Enable or disable intelligent error classification. When enabled, only retriable errors are retried. When disabled, all errors are retried (the backward-compatible behavior).

Default value: true

Non-retriable errors (sent directly to UnprocessedMessages):

  • Client errors: 400, 401, 403, 404, 405, 406, 409, 410, 411, 422, 451

  • Configuration errors: EACCES, EINVAL, ENOENT

  • Validation errors

Retriable errors (will be retried):

  • Server errors: 500, 502, 503, 504

  • Timeout errors: 408, ETIMEDOUT, ESOCKETTIMEDOUT

  • Connection errors: ECONNREFUSED, ECONNRESET, ENOTFOUND, EHOSTUNREACH

  • Rate limiting: 429

  • Network errors: EAI_AGAIN
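The two lists above can be pictured as a simple lookup. The following is a minimal sketch, not Appmixer's actual implementation: the set names and the `classify` function are hypothetical, and validation errors are left out because they are identified by type rather than by code.

```python
# Codes copied from the lists above; the names below are illustrative only.
NON_RETRIABLE = {
    "400", "401", "403", "404", "405", "406", "409", "410", "411", "422", "451",
    "EACCES", "EINVAL", "ENOENT",
}
RETRIABLE = {
    "500", "502", "503", "504",                                    # server errors
    "408", "ETIMEDOUT", "ESOCKETTIMEDOUT",                         # timeouts
    "ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "EHOSTUNREACH",     # connection errors
    "429",                                                         # rate limiting
    "EAI_AGAIN",                                                   # network errors
}


def classify(code):
    """Return 'retry', 'unprocessed', or 'unknown' for an HTTP status or system error code."""
    key = str(code).upper()
    if key in NON_RETRIABLE:
        return "unprocessed"
    if key in RETRIABLE:
        return "retry"
    return "unknown"
```

Codes that appear in neither list come back as `unknown`; how those are handled is governed by RETRY_UNKNOWN_ERRORS.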

RETRY_ERROR_CLASSIFICATIONS

Custom error code overrides for specific error types. This allows you to customize which errors are retriable for your specific integrations.

Format: JSON object {"errorCode": boolean, ...}

Default value: {} (uses built-in classification rules)

Example:
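As an illustrative sketch, assume the variable holds `{"409": true, "503": false}`, meaning "retry 409, never retry 503". Both the value and the helper names are hypothetical, and the assumption that overrides take precedence over the built-in rules is inferred from the description above rather than confirmed.

```python
import json

# Hypothetical value of RETRY_ERROR_CLASSIFICATIONS.
raw_overrides = '{"409": true, "503": false}'

# A small excerpt of the built-in rules (True = retriable).
BUILT_IN = {"409": False, "404": False, "503": True, "429": True}


def load_overrides(raw):
    """Parse the JSON object and keep only entries whose value is a boolean."""
    parsed = json.loads(raw or "{}")
    if not isinstance(parsed, dict):
        raise ValueError("RETRY_ERROR_CLASSIFICATIONS must be a JSON object")
    return {str(k).upper(): v for k, v in parsed.items() if isinstance(v, bool)}


def effective_rules(raw):
    """Assumption: custom overrides win over the built-in classification."""
    rules = dict(BUILT_IN)
    rules.update(load_overrides(raw))
    return rules


rules = effective_rules(raw_overrides)
```

With this input, 409 becomes retriable and 503 becomes non-retriable, while codes not mentioned in the override keep their built-in classification.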

RETRY_UNKNOWN_ERRORS

Default behavior for unknown or unclassified errors.

Options:

  • true - Retry unknown errors (fail-open, safer) - Default

  • false - Don't retry unknown errors (fail-closed, more conservative)

Example:
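The fallback can be sketched as follows. This is a hedged illustration only: `parse_bool` and `should_retry` are hypothetical helpers, and the way the string value is interpreted (anything other than "true" counts as false) is an assumption.

```python
def parse_bool(value, default=True):
    """Interpret an environment-style string; an unset or empty value uses the default."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() == "true"


def should_retry(known_rules, code, retry_unknown_raw=None):
    """Use the known rule when one exists, otherwise fall back to RETRY_UNKNOWN_ERRORS."""
    key = str(code).upper()
    if key in known_rules:
        return known_rules[key]
    return parse_bool(retry_unknown_raw, default=True)


# Excerpt of classified codes (True = retriable); 418 is deliberately absent.
known = {"404": False, "503": True}
```

Here `should_retry(known, 418)` follows the fail-open default, while `should_retry(known, 418, "false")` sends the unclassified error straight to UnprocessedMessages. Classified codes are unaffected by the setting.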

Retry Backoff Configuration

Controls the time intervals between retry attempts.

RETRY_BACKOFF

Comma-separated list of time intervals between retry attempts. The system will retry using these intervals in sequence.

Default value: "1,5,60,300,720" (1 minute, 5 minutes, 1 hour, 5 hours, 12 hours)

Example - faster retries:
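To compare schedules, it can help to parse the value and add up the intervals. The sketch below assumes a hypothetical faster schedule of "1,2,5,10,30"; the helper names are illustrative, and what happens once the list is exhausted is not specified here, so the helper simply returns `None`.

```python
def parse_backoff(raw):
    """Split a comma-separated RETRY_BACKOFF value into integer intervals."""
    return [int(part) for part in raw.split(",") if part.strip()]


def delay_for_attempt(intervals, attempt):
    """Interval before retry number `attempt` (1-based), or None past the end of the list."""
    if 1 <= attempt <= len(intervals):
        return intervals[attempt - 1]
    return None


default = parse_backoff("1,5,60,300,720")
faster = parse_backoff("1,2,5,10,30")  # hypothetical faster schedule

print(sum(default))  # 1086 (minutes between the first failure and the last retry)
print(sum(faster))   # 48
```

With the default units of minutes, the default schedule spreads its five retries over 1086 minutes (about 18 hours), whereas the hypothetical faster schedule finishes within 48 minutes.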

RETRY_BACKOFF_UNITS

Time unit for the backoff intervals.

Options: seconds, minutes, hours, days

Default value: minutes

Example - backoff in seconds:
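The effect of the unit can be shown by converting the same interval list under two settings. This is a minimal sketch with hypothetical names (`UNIT_SECONDS`, `backoff_delays`), not the platform's own conversion code.

```python
from datetime import timedelta

# Seconds per unit for the four documented options.
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def backoff_delays(raw_backoff, units="minutes"):
    """Convert a RETRY_BACKOFF string into timedeltas using RETRY_BACKOFF_UNITS."""
    if units not in UNIT_SECONDS:
        raise ValueError(f"unsupported unit: {units!r}")
    factor = UNIT_SECONDS[units]
    return [timedelta(seconds=int(part) * factor) for part in raw_backoff.split(",")]


in_minutes = backoff_delays("1,5,60,300,720")             # default units
in_seconds = backoff_delays("1,5,60,300,720", "seconds")  # same list, seconds
```

With the unit set to seconds, the default list produces a last interval of 12 minutes instead of 12 hours, which is usually only appropriate for testing or for very short-lived failures.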

Retry Quotas

Retry quotas prevent excessive retry attempts from overwhelming the system. When a quota is exceeded, messages are saved to UnprocessedMessages instead of being retried.

QUOTA_CONTEXT_RETRY

JSON configuration for retry limits at system, user, and flow levels.

Default quotas:

  • System level: 100,000 retries per hour (global limit across all users)

  • User level: 10,000 retries per hour per user

  • Flow level: 1,000 retries per hour per flow
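The interaction between the three levels can be sketched as a set of counters that must all have headroom before a retry is allowed. This is an illustration under stated assumptions: the `RetryQuota` class is hypothetical, it uses fixed one-hour windows (the real windowing strategy is not described here), and it does not reflect the actual QUOTA_CONTEXT_RETRY rule format.

```python
from collections import defaultdict

# Default limits from the list above, per one-hour window.
LIMITS = {"system": 100_000, "user": 10_000, "flow": 1_000}


class RetryQuota:
    """Fixed-window counters at system, user, and flow level (illustrative only)."""

    def __init__(self, limits=LIMITS, window_seconds=3600):
        self.limits = limits
        self.window = window_seconds
        self.counts = defaultdict(int)

    def allow(self, user_id, flow_id, now):
        """Return True and record the retry only if every level is under its limit."""
        bucket = int(now // self.window)
        keys = {
            "system": ("system", bucket),
            "user": ("user", user_id, bucket),
            "flow": ("flow", flow_id, bucket),
        }
        if any(self.counts[key] >= self.limits[level] for level, key in keys.items()):
            return False  # the message would go to UnprocessedMessages
        for key in keys.values():
            self.counts[key] += 1
        return True
```

A single busy flow hits its own limit first, so it cannot consume the whole user or system allowance on its own.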

Format: JSON array of quota rules

Example - custom quotas:

Retry quotas can also be configured dynamically through the Backoffice System Configuration page.

Advanced Tuning (Optional)

These settings control the DelayedMessages job that processes retries. The default values are production-ready for most deployments.

DELAYED_MESSAGE_CONCURRENCY_GREEN_STATE

Maximum retry messages processed per second when the system is healthy (InputQueue in green state).

Default value: 20 messages/second

DELAYED_MESSAGE_CONCURRENCY_YELLOW_STATE

Maximum retry messages processed per second when the system is under stress (InputQueue in yellow state).

Default value: 5 messages/second

When InputQueue reaches the red state, retry processing automatically slows to 1 message/second to prevent system overload.
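Taken together, the three states map to a processing rate as in the sketch below. The function names are hypothetical; the green and yellow defaults come from the settings above, and the red rate is the fixed 1 message/second described in the text.

```python
def retry_rate(queue_state, green_rate=20, yellow_rate=5):
    """Maximum retry messages per second for a given InputQueue state."""
    rates = {"green": green_rate, "yellow": yellow_rate, "red": 1}
    try:
        return rates[queue_state.lower()]
    except KeyError:
        raise ValueError(f"unknown InputQueue state: {queue_state!r}") from None


def min_interval_seconds(queue_state):
    """Smallest gap between two retry messages at the given state's rate."""
    return 1.0 / retry_rate(queue_state)
```

For example, at the default yellow rate of 5 messages/second, consecutive retries are at least 0.2 seconds apart.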

DELAYED_MESSAGES_BATCH_SIZE_PER_FLOW

The maximum number of retry messages processed per flow per round in the fair scheduling algorithm. This ensures no single flow monopolizes retry processing.

Default value: 100 messages per flow per round
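The idea of a per-flow batch limit can be illustrated with a simple round-robin loop. This is a sketch of the general technique, not the DelayedMessages job itself: `fair_rounds` is a hypothetical function, and it ignores the time window and rate limits.

```python
from collections import deque


def fair_rounds(pending, batch_size=100):
    """Yield (flow_id, batch) pairs, visiting flows in turn with a capped batch each."""
    queues = {flow: deque(msgs) for flow, msgs in pending.items() if msgs}
    order = deque(queues)
    while order:
        flow = order.popleft()
        queue = queues[flow]
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        yield flow, batch
        if queue:
            order.append(flow)  # flow still has work: back of the line


# Flow "a" has a large backlog, flow "b" a small one.
pending = {"a": list(range(250)), "b": list(range(30))}
schedule = [(flow, len(batch)) for flow, batch in fair_rounds(pending)]
```

In this example flow "b" is served after the first 100 messages of flow "a" rather than waiting behind all 250, which is the fairness property the setting is meant to provide.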

DELAYED_MESSAGES_TIME_WINDOW_MS

Time window size for round-robin scheduling between flows.

Default value: 10000 milliseconds (10 seconds)

Example - high-volume environment tuning:
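When choosing values for a high-volume environment, a rough lower bound on how long a backlog takes to drain is the backlog size divided by the configured rate. The sketch below uses a hypothetical `drain_hours` helper, and the value of 50 messages/second is an illustrative candidate, not a recommendation.

```python
def drain_hours(backlog, messages_per_second):
    """Lower bound on hours needed to process `backlog` retries at a constant rate."""
    if messages_per_second <= 0:
        raise ValueError("rate must be positive")
    return backlog / messages_per_second / 3600


backlog = 700_000
print(round(drain_hours(backlog, 20), 1))  # 9.7   (default green rate)
print(round(drain_hours(backlog, 5), 1))   # 38.9  (default yellow rate)
print(round(drain_hours(backlog, 50), 1))  # 3.9   (hypothetical tuned green rate)
```

These figures ignore quotas, per-flow batching, and time spent in yellow or red states, so actual recovery takes longer.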

Key Behavioral Changes

The redesigned retry mechanism introduces several important improvements:

  1. Intelligent Error Handling: Not all errors are retried. Non-retriable client errors (most 4XX codes, but not 408 or 429) and validation errors go directly to UnprocessedMessages, preventing wasted retry attempts.

  2. Flow Fairness: Round-robin scheduling ensures no single flow can monopolize retry processing. Each flow processes at most DELAYED_MESSAGES_BATCH_SIZE_PER_FLOW messages (100 by default) per round before the scheduler moves to the next flow.

  3. Graceful Recovery: The system safely handles large retry backlogs (tested with 700,000+ accumulated retries) by dynamically adjusting processing speed based on InputQueue health.

  4. Quota Protection: Retry quotas prevent runaway retry loops and protect system resources.

  5. Circuit Breaker Integration: Retry processing automatically slows down or stops when the system is under stress, then resumes when capacity is available.

Backward Compatibility

The retry mechanism redesign is fully backward compatible:

  • No action required: Default settings work for existing deployments

  • No database changes: Works with the existing MongoDB schema

  • Graceful migration: Existing delayed messages are processed with the new system

Monitoring & Troubleshooting

Monitor retry health:

  • Watch InputQueue health status (green/yellow/red) in logs

  • Track retry backlog size in MongoDB delayedMessages collection

  • Monitor quota usage in system logs

  • Set up alerts for RED/BLACK circuit breaker states

Common troubleshooting:

  • High retry backlog: Increase concurrency settings or investigate the root cause of errors

  • Quota exceeded errors: Review quota limits or fix underlying integration issues

  • Retries not processing: Check InputQueue health status and circuit breaker state
