# Retry Mechanism

Appmixer 6.4 introduces a redesigned retry mechanism that provides intelligent error handling, flow fairness, and graceful recovery from system downtime. The new system automatically determines which errors should be retried, prevents any single flow from monopolizing retry resources, and safely handles large retry backlogs without crashing the system.

The retry mechanism is fully backward compatible and enabled by default with production-ready settings. Advanced users can customize the behavior through environment variables.

### Error Classification System

The error classification system intelligently determines which errors should be retried based on error codes and types. This prevents wasting resources retrying errors that will never succeed (like 404 Not Found or 401 Unauthorized).

#### RETRY\_ERROR\_CLASSIFICATION\_ENABLED

Enable or disable intelligent error classification. When enabled, only retriable errors are retried. When disabled, all errors are retried, which preserves the behavior of earlier versions.

**Default value:** `true`

**Non-retriable errors** (sent directly to UnprocessedMessages):

* Client errors: 400, 401, 403, 404, 405, 406, 409, 410, 411, 422, 451
* Configuration errors: EACCES, EINVAL, ENOENT
* Validation errors

**Retriable errors** (will be retried):

* Server errors: 500, 502, 503, 504
* Timeout errors: 408, ETIMEDOUT, ESOCKETTIMEDOUT
* Connection errors: ECONNREFUSED, ECONNRESET, ENOTFOUND, EHOSTUNREACH
* Rate limiting: 429
* Network errors: EAI\_AGAIN
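The two lists above can be read as a lookup table. The following is a minimal sketch of that lookup, not Appmixer's internal code: the function name `is_retriable` and its `retry_unknown` parameter are hypothetical, and validation errors (which have no code in the lists) are not modeled. Codes that appear in neither list fall back to the `RETRY_UNKNOWN_ERRORS` setting.

```python
# Illustrative only: mirrors the lists on this page, not the engine's implementation.
NON_RETRIABLE = {
    "400", "401", "403", "404", "405", "406", "409", "410", "411", "422", "451",
    "EACCES", "EINVAL", "ENOENT",
}
RETRIABLE = {
    "500", "502", "503", "504",                 # server errors
    "408", "ETIMEDOUT", "ESOCKETTIMEDOUT",      # timeouts
    "ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "EHOSTUNREACH",  # connection errors
    "429",                                      # rate limiting
    "EAI_AGAIN",                                # network errors
}


def is_retriable(code, retry_unknown=True):
    """Return True if an error with this code would be retried."""
    key = str(code)  # accept both 404 and "404"
    if key in NON_RETRIABLE:
        return False
    if key in RETRIABLE:
        return True
    # Codes in neither list follow the RETRY_UNKNOWN_ERRORS setting.
    return retry_unknown


print(is_retriable(404))           # False
print(is_retriable("ECONNRESET"))  # True
print(is_retriable("E_SOMETHING", retry_unknown=False))  # False
```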

#### RETRY\_ERROR\_CLASSIFICATIONS

Custom error code overrides for specific error types. This allows you to customize which errors are retriable for your specific integrations.

**Format:** JSON object `{"errorCode": boolean, ...}`

**Default value:** `{}` (uses built-in classification rules)

**Example:**

```yaml
  engine:
    ...
    environment:
      - 'RETRY_ERROR_CLASSIFICATIONS={"CUSTOM_ERROR": true, "400": false}'
    ...
```
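Because the value is a JSON string embedded in an environment variable, it can help to validate it before deploying. The sketch below parses the value from the example above and checks the documented shape (string keys, boolean values). The `classify` helper and the two-entry `builtin` table are hypothetical, and the lookup order (override first, then built-in rule) is an assumption based on the word "overrides".

```python
import json

# The value of RETRY_ERROR_CLASSIFICATIONS from the example above.
raw = '{"CUSTOM_ERROR": true, "400": false}'

overrides = json.loads(raw)

# Check the documented format: every value must be a JSON boolean.
for code, retriable in overrides.items():
    if not isinstance(retriable, bool):
        raise ValueError(f"{code!r} must map to true or false, got {retriable!r}")


def classify(code, overrides, builtin):
    """Assumed lookup order: custom override first, then the built-in rule.

    Returns None when the code is unknown to both tables.
    """
    key = str(code)
    if key in overrides:
        return overrides[key]
    return builtin.get(key)


# A two-entry stand-in for the built-in rules, for illustration only.
builtin = {"400": False, "503": True}

print(classify("CUSTOM_ERROR", overrides, builtin))  # True
print(classify(503, overrides, builtin))             # True
print(classify("E_OTHER", overrides, builtin))       # None
```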

#### RETRY\_UNKNOWN\_ERRORS

Default behavior for unknown or unclassified errors.

**Options:**

* `true` - Retry unknown errors (fail-open, safer) - **Default**
* `false` - Don't retry unknown errors (fail-closed, more conservative)

**Example:**

```yaml
  engine:
    ...
    environment:
      - RETRY_UNKNOWN_ERRORS=false
    ...
```

### Retry Backoff Configuration

Controls the time intervals between retry attempts.

#### RETRY\_BACKOFF

Comma-separated list of delays between retry attempts, expressed in the unit set by `RETRY_BACKOFF_UNITS`. The system applies these intervals in sequence, one per retry attempt.

**Default value:** `"1,5,60,300,720"` (1 minute, 5 minutes, 1 hour, 5 hours, 12 hours)

**Example - faster retries:**

```yaml
  engine:
    ...
    environment:
      - RETRY_BACKOFF=1,2,5,15,60
    ...
```

#### RETRY\_BACKOFF\_UNITS

Time unit for the backoff intervals.

**Options:** `seconds`, `minutes`, `hours`, `days`

**Default value:** `minutes`

**Example - backoff in seconds:**

```yaml
  engine:
    ...
    environment:
      - RETRY_BACKOFF=10,30,60,180,600
      - RETRY_BACKOFF_UNITS=seconds
    ...
```
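To see what a given pair of `RETRY_BACKOFF` and `RETRY_BACKOFF_UNITS` values amounts to, the intervals can be converted to seconds and summed. This is a small sketch for sanity-checking a configuration; the helper name `backoff_delays` is hypothetical, and it assumes the intervals are whole numbers.

```python
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def backoff_delays(backoff="1,5,60,300,720", units="minutes"):
    """Convert RETRY_BACKOFF / RETRY_BACKOFF_UNITS into a list of delays in seconds."""
    if units not in UNIT_SECONDS:
        raise ValueError(f"unsupported unit: {units}")
    return [int(part) * UNIT_SECONDS[units] for part in backoff.split(",")]


default = backoff_delays()
print(default)              # [60, 300, 3600, 18000, 43200]
print(sum(default) / 3600)  # 18.1

# The seconds-based example above spans 880 seconds in total.
print(sum(backoff_delays("10,30,60,180,600", "seconds")))  # 880
```

With the defaults, the last retry is scheduled roughly 18 hours after the first failure, so a dependency that is down for longer than that would exhaust the schedule.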

### Retry Quotas

Retry quotas prevent excessive retry attempts from overwhelming the system. When a quota is exceeded, messages are saved to UnprocessedMessages instead of being retried.

#### QUOTA\_CONTEXT\_RETRY

JSON configuration for retry limits at system, user, and flow levels.

**Default quotas:**

* **System level:** 100,000 retries per hour (global limit across all users)
* **User level:** 10,000 retries per hour per user
* **Flow level:** 1,000 retries per hour per flow

**Format:** JSON array of quota rules

```json
[
  {
    "limit": 100000,
    "scope": "system",
    "windowInSeconds": 3600,
    "name": "retry:global:1h"
  },
  {
    "limit": 10000,
    "scope": "user",
    "windowInSeconds": 3600,
    "name": "retry:user:1h"
  },
  {
    "limit": 1000,
    "scope": "flow",
    "windowInSeconds": 3600,
    "name": "retry:flow:1h"
  }
]
```

**Example - custom quotas:**

```yaml
  engine:
    ...
    environment:
      - QUOTA_CONTEXT_RETRY=[{"limit":200000,"scope":"system","windowInSeconds":3600,"name":"retry:global:1h"},{"limit":20000,"scope":"user","windowInSeconds":3600,"name":"retry:user:1h"}]
    ...
```

{% hint style="info" %}
Retry quotas can also be configured dynamically through the Backoffice System Configuration page.
{% endhint %}
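The layered limits can be pictured as counters keyed by scope and time window: a retry is allowed only if the system, user, and flow counters all have room. The sketch below uses the default rules and a simple fixed-window counter. The class `RetryQuota` and its method names are hypothetical, and the windowing strategy is an assumption; the engine may count differently (for example with sliding windows).

```python
import time
from collections import defaultdict

# Default rules from this page.
RULES = [
    {"limit": 100000, "scope": "system", "windowInSeconds": 3600, "name": "retry:global:1h"},
    {"limit": 10000, "scope": "user", "windowInSeconds": 3600, "name": "retry:user:1h"},
    {"limit": 1000, "scope": "flow", "windowInSeconds": 3600, "name": "retry:flow:1h"},
]


class RetryQuota:
    """Fixed-window counters, one per (rule, scope id, window). Illustrative only."""

    def __init__(self, rules):
        self.rules = rules
        self.counts = defaultdict(int)

    def _key(self, rule, user_id, flow_id, now):
        scope_id = {"system": "*", "user": user_id, "flow": flow_id}[rule["scope"]]
        window = int(now // rule["windowInSeconds"])
        return (rule["name"], scope_id, window)

    def try_retry(self, user_id, flow_id, now=None):
        """Count the retry and return True only if every rule still has room."""
        now = time.time() if now is None else now
        keys = [(self._key(r, user_id, flow_id, now), r["limit"]) for r in self.rules]
        if any(self.counts[k] >= limit for k, limit in keys):
            return False  # the message would go to UnprocessedMessages instead
        for k, _ in keys:
            self.counts[k] += 1
        return True


quota = RetryQuota(RULES)
allowed = sum(quota.try_retry("user-1", "flow-a", now=0) for _ in range(1500))
print(allowed)                                     # 1000 (flow limit reached)
print(quota.try_retry("user-1", "flow-b", now=0))  # True (other flow unaffected)
```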

### Advanced Tuning (Optional)

These settings control the DelayedMessages job that processes retries. The default values are production-ready for most deployments.

#### DELAYED\_MESSAGE\_CONCURRENCY\_GREEN\_STATE

Maximum retry messages processed per second when the system is healthy (InputQueue in green state).

**Default value:** `20` messages/second

#### DELAYED\_MESSAGE\_CONCURRENCY\_YELLOW\_STATE

Maximum retry messages processed per second when the system is under stress (InputQueue in yellow state).

**Default value:** `5` messages/second

{% hint style="info" %}
When InputQueue reaches the red state, retry processing automatically slows to 1 message/second to prevent system overload.
{% endhint %}
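These rates put a floor on how long a backlog takes to clear. As a rough estimate, assuming the InputQueue stays in one state and no new retries arrive, the drain time is the backlog size divided by the rate. The helper `drain_hours` is hypothetical and the figures are back-of-the-envelope, not measured.

```python
# Messages per second for each InputQueue state (defaults from this page).
RATES = {"green": 20, "yellow": 5, "red": 1}


def drain_hours(backlog, state="green", rates=RATES):
    """Hours needed to process `backlog` retries at a constant rate."""
    return backlog / rates[state] / 3600


for state in ("green", "yellow", "red"):
    print(state, round(drain_hours(700_000, state), 1))
# green 9.7
# yellow 38.9
# red 194.4
```

If the estimate for the green state is too long for your recovery targets, that is a signal to consider raising `DELAYED_MESSAGE_CONCURRENCY_GREEN_STATE`.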

#### DELAYED\_MESSAGES\_BATCH\_SIZE\_PER\_FLOW

The maximum number of retry messages processed per flow per round in the fair scheduling algorithm. This ensures no single flow monopolizes retry processing.

**Default value:** `100` messages per flow per round

#### DELAYED\_MESSAGES\_TIME\_WINDOW\_MS

Time window size for round-robin scheduling between flows.

**Default value:** `10000` milliseconds (10 seconds)

**Example - high-volume environment tuning:**

```yaml
  engine:
    ...
    environment:
      - DELAYED_MESSAGE_CONCURRENCY_GREEN_STATE=50
      - DELAYED_MESSAGE_CONCURRENCY_YELLOW_STATE=10
      - DELAYED_MESSAGES_BATCH_SIZE_PER_FLOW=200
    ...
```
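The effect of `DELAYED_MESSAGES_BATCH_SIZE_PER_FLOW` can be illustrated with a small round-robin loop: each flow hands over at most one batch per round, then goes to the back of the line. This is a sketch of the general technique only; the function `fair_rounds` is hypothetical, it does not model `DELAYED_MESSAGES_TIME_WINDOW_MS` or the concurrency limits, and the engine's actual scheduler may differ.

```python
from collections import deque


def fair_rounds(pending, batch_size=100):
    """Yield (flow_id, batch) pairs, visiting flows round-robin.

    `pending` maps a flow id to its list of waiting retry messages.
    """
    queues = deque((flow_id, deque(msgs)) for flow_id, msgs in pending.items() if msgs)
    while queues:
        flow_id, msgs = queues.popleft()
        # Take at most batch_size messages from this flow in this round.
        batch = [msgs.popleft() for _ in range(min(batch_size, len(msgs)))]
        yield flow_id, batch
        if msgs:
            queues.append((flow_id, msgs))  # back of the line for the next round


pending = {"busy-flow": list(range(250)), "quiet-flow": list(range(3))}
order = [(flow_id, len(batch)) for flow_id, batch in fair_rounds(pending)]
print(order)
# [('busy-flow', 100), ('quiet-flow', 3), ('busy-flow', 100), ('busy-flow', 50)]
```

The quiet flow is served after the busy flow's first 100 messages rather than after all 250, which is the fairness property the batch size controls.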

### Key Behavioral Changes

The redesigned retry mechanism introduces several important improvements:

1. **Intelligent Error Handling**: Not all errors are retried. Non-retriable client errors (most 4XX codes, but not 408 or 429) and validation errors go directly to UnprocessedMessages, preventing wasted retry attempts.
2. **Flow Fairness**: Round-robin scheduling ensures no single flow can monopolize retry processing. By default, each flow processes a maximum of 100 messages per round (`DELAYED_MESSAGES_BATCH_SIZE_PER_FLOW`) before the scheduler moves to the next flow.
3. **Graceful Recovery**: The system safely handles large retry backlogs (tested with 700,000+ accumulated retries) by dynamically adjusting processing speed based on InputQueue health.
4. **Quota Protection**: Retry quotas prevent runaway retry loops and protect system resources.
5. **Circuit Breaker Integration**: Retry processing automatically slows down or stops when the system is under stress, then resumes when capacity is available.

### Backward Compatibility

The retry mechanism redesign is fully backward compatible:

* **No action required**: Default settings work for existing deployments
* **No database changes**: Works with the existing MongoDB schema
* **Graceful migration**: Existing delayed messages are processed with the new system

### Monitoring & Troubleshooting

**Monitor retry health:**

* Watch InputQueue health status (green/yellow/red) in logs
* Track retry backlog size in MongoDB `delayedMessages` collection
* Monitor quota usage in system logs
* Set up alerts for RED/BLACK circuit breaker states

**Common troubleshooting:**

* **High retry backlog**: Increase concurrency settings or investigate the root cause of errors
* **Quota exceeded errors**: Review quota limits or fix underlying integration issues
* **Retries not processing**: Check InputQueue health status and circuit breaker state


