# Monitoring and Observability

### Overview

When deploying Appmixer in a self-managed environment, implementing comprehensive monitoring and observability is crucial for maintaining system health, performance, and reliability. This guide provides recommendations for monitoring both the infrastructure components and the Appmixer application itself.

Appmixer is a Node.js-based application that depends on several infrastructure components:

* MongoDB - Primary data store
* Redis - Caching
* RabbitMQ - Message queue for asynchronous processing
* Elasticsearch - Log storage and search
* Logstash - Log processing

This document is organized into two main sections:

* Application Monitoring - Appmixer-specific metrics and health indicators
* Infrastructure Monitoring - Guidelines for monitoring the underlying services

> **Note:** We are actively working on expanding this documentation. Future updates will include additional metrics, detailed dashboards, and downloadable configuration files (Grafana dashboards, Prometheus configs, alerting rules) to help you set up monitoring faster. Check back regularly for updates or contact support if you need assistance with your monitoring setup.

### Application Monitoring

#### Appmixer Application Metrics

This section covers monitoring specific to the Appmixer Node.js application. The following areas should be monitored to ensure optimal application performance:

#### Application Health Endpoints

Appmixer provides two primary health check endpoints for monitoring application availability and system health:

**Root Endpoint: GET /**

The root endpoint provides a liveness check and returns basic API information. This endpoint does not require authentication and can be used for:

* Kubernetes/OpenShift liveness probes - To detect if the pod is responsive
* Load balancer health checks - To verify the service is accepting connections
* Basic availability monitoring - To ensure the HTTP server is running

**Response Format:**

```json
{
  "name": "appmixer",
  "version": "6.2.0",
  "url": "http://api.appmixer.com",
  "studioUrl": "https://studio.appmixer.com",
  "integrationsUrl": "https://studio.appmixer.com/integrations"
}
```

**Response Fields:**

* `name` - Application name
* `version` - Appmixer version
* `url` - API endpoint URL
* `studioUrl` - Studio UI URL
* `integrationsUrl` - Integrations marketplace URL

**Monitoring Recommendations:**

* Use this endpoint for basic availability monitoring
* Expected response time: < 100ms
* Expected HTTP status: 200 OK
* Alert if response time > 500ms or status != 200
* Configure as liveness probe in OpenShift/Kubernetes
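
For example, a liveness probe against the root endpoint might look like the following sketch. The container port and timings are assumptions; adjust them to match your deployment:

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 2200          # assumed Appmixer API container port; use your own
  initialDelaySeconds: 30
  periodSeconds: 15
  timeoutSeconds: 5     # well above the expected < 100ms response time
  failureThreshold: 3
```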

**System Health Endpoint: GET /system/health**

The system health endpoint provides detailed metrics about the Appmixer application's internal state and performance. This endpoint is designed for deep health monitoring and diagnostics.

**Authentication:**

* Requires API key authentication OR
* JWT authentication with admin scope

**Response Format:**

```json
{
  "inputQueue": {
    "messageCount": 42
  },
  "eventsListeners": {
    "total": 150,
    "byType": {
      "webhook": 80,
      "scheduled": 70
    }
  },
  "events": 1523,
  "listeners": 245,
  "actions": 890,
  "slowQueue": {
    "count": 5,
    "top": [
      {
        "flowId": "flow-123",
        "count": 3
      }
    ]
  },
  "systemWebhooks": {
    "registered": 12,
    "active": 10
  }
}
```

**Response Fields:**

* `inputQueue.messageCount` - Number of messages waiting in the main input queue for processing
  * This is the most critical metric for monitoring system load
  * Normal range: 0-1000 messages
  * Warning threshold: > 2000 messages
  * Critical threshold: > 10000 messages
  * High queue length indicates processing bottleneck or insufficient resources
* `eventsListeners` - Statistics about registered event listeners (webhooks, triggers)
  * `total` - Total number of active event listeners
  * `byType` - Breakdown by listener type
  * Helps monitor integration activity
* `events` - Total count of events in the system
  * Represents pending events waiting for listeners to process
  * High count may indicate listener processing issues
* `listeners` - Count of flow listeners
  * Number of components waiting for incoming data
* `slowQueue.count` - Number of flows currently in the slow queue
  * Flows experiencing repeated failures or slow performance
  * Should be monitored for troubleshooting
* `slowQueue.top` - Top 100 flows by slow queue occurrence
  * Identifies problematic flows requiring attention
  * Each entry includes `flowId` and `count`
* `systemWebhooks` - System webhook statistics (only available on worker nodes)
  * `registered` - Number of system webhooks configured
  * `active` - Number of currently active webhooks

**Monitoring Recommendations:**

Critical Metrics to Monitor:

* `inputQueue.messageCount` - Alert if > 2000 (warning) or > 10000 (critical)
* HTTP status - Alert if not 200 OK

Warning Indicators:

* `slowQueue.count` increasing over time
* `events` count growing continuously
* Response time degradation

Alert Examples:

* Critical: Input queue > 10000 messages for > 5 minutes
* Warning: Input queue > 2000 messages for > 10 minutes
* Warning: Slow queue count > 10 flows
* Info: System webhooks registered != active

Dashboard Visualization:

* Line chart: `inputQueue.messageCount` over time
* Gauge: Current input queue size vs thresholds
* Table: Top slow queue flows
* Counter: Total events, listeners, actions

**Example cURL Request:**

```bash
# Using API key authentication
curl -H "X-API-Key: your-api-key" https://api.appmixer.com/system/health

# Using JWT token with admin scope
curl -H "Authorization: Bearer your-jwt-token" https://api.appmixer.com/system/health
```
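
For ad-hoc checks or simple cron-based alerting, the same response can be filtered with `jq` (assumed to be installed). The URL, API key, and threshold below are illustrative:

```bash
#!/usr/bin/env bash
# Simple threshold check against /system/health; exits non-zero on breach.
THRESHOLD=2000
COUNT=$(curl -sf -H "X-API-Key: your-api-key" \
  https://api.appmixer.com/system/health | jq '.inputQueue.messageCount')

# Guard against a failed request or missing field
if ! [ "$COUNT" -ge 0 ] 2>/dev/null; then
  echo "UNKNOWN: could not read inputQueue.messageCount"
  exit 2
fi

if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "WARNING: input queue at $COUNT messages (threshold $THRESHOLD)"
  exit 1
fi
echo "OK: input queue at $COUNT messages"
```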

**Integration with Monitoring Tools:**

The `/system/health` endpoint returns JSON rather than the Prometheus exposition format, so Prometheus cannot scrape it directly. One common approach is to put a JSON-to-metrics bridge such as the prometheus-community `json_exporter` in front of it and point Prometheus at the exporter:

```yaml
scrape_configs:
  - job_name: 'appmixer-health'
    metrics_path: /probe
    static_configs:
      - targets: ['https://api.appmixer.com/system/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Address of the json_exporter instance that fetches and translates the JSON
      - target_label: __address__
        replacement: 'json-exporter:7979'
    scrape_interval: 60s
```

The exporter's own configuration maps JSON fields (for example `inputQueue.messageCount`) to metric names and handles authentication against the endpoint.
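
Once the queue depth is exposed as a metric, the thresholds above translate directly into alerting rules. A sketch, assuming the exporter publishes the value under the hypothetical name `appmixer_input_queue_message_count`:

```yaml
groups:
  - name: appmixer-health
    rules:
      - alert: AppmixerInputQueueWarning
        expr: appmixer_input_queue_message_count > 2000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Appmixer input queue above 2000 messages for 10 minutes"
      - alert: AppmixerInputQueueCritical
        expr: appmixer_input_queue_message_count > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Appmixer input queue above 10000 messages for 5 minutes"
```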

#### Flow Execution Metrics

*This section will be expanded in a future update to include metrics for workflow execution times, success/failure rates, and active flow monitoring. For now, we recommend monitoring the `inputQueue.messageCount` and `slowQueue` metrics from the `/system/health` endpoint as primary indicators of flow execution health.*

#### Component Performance

*Detailed component-level performance monitoring guidance is planned for a future documentation update. In the meantime, standard Node.js application monitoring practices apply, and errors from individual connectors will appear in your Elasticsearch logs.*

### Infrastructure Monitoring

#### General Principles

For each infrastructure component, we recommend monitoring:

* **Availability** - Is the service up and responding?
* **Performance** - Response times, throughput, and latency
* **Resource Utilization** - CPU, memory, disk, and network usage
* **Error Rates** - Connection errors, timeouts, and failures
* **Capacity Planning** - Trends for storage, connections, and load

#### MongoDB Monitoring

MongoDB is the primary data store for Appmixer. Monitor the following metrics:

**Key Metrics**

**Database Performance**

* Query execution time (slow queries)
* Operations per second (reads/writes)
* Document scan rates
* Index usage and efficiency

**Replication (if using replica sets)**

* Replication lag
* Oplog window
* Member health status
* Election events

**Resource Usage**

* Memory utilization (resident and virtual)
* Disk I/O (read/write operations)
* Disk space usage and growth rate
* Network throughput

**Recommended Tools**

* MongoDB Atlas (for managed MongoDB)
* MongoDB Cloud Manager / Ops Manager
* Prometheus with MongoDB exporter
* Datadog, New Relic, or similar APM tools

**Alert Thresholds (Examples)**

* Replication lag > 10 seconds
* Disk usage > 80%
* Connection pool utilization > 90%
* Queries taking longer than 1000ms
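
As one way to watch replication lag from the command line, a `mongosh` snippet can compare each secondary's optime against the primary. This is a sketch that assumes direct access to a replica set member with suitable credentials; the hostname is a placeholder:

```bash
mongosh "mongodb://mongo.example.com:27017" --quiet --eval '
  const s = rs.status();
  // Assumes a primary is currently elected
  const primary = s.members.find(m => m.stateStr === "PRIMARY");
  s.members
    .filter(m => m.stateStr === "SECONDARY")
    .forEach(m => {
      // Lag in seconds between the primary and this secondary
      const lag = (primary.optimeDate - m.optimeDate) / 1000;
      print(`${m.name}: ${lag}s behind primary`);
    });
'
```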

#### Redis Monitoring

Redis is used for caching. Monitor the following:

**Key Metrics**

**Availability**

* Uptime
* Primary-replica sync status
* Connection success rate

**Recommended Tools**

* Redis INFO command
* Prometheus with Redis exporter
* RedisInsight
* Cloud-native monitoring (if using managed Redis)
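
The `INFO` command already exposes most of what is needed; a quick spot check from the shell might look like this (the hostname is a placeholder):

```bash
# Replication role and link health (master_link_status appears on replicas)
redis-cli -h redis.example.com INFO replication | grep -E '^(role|master_link_status)'

# Cache effectiveness: hits vs. misses
redis-cli -h redis.example.com INFO stats | grep -E '^keyspace_(hits|misses)'

# Memory pressure
redis-cli -h redis.example.com INFO memory | grep -E '^(used_memory_human|maxmemory_human)'
```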

#### RabbitMQ Monitoring

RabbitMQ handles asynchronous message processing for Appmixer workflows and tasks.

**Key Metrics**

**Queue Health**

* Queue length (messages ready)
* Messages unacknowledged
* Message rate (publish/deliver/ack)
* Queue growth rate

**Connection and Channels**

* Failed connection attempts

**Node Health**

* Memory usage (high/low watermarks)
* Disk space (free/used)

**Cluster Health (if clustered)**

* Node availability
* Network partition events
* Mirror queue synchronization

**Recommended Tools**

* RabbitMQ Management Plugin
* Prometheus with RabbitMQ exporter
* Datadog, New Relic, or similar APM tools

**Alert Thresholds (Examples)**

* Queue length growing beyond normal capacity
* Memory usage > 80% of high watermark
* No consumers on critical queues
* Disk space < 20% free
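
Queue depth can also be polled from the management API for scripted checks. A sketch, assuming the management plugin is enabled; credentials and host are placeholders:

```bash
# List queues whose ready-message backlog exceeds 2000
curl -s -u monitor:secret http://rabbitmq.example.com:15672/api/queues \
  | jq -r '.[] | select(.messages_ready > 2000)
           | "\(.name): \(.messages_ready) ready, \(.consumers) consumers"'
```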

#### Elasticsearch Monitoring

Elasticsearch stores Appmixer logs and operational data and provides search capabilities over them.

**Key Metrics**

**Cluster Health**

* Cluster status (green/yellow/red)
* Number of nodes

**Resource Usage**

* JVM heap usage
* JVM garbage collection time
* CPU usage per node
* Disk I/O per node

**Storage**

* Total disk space used
* Index size growth rate

**Recommended Tools**

* Kibana Monitoring
* Elasticsearch Monitoring API
* Prometheus with Elasticsearch exporter
* Cloud-native monitoring (if using managed Elasticsearch)

**Alert Thresholds (Examples)**

* Cluster status = red (critical), or yellow for > 5 minutes (warning)
* JVM heap usage > 85%
* Disk usage > 85%
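
For a scripted availability check, the cluster health API returns the status directly (the host is a placeholder):

```bash
# Overall cluster status: green, yellow, or red
curl -s 'http://elasticsearch.example.com:9200/_cluster/health' | jq -r '.status'

# Fuller picture: node count, shard allocation, pending tasks
curl -s 'http://elasticsearch.example.com:9200/_cluster/health?pretty'
```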

#### Logstash Monitoring

Logstash processes and transforms log data before sending it to Elasticsearch.

**Key Metrics**

**Pipeline Performance**

* Events received/filtered/sent

**Recommended Tools**

* Logstash Monitoring API
* Kibana Monitoring
* Prometheus with Logstash exporter

**Alert Thresholds (Examples)**

* JVM heap usage > 85%
* Pipeline event processing duration increasing
* Dead letter queue growing
* Plugin errors increasing
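
The Logstash monitoring API listens on port 9600 by default and reports the per-pipeline event counters behind the metrics above (the host is a placeholder):

```bash
# Events received (in), filtered, and sent (out) per pipeline
curl -s 'http://logstash.example.com:9600/_node/stats/pipelines' \
  | jq '.pipelines | to_entries[] | {pipeline: .key, events: .value.events}'
```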

#### OpenShift Platform Monitoring

When Appmixer runs on OpenShift, leverage the platform's built-in monitoring capabilities:

**Key Aspects**

**Pod Health**

* Pod status (Running/Failed/Pending)
* Restart counts
* Container resource usage vs limits

**Resource Quotas**

* Namespace CPU/memory usage
* Storage usage
* Pod count vs limits

**Network**

* Service availability
* Ingress/route response times
* Network policy effectiveness

**Persistent Volumes**

* Volume usage
* Volume performance metrics
* Volume mount issues

**Tools**

* OpenShift Monitoring (Prometheus-based)
* OpenShift Web Console
* oc CLI monitoring commands
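
A few `oc` commands cover routine spot checks; the `appmixer` namespace below is an assumption, so substitute your own:

```bash
# Pod status and restart counts
oc get pods -n appmixer

# Current CPU/memory usage per pod
oc adm top pods -n appmixer

# Recent warning events (crashes, failed probes, evictions)
oc get events -n appmixer --field-selector type=Warning --sort-by='.lastTimestamp'
```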

### Recommended Monitoring Stack

#### Option 1: Prometheus + Grafana (Open Source)

* Prometheus for metrics collection
* Grafana for visualization and dashboards
* Alertmanager for alert routing
* Exporters for each infrastructure component
* Custom exporters for Appmixer application metrics

#### Option 2: Commercial APM Solutions

* New Relic
* Datadog
* Dynatrace
* AppDynamics

#### Option 3: Hybrid Approach

* Use OpenShift's built-in Prometheus for infrastructure
* Add custom Grafana dashboards
* Integrate with existing enterprise monitoring tools

### Alerting Strategy

#### Alert Severity Levels

**Critical** - Immediate action required, service degradation or outage

* Production outage
* Data loss risk
* Security breach

**Warning** - Attention needed, potential issues developing

* Resource usage approaching limits
* Performance degradation
* Increased error rates

**Info** - Informational, no immediate action needed

* Deployment notifications
* Configuration changes
* Capacity planning indicators

#### Alert Best Practices

* Define clear runbooks for each alert
* Avoid alert fatigue by tuning thresholds
* Use alert aggregation to reduce noise
* Implement escalation policies
* Test alerting channels regularly
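
As one way to wire these practices into Alertmanager, the severity label can drive routing and escalation. A sketch; receiver names, notifier configs, and timings are illustrative:

```yaml
route:
  receiver: team-slack            # default route for anything unmatched
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  routes:
    - matchers: ['severity="critical"']
      receiver: oncall-pager      # page immediately, re-notify hourly
      repeat_interval: 1h
    - matchers: ['severity="warning"']
      receiver: team-slack
      repeat_interval: 12h
receivers:
  - name: oncall-pager
    # pagerduty_configs (or similar) for paging go here
  - name: team-slack
    # slack_configs with your webhook URL go here
```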

### Logging Strategy

#### Log Aggregation

* Set appropriate log retention policies based on compliance requirements and storage capacity

#### Log Levels

Use appropriate log levels:

* **ERROR** - Application errors requiring attention
* **WARN** - Warning conditions
* **INFO** - Informational messages
* **DEBUG** - Detailed diagnostic information (non-production)

#### Key Logs to Monitor

* Application startup/shutdown events
* Authentication and authorization failures
* API request/response logs (with sampling)
* Integration connector errors
* Database connection issues
* Message queue processing errors

### Performance Tuning

Based on monitoring data, consider these tuning areas:

#### Node.js Application

* Adjust worker thread pool size
* Optimize memory limits and heap size
* Enable clustering for horizontal scaling
* Review and optimize slow database queries
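
Two commonly tuned Node.js knobs can be set through the environment. The values below are illustrative, not recommendations; size them from your monitoring data:

```bash
# libuv worker thread pool (used for DNS, fs, crypto, and some compression work)
export UV_THREADPOOL_SIZE=16

# V8 old-generation heap limit in MB; keep it below the container memory limit
export NODE_OPTIONS="--max-old-space-size=4096"
```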

#### Kubernetes Resources

* Right-size CPU and memory requests/limits
* Configure horizontal pod autoscaling (HPA; see the sketch below)
* Implement pod disruption budgets
* Select a storage class that meets persistent volume performance needs
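
A minimal HPA sketch scaling on CPU utilization; the Deployment name and replica bounds are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: appmixer-engine          # assumed Deployment name; match your own
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: appmixer-engine
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```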

### Capacity Planning

Regular capacity planning should review:

#### Growth Trends

* User/tenant growth rate
* Flow execution volume trends
* Data storage growth

#### Resource Utilization

* Average and peak CPU/memory usage
* Database storage growth
* Network bandwidth utilization

#### Performance Baselines

* Establish performance baselines
* Track deviation from baselines
* Plan scaling activities before limits are reached

### Compliance and Security Monitoring

* Monitor access logs for suspicious activity
* Track failed authentication attempts
* Review audit logs for compliance requirements
* Track SSL certificate expiration dates
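
Certificate expiry in particular is easy to script (the hostname is a placeholder):

```bash
# Print the notAfter date of the certificate served by the API endpoint
echo | openssl s_client -connect api.appmixer.com:443 -servername api.appmixer.com 2>/dev/null \
  | openssl x509 -noout -enddate
```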

### Additional Resources

* [OpenShift Monitoring documentation](https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html)
* [MongoDB Operations Best Practices](https://www.mongodb.com/docs/manual/administration/production-notes/)
* [RabbitMQ Monitoring Guide](https://www.rabbitmq.com/monitoring.html)
* [Elasticsearch Monitoring and Observability](https://www.elastic.co/guide/en/elasticsearch/reference/current/monitor-elasticsearch-cluster.html)
