Monitoring and Observability

Overview

When deploying Appmixer in a self-managed environment, implementing comprehensive monitoring and observability is crucial for maintaining system health, performance, and reliability. This guide provides recommendations for monitoring both the infrastructure components and the Appmixer application itself.

Appmixer is a Node.js-based application that depends on several infrastructure components:

  • MongoDB - Primary data store

  • Redis - Caching

  • RabbitMQ - Message queue for asynchronous processing

  • Elasticsearch - Logs storage

  • Logstash - Log processing

This document is organized into two main sections:

  • Application Monitoring - Appmixer-specific metrics and health indicators

  • Infrastructure Monitoring - Guidelines for monitoring the underlying services

Note: We are actively working on expanding this documentation. Future updates will include additional metrics, detailed dashboards, and downloadable configuration files (Grafana dashboards, Prometheus configs, alerting rules) to help you set up monitoring faster. Check back regularly for updates or contact support if you need assistance with your monitoring setup.

Application Monitoring

Appmixer Application Metrics

This section covers monitoring specific to the Appmixer Node.js application. The following areas should be monitored to ensure optimal application performance:

Application Health Endpoints

Appmixer provides two primary health check endpoints for monitoring application availability and system health:

Root Endpoint: GET /

The root endpoint provides a basic liveness check and returns basic API information. This endpoint does not require authentication and can be used for:

  • Kubernetes/OpenShift liveness probes - To detect if the pod is responsive

  • Load balancer health checks - To verify the service is accepting connections

  • Basic availability monitoring - To ensure the HTTP server is running

Response Fields:

  • name - Application name

  • version - Appmixer version

  • url - API endpoint URL

  • studioUrl - Studio UI URL

  • integrationsUrl - Integrations marketplace URL

Monitoring Recommendations:

  • Use this endpoint for basic availability monitoring

  • Expected response time: < 100ms

  • Expected HTTP status: 200 OK

  • Alert if response time > 500ms or status != 200

  • Configure as liveness probe in OpenShift/Kubernetes
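
The recommendations above can be sketched as a small probe. This is a minimal illustration rather than an official Appmixer client: the function names evaluate_probe and probe_root and the base URL parameter are hypothetical, while the 200 OK and 500 ms alert conditions come from the list above.

```python
import time
import urllib.error
import urllib.request


def evaluate_probe(status: int, elapsed_ms: float, max_ms: float = 500.0) -> str:
    """Return "ok" or "alert" using the alert conditions listed above."""
    if status != 200 or elapsed_ms > max_ms:
        return "alert"
    return "ok"


def probe_root(base_url: str, timeout_s: float = 5.0) -> str:
    """GET the root endpoint (no authentication) and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/", timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server answered, but with an error status
    except OSError:
        return "alert"  # connection refused, DNS failure, timeout, ...
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return evaluate_probe(status, elapsed_ms)
```

Keeping the classification separate from the HTTP call makes the thresholds easy to unit test and to adjust per environment.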

System Health Endpoint: GET /system/health

The system health endpoint provides detailed metrics about the Appmixer application's internal state and performance. This endpoint is designed for deep health monitoring and diagnostics.

Authentication:

  • Requires API key authentication OR

  • JWT authentication with admin scope

Response Fields:

  • inputQueue.messageCount - Number of messages waiting in the main input queue for processing

    • This is the most critical metric for monitoring system load

    • Normal range: 0-1000 messages

    • Warning threshold: > 2000 messages

    • Critical threshold: > 10000 messages

    • High queue length indicates processing bottleneck or insufficient resources

  • eventsListeners - Statistics about registered event listeners (webhooks, triggers)

    • total - Total number of active event listeners

    • byType - Breakdown by listener type

    • Helps monitor integration activity

  • events - Total count of events in the system

    • Represents pending events waiting for listeners to process

    • High count may indicate listener processing issues

  • listeners - Count of flow listeners

    • Number of components waiting for incoming data

  • slowQueue.count - Number of flows currently in the slow queue

    • Flows experiencing repeated failures or slow performance

    • Should be monitored for troubleshooting

  • slowQueue.top - Top 100 flows by slow queue occurrence

    • Identifies problematic flows requiring attention

    • Each entry includes flowId and count

  • systemWebhooks - System webhook statistics (only available on worker nodes)

    • registered - Number of system webhooks configured

    • active - Number of currently active webhooks
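
To make the field list concrete, the following sketch reduces a health payload to a handful of values. It assumes the response is JSON whose nesting mirrors the dotted names above (for example, inputQueue.messageCount stored as {"inputQueue": {"messageCount": ...}}) and that events and listeners are plain counts. The sample payload, including its byType keys, is invented for illustration and is not real Appmixer output.

```python
from typing import Any


def dig(payload: dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Look up a dotted path such as "slowQueue.count", tolerating missing keys."""
    node: Any = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def summarize_health(payload: dict[str, Any]) -> dict[str, Any]:
    """Pick out the fields described above; absent fields become None."""
    top = dig(payload, "slowQueue.top", []) or []
    return {
        "input_queue": dig(payload, "inputQueue.messageCount"),
        "event_listeners": dig(payload, "eventsListeners.total"),
        "events": dig(payload, "events"),
        "listeners": dig(payload, "listeners"),
        "slow_flows": dig(payload, "slowQueue.count"),
        "worst_flow": top[0].get("flowId") if top else None,
        # systemWebhooks is documented as available only on worker nodes.
        "webhooks_registered": dig(payload, "systemWebhooks.registered"),
        "webhooks_active": dig(payload, "systemWebhooks.active"),
    }


# Invented sample payload, shaped after the field list above.
sample = {
    "inputQueue": {"messageCount": 42},
    "eventsListeners": {"total": 7, "byType": {"webhook": 5, "timer": 2}},
    "events": 3,
    "listeners": 12,
    "slowQueue": {"count": 1, "top": [{"flowId": "flow-a", "count": 9}]},
}
summary = summarize_health(sample)
```

Because systemWebhooks may be absent on some nodes, the lookup returns None instead of raising an error.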

Monitoring Recommendations:

Critical Metrics to Monitor:

  • inputQueue.messageCount - Alert if > 2000 (warning) or > 10000 (critical)

  • HTTP status - Alert if not 200 OK

Warning Indicators:

  • slowQueue.count increasing over time

  • events count growing continuously

  • Response time degradation

Alert Examples:

  • Critical: Input queue > 10000 messages for > 5 minutes

  • Warning: Input queue > 2000 messages for > 10 minutes

  • Warning: Slow queue count > 10 flows

  • Info: System webhooks registered != active
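
Alerts of the form "above a threshold for more than N minutes" need a notion of duration. The sketch below shows one way to evaluate such a rule over timestamped samples of inputQueue.messageCount. The class name SustainedThreshold is hypothetical, and a monitoring system's own alert rules would normally take care of this.

```python
from typing import Optional


class SustainedThreshold:
    """Fires once a value has stayed above `limit` for longer than `hold_s` seconds."""

    def __init__(self, limit: float, hold_s: float) -> None:
        self.limit = limit
        self.hold_s = hold_s
        self._breach_started: Optional[float] = None

    def observe(self, timestamp_s: float, value: float) -> bool:
        """Record one sample; return True while the alert should be firing."""
        if value <= self.limit:
            self._breach_started = None  # breach ended, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp_s
        return timestamp_s - self._breach_started > self.hold_s


# Example rule: the 2000-message warning threshold with a 10-minute hold.
rule = SustainedThreshold(limit=2000, hold_s=600)
samples = [(0, 2500), (300, 2600), (660, 2700), (720, 100), (780, 3000)]
firing = [rule.observe(t, v) for t, v in samples]
```

In this run the rule fires only at the third sample, after the queue has been above the limit for 11 minutes. The drop at the fourth sample resets the timer.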

Dashboard Visualization:

  • Line chart: inputQueue.messageCount over time

  • Gauge: Current input queue size vs thresholds

  • Table: Top slow queue flows

  • Counter: Total events, listeners, and event listeners

Example cURL Request:
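
A request to the system health endpoint can be sketched in Python as follows. The base URL and token are placeholders, and sending the JWT in an Authorization: Bearer header is an assumption. Check the Appmixer API reference for the exact header your version expects, especially for API key authentication.

```python
import json
import urllib.request


def build_health_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /system/health."""
    url = base_url.rstrip("/") + "/system/health"
    # Assumption: the admin-scoped JWT is sent as a bearer token.
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_health(base_url: str, token: str, timeout_s: float = 10.0) -> dict:
    """Send the request and decode the body, assuming a JSON response."""
    request = build_health_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder values for illustration only.
req = build_health_request("https://api.appmixer.example/", "<ADMIN_JWT>")
```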

Integration with Monitoring Tools:

Prometheus scraper configuration example:
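
Prometheus scrapes metrics in its text exposition format, so a JSON health payload has to be translated first, for example by a small custom exporter. The sketch below covers only that translation step. The metric names (appmixer_input_queue_messages and so on) are hypothetical, and the payload nesting is assumed to follow the dotted field names listed earlier.

```python
def to_prometheus_text(health: dict) -> str:
    """Render selected health fields as Prometheus gauge samples."""
    # Hypothetical metric names mapped to (parent key, child key) in the payload.
    mapping = {
        "appmixer_input_queue_messages": ("inputQueue", "messageCount"),
        "appmixer_slow_queue_flows": ("slowQueue", "count"),
        "appmixer_event_listeners": ("eventsListeners", "total"),
    }
    lines = []
    for metric, (parent, child) in mapping.items():
        value = (health.get(parent) or {}).get(child)
        # Emit only real numbers; missing fields are skipped.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(f"# TYPE {metric} gauge")
            lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


text = to_prometheus_text({"inputQueue": {"messageCount": 42}, "slowQueue": {"count": 1}})
```

The resulting text could be served over HTTP by an exporter process and scraped like any other target. An existing JSON-to-Prometheus exporter could express the same mapping as configuration.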

Flow Execution Metrics

This section will be expanded in a future update to include metrics for workflow execution times, success/failure rates, and active flow monitoring. For now, we recommend monitoring the inputQueue.messageCount and slowQueue metrics from the /system/health endpoint as primary indicators of flow execution health.

Component Performance

Detailed component-level performance monitoring guidance is planned for a future documentation update. In the meantime, standard Node.js application monitoring practices apply, and errors from individual connectors will appear in your Elasticsearch logs.

Infrastructure Monitoring

General Principles

For each infrastructure component, we recommend monitoring:

  • Availability - Is the service up and responding?

  • Performance - Response times, throughput, and latency

  • Resource Utilization - CPU, memory, disk, and network usage

  • Error Rates - Connection errors, timeouts, and failures

  • Capacity Planning - Trends for storage, connections, and load

MongoDB Monitoring

MongoDB is the primary data store for Appmixer. Monitor the following metrics:

Key Metrics

Database Performance

  • Query execution time (slow queries)

  • Operations per second (reads/writes)

  • Document scan rates

  • Index usage and efficiency

Replication (if using replica sets)

  • Replication lag

  • Oplog window

  • Member health status

  • Election events

Resource Usage

  • Memory utilization (resident and virtual)

  • Disk I/O (read/write operations)

  • Disk space usage and growth rate

  • Network throughput

Recommended Tools

  • MongoDB Atlas (for managed MongoDB)

  • MongoDB Cloud Manager / Ops Manager

  • Prometheus with MongoDB exporter

  • Datadog, New Relic, or similar APM tools

Alert Thresholds (Examples)

  • Replication lag > 10 seconds

  • Disk usage > 80%

  • Connection pool utilization > 90%

  • Queries taking longer than 1000 ms
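
Replication lag can be derived from the output of MongoDB's replSetGetStatus command by comparing each secondary's optimeDate with the primary's. The helper below is a sketch that works on an already fetched status document (for example, one returned by a driver). The function name and the sample data are illustrative.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status: dict) -> dict[str, float]:
    """Map each SECONDARY member name to its lag behind the PRIMARY, in seconds."""
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary elected, so lag is undefined
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "mongo-0:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "mongo-1:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=3)},
        {"name": "mongo-2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=25)},
    ]
}
lags = replication_lag_seconds(sample_status)
lagging = [name for name, lag in lags.items() if lag > 10]  # example threshold from above
```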

Redis Monitoring

Redis is used for caching. Monitor the following:

Key Metrics

Availability

  • Uptime

  • Master-slave sync status

  • Connection success rate

Recommended Tools

  • Redis INFO command

  • Prometheus with Redis exporter

  • RedisInsight

  • Cloud-native monitoring (if using managed Redis)
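
The Redis INFO command returns plain key:value lines grouped under "# Section" headers, which makes the availability metrics above easy to extract. The parser below is a minimal sketch: the helper names are hypothetical, and the sample text is a shortened, hand-written excerpt rather than real server output.

```python
def parse_redis_info(raw: str) -> dict[str, str]:
    """Parse the text returned by the Redis INFO command into a flat dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# Section" headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


def replica_in_sync(info: dict[str, str]) -> bool:
    """A master is trivially in sync; a replica must report master_link_status:up."""
    if info.get("role") == "master":
        return True
    return info.get("master_link_status") == "up"


sample_info = (
    "# Server\r\nuptime_in_seconds:86400\r\n\r\n"
    "# Replication\r\nrole:slave\r\nmaster_link_status:down\r\n"
)
parsed = parse_redis_info(sample_info)
```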

RabbitMQ Monitoring

RabbitMQ handles asynchronous message processing for Appmixer workflows and tasks.

Key Metrics

Queue Health

  • Queue length (messages ready)

  • Messages unacknowledged

  • Message rate (publish/deliver/ack)

  • Queue growth rate

Connection and Channels

  • Failed connection attempts

Node Health

  • Memory usage (high/low watermarks)

  • Disk space (free/used)

Cluster Health (if clustered)

  • Node availability

  • Network partition events

  • Mirrored queue synchronization

Recommended Tools

  • RabbitMQ Management Plugin

  • Prometheus with RabbitMQ exporter

  • Datadog, New Relic, or similar APM tools

Alert Thresholds (Examples)

  • Queue length growing beyond normal capacity

  • Memory usage > 80% of high watermark

  • No consumers on critical queues

  • Disk space < 20% free
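
The queue-level checks above can be evaluated from the queue list returned by the RabbitMQ management plugin's HTTP API (GET /api/queues), which reports name, messages_ready, and consumers for each queue. The sketch below assumes that list has already been fetched and decoded. The queue names and the backlog limit are invented for illustration.

```python
def queues_needing_attention(queues: list[dict], max_ready: int) -> dict[str, list[str]]:
    """Return the reasons each problematic queue was flagged, keyed by queue name."""
    problems: dict[str, list[str]] = {}
    for queue in queues:
        reasons = []
        if queue.get("consumers", 0) == 0:
            reasons.append("no consumers")
        if queue.get("messages_ready", 0) > max_ready:
            reasons.append("backlog")
        if reasons:
            problems[queue["name"]] = reasons
    return problems


# Hypothetical queue names; real Appmixer queue names may differ.
sample_queues = [
    {"name": "queue-a", "messages_ready": 12, "consumers": 4},
    {"name": "queue-b", "messages_ready": 5000, "consumers": 2},
    {"name": "queue-c", "messages_ready": 0, "consumers": 0},
]
flagged = queues_needing_attention(sample_queues, max_ready=2000)
```

What counts as "normal capacity" depends on the deployment, so max_ready is left as a required parameter rather than given a default.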

Elasticsearch Monitoring

Elasticsearch stores Appmixer logs and operational data and makes them searchable.

Key Metrics

Cluster Health

  • Cluster status (green/yellow/red)

  • Number of nodes

Resource Usage

  • JVM heap usage

  • JVM garbage collection time

  • CPU usage per node

  • Disk I/O per node

Storage

  • Total disk space used

  • Index size growth rate

Recommended Tools

  • Kibana Monitoring

  • Elasticsearch Monitoring API

  • Prometheus with Elasticsearch exporter

  • Cloud-native monitoring (if using managed Elasticsearch)

Alert Thresholds (Examples)

  • Cluster status = red or yellow for > 5 minutes

  • JVM heap usage > 85%

  • Disk usage > 85%
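
The heap threshold can be checked per node from the Elasticsearch nodes stats API (GET _nodes/stats/jvm), which reports jvm.mem.heap_used_percent for every node. The following sketch works on an already decoded response. The function name and sample values are illustrative.

```python
def nodes_over_heap(node_stats: dict, limit_pct: int = 85) -> list[str]:
    """Return the names of nodes whose JVM heap usage exceeds `limit_pct`."""
    hot = []
    for node in node_stats.get("nodes", {}).values():
        used = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if used is not None and used > limit_pct:
            hot.append(node.get("name", "<unnamed>"))
    return sorted(hot)


# Hand-written sample in the shape of a nodes stats response.
sample_stats = {
    "nodes": {
        "id-1": {"name": "es-0", "jvm": {"mem": {"heap_used_percent": 62}}},
        "id-2": {"name": "es-1", "jvm": {"mem": {"heap_used_percent": 91}}},
    }
}
hot_nodes = nodes_over_heap(sample_stats)
```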

Logstash Monitoring

Logstash processes and transforms log data before sending it to Elasticsearch.

Key Metrics

Pipeline Performance

  • Events received/filtered/sent

Recommended Tools

  • Logstash Monitoring API

  • Kibana Monitoring

  • Prometheus with Logstash exporter

Alert Thresholds (Examples)

  • JVM heap usage > 85%

  • Pipeline events duration increasing

  • Dead letter queue growing

  • Plugin errors increasing
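
"Pipeline events duration increasing" can be quantified from two snapshots of the Logstash node stats API (GET /_node/stats/pipelines), whose per-pipeline events object includes cumulative out and duration_in_millis counters. The helper below is a sketch under that assumption. The function name and the pipeline name "main" are illustrative defaults.

```python
from typing import Optional


def avg_ms_per_event(prev: dict, curr: dict, pipeline: str = "main") -> Optional[float]:
    """Average processing time per emitted event between two stats snapshots."""
    before = prev["pipelines"][pipeline]["events"]
    after = curr["pipelines"][pipeline]["events"]
    processed = after["out"] - before["out"]
    if processed <= 0:
        return None  # idle pipeline, or counters reset by a restart
    return (after["duration_in_millis"] - before["duration_in_millis"]) / processed


snapshot_1 = {"pipelines": {"main": {"events": {"out": 500, "duration_in_millis": 1000}}}}
snapshot_2 = {"pipelines": {"main": {"events": {"out": 1500, "duration_in_millis": 4000}}}}
avg_ms = avg_ms_per_event(snapshot_1, snapshot_2)
```

Tracking this value over time, rather than the raw cumulative counters, shows whether the pipeline is slowing down.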

OpenShift Platform Monitoring

When Appmixer runs on OpenShift, leverage OpenShift's built-in monitoring capabilities:

Key Aspects

Pod Health

  • Pod status (Running/Failed/Pending)

  • Restart counts

  • Container resource usage vs limits

Resource Quotas

  • Namespace CPU/memory usage

  • Storage usage

  • Pod count vs limits

Network

  • Service availability

  • Ingress/route response times

  • Network policy effectiveness

Persistent Volumes

  • Volume usage

  • Volume performance metrics

  • Volume mount issues

Tools

  • OpenShift Monitoring (Prometheus-based)

  • OpenShift Web Console

  • oc CLI monitoring commands
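
Pod status and restart counts can also be checked programmatically from the JSON produced by `oc get pods -o json`, which lists each pod's status.phase and per-container restartCount. The sketch below assumes that output has already been decoded. The restart limit of 3 is an arbitrary example value, not a recommendation from this guide.

```python
def unhealthy_pods(pod_list: dict, max_restarts: int = 3) -> dict[str, str]:
    """Flag pods that are Failed/Pending or have restarted too often."""
    findings = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        restarts = sum(c.get("restartCount", 0) for c in status.get("containerStatuses", []))
        if phase in ("Failed", "Pending"):
            findings[name] = f"phase={phase}"
        elif restarts > max_restarts:
            findings[name] = f"restarts={restarts}"
    return findings


# Hand-written sample with hypothetical pod names.
sample_pods = {
    "items": [
        {"metadata": {"name": "engine-0"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 0}]}},
        {"metadata": {"name": "engine-1"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 7}]}},
        {"metadata": {"name": "engine-2"}, "status": {"phase": "Pending"}},
    ]
}
report = unhealthy_pods(sample_pods)
```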

Recommended Monitoring Stack

Option 1: Prometheus + Grafana (Open Source)

  • Prometheus for metrics collection

  • Grafana for visualization and dashboards

  • Alertmanager for alert routing

  • Exporters for each infrastructure component

  • Custom exporters for Appmixer application metrics

Option 2: Commercial APM Solutions

  • New Relic

  • Datadog

  • Dynatrace

  • AppDynamics

Option 3: Hybrid Approach

  • Use OpenShift's built-in Prometheus for infrastructure

  • Add custom Grafana dashboards

  • Integrate with existing enterprise monitoring tools

Alerting Strategy

Alert Severity Levels

Critical - Immediate action required, service degradation or outage

  • Production outage

  • Data loss risk

  • Security breach

Warning - Attention needed, potential issues developing

  • Resource usage approaching limits

  • Performance degradation

  • Increased error rates

Info - Informational, no immediate action needed

  • Deployment notifications

  • Configuration changes

  • Capacity planning indicators

Alert Best Practices

  • Define clear runbooks for each alert

  • Avoid alert fatigue by tuning thresholds

  • Use alert aggregation to reduce noise

  • Implement escalation policies

  • Test alerting channels regularly

Logging Strategy

Log Aggregation

  • Set appropriate log retention policies based on compliance requirements and storage capacity

Log Levels

Use appropriate log levels:

  • ERROR - Application errors requiring attention

  • WARN - Warning conditions

  • INFO - Informational messages

  • DEBUG - Detailed diagnostic information (non-production)

Key Logs to Monitor

  • Application startup/shutdown events

  • Authentication and authorization failures

  • API request/response logs (with sampling)

  • Integration connector errors

  • Database connection issues

  • Message queue processing errors

Performance Tuning

Based on monitoring data, consider these tuning areas:

Node.js Application

  • Adjust worker thread pool size

  • Optimize memory limits and heap size

  • Enable clustering for horizontal scaling

  • Review and optimize slow database queries

Kubernetes Resources

  • Right-size CPU and memory requests/limits

  • Configure horizontal pod autoscaling (HPA)

  • Implement pod disruption budgets

  • Optimize persistent volume performance class

Capacity Planning

Regular capacity planning should review:

  • User/tenant growth rate

  • Flow execution volume trends

  • Data storage growth

Resource Utilization

  • Average and peak CPU/memory usage

  • Database storage growth

  • Network bandwidth utilization

Performance Baselines

  • Establish performance baselines

  • Track deviation from baselines

  • Plan scaling activities before limits are reached
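
Planning "before limits are reached" can start from a simple linear projection. The sketch below estimates how many days remain until storage crosses a utilization threshold, given two usage measurements. It assumes constant growth, which is a simplification, and the function name and sample figures are illustrative.

```python
from typing import Optional


def days_until_threshold(
    used_then: float,
    used_now: float,
    days_between: float,
    capacity: float,
    threshold_pct: int = 80,
) -> Optional[float]:
    """Days until usage reaches `threshold_pct` of capacity, assuming linear growth."""
    limit = capacity * threshold_pct / 100
    if used_now >= limit:
        return 0.0  # already at or past the threshold
    growth_per_day = (used_now - used_then) / days_between
    if growth_per_day <= 0:
        return None  # flat or shrinking usage never reaches the limit
    return (limit - used_now) / growth_per_day


# 400 GB a month ago, 460 GB today, on a 1000 GB volume.
remaining_days = days_until_threshold(400, 460, 30, 1000)
```

Here usage grows by 2 GB per day, so the 80% mark (800 GB) is roughly 170 days away.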

Compliance and Security Monitoring

  • Monitor access logs for suspicious activity

  • Track failed authentication attempts

  • Review audit logs for compliance requirements

  • Track SSL certificate expiration dates
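
Certificate expiry tracking reduces to a date comparison. The sketch below uses Python's standard ssl.cert_time_to_seconds to convert a certificate's notAfter string (the format returned by SSLSocket.getpeercert()) into days remaining. The function name is hypothetical, and fetching the certificate itself is left out.

```python
import ssl


def days_until_expiry(not_after: str, now_epoch_s: float) -> float:
    """Days left before a certificate expires.

    `not_after` uses the certificate date format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch_s) / 86400.0


# Illustrative values: pretend "now" is exactly ten days before expiry.
expiry = "Jan  5 09:34:43 2018 GMT"
now = ssl.cert_time_to_seconds(expiry) - 10 * 86400
remaining = days_until_expiry(expiry, now)
```

A negative result means the certificate has already expired. In practice, time.time() would be passed as now_epoch_s.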
