Monitoring and Observability
Monitor EmailEngine health, performance, and activity with built-in health check endpoints, Prometheus metrics, and integrations with popular observability platforms.
Overview
EmailEngine provides comprehensive monitoring capabilities:
- Health Check Endpoints - Simple HTTP endpoints for uptime monitoring
- Prometheus Metrics - Detailed metrics for Prometheus/Grafana stack
- Performance Indicators - Track message processing, connections, and queue health
- Custom Alerting - Set up alerts based on key metrics
- Bull Board Dashboard - Visual queue monitoring (see Webhooks Guide)
Health Check Endpoints
Basic Health Check
Use the /health endpoint to verify EmailEngine is running:
curl http://localhost:3000/health
Response when healthy:
{
  "success": true
}
The health check verifies:
- All IMAP workers are available
- Redis database is accessible and responding
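For container deployments, the same endpoint can back a container-level health check. A minimal sketch, assuming the postalsys/emailengine image and an externally running Redis (image tag, network names, and the Redis connection string are illustrative):
# Run EmailEngine with a Docker health check that polls /health every 30 seconds
# (assumes curl is available inside the image; swap for wget -qO- if it is not)
docker run -d --name emailengine \
  -e EENGINE_REDIS="redis://redis:6379/0" \
  -p 3000:3000 \
  --health-cmd 'curl -fsS http://localhost:3000/health || exit 1' \
  --health-interval 30s --health-timeout 5s --health-retries 3 \
  postalsys/emailengine:v2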
Detailed Status Check
Get more detailed status information:
curl http://localhost:3000/v1/stats \
-H "Authorization: Bearer YOUR_TOKEN"
Response includes:
{
  "version": "2.58.0",
"license": "MIT",
"accounts": 15,
"node": "24.0.0",
"redis": "7.2.4",
"counters": {
"events:messageNew": 1523,
"webhooks:messageNew": 1450,
"apiReq:GET /v1/stats": 234
},
"queues": {
"notify": {
"active": 2,
"delayed": 0,
"waiting": 0,
"paused": 0,
"isPaused": false,
"total": 2
},
"submit": {
"active": 1,
"delayed": 0,
"waiting": 5,
"paused": 0,
"isPaused": false,
"total": 6
},
"documents": {
"active": 0,
"delayed": 0,
"waiting": 0,
"paused": 0,
"isPaused": false,
"total": 0
}
},
"connections": {
"connected": 14,
"connecting": 1
}
}
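A quick way to keep an eye on these numbers from a terminal (the token value is a placeholder):
# Poll connection and queue counts from the stats endpoint every 30 seconds
watch -n 30 \
  "curl -s -H 'Authorization: Bearer YOUR_TOKEN' http://localhost:3000/v1/stats | jq '{connections, queues}'"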
Prometheus Metrics
Setting Up Prometheus
EmailEngine exposes Prometheus metrics at the /metrics endpoint.
Step 1: Create Metrics Token
- Navigate to Settings → Access Tokens in EmailEngine UI
- Click Create new
- Uncheck All scopes
- Check only Metrics scope
- Create token and save it
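Before wiring the token into Prometheus, you can verify it manually (the token value is a placeholder):
# The metrics endpoint should return plain-text Prometheus metrics
curl -s -H "Authorization: Bearer YOUR_METRICS_TOKEN" \
  http://localhost:3000/metrics | head -n 20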
Step 2: Configure Prometheus
Add EmailEngine as a scraping target in prometheus.yml:
scrape_configs:
  - job_name: 'emailengine'
    scrape_interval: 10s
    metrics_path: '/metrics'
    scheme: 'http'
    authorization:
      type: Bearer
      credentials: 795f623527c16d617b106... # Your metrics token
    static_configs:
      - targets: ['127.0.0.1:3000']
For multiple EmailEngine instances:
scrape_configs:
  - job_name: 'emailengine'
    scrape_interval: 10s
    metrics_path: '/metrics'
    scheme: 'https'
    authorization:
      type: Bearer
      credentials: YOUR_METRICS_TOKEN
    static_configs:
      - targets:
          - 'ee-prod-01.example.com:3000'
          - 'ee-prod-02.example.com:3000'
          - 'ee-prod-03.example.com:3000'
        labels:
          environment: 'production'
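Prometheus only loads the file if it parses, so it is worth validating before restarting (the path is an assumption; adjust to your installation):
# Validate the Prometheus configuration before restarting the service
promtool check config /etc/prometheus/prometheus.yml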
Step 3: Restart Prometheus
# SystemD
sudo systemctl restart prometheus
# Docker
docker restart prometheus
Step 4: Verify
Check Prometheus targets page:
http://localhost:9090/targets
EmailEngine should appear with status UP.
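You can also confirm the scrape status from the command line instead of the UI:
# Query the Prometheus API for the health of the emailengine scrape targets
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "emailengine") | {instance: .labels.instance, health: .health}'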
Available Metrics
EmailEngine exposes these Prometheus metrics:
Connection Metrics
# IMAP connections by status
imap_connections{status="connected"}
imap_connections{status="connecting"}
imap_connections{status="authenticationError"}
imap_connections{status="connectError"}
imap_connections{status="syncing"}
imap_connections{status="disconnected"}
# IMAP responses
imap_responses{response="OK"}
imap_responses{response="OK",code="CAPABILITY"}
# IMAP traffic
imap_bytes_sent
imap_bytes_received
Webhook and Event Metrics
# Webhooks sent by event and status
webhooks{event="messageNew",status="success"}
webhooks{event="messageUpdated",status="success"}
webhooks{event="messageDeleted",status="success"}
# Events fired
events{event="messageNew"}
events{event="messageUpdated"}
# Webhook request duration (buckets in milliseconds)
webhook_req_bucket{le="100"}
webhook_req_bucket{le="1000"}
webhook_req_bucket{le="10000"}
webhook_req_sum
webhook_req_count
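As a sanity check that the histogram is being scraped, the _sum and _count series can be combined into an average duration. A sketch using the Prometheus HTTP API (adjust the host to your Prometheus instance):
# Average webhook duration in milliseconds over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(webhook_req_sum[5m]) / rate(webhook_req_count[5m])' | \
  jq '.data.result'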
Queue Metrics
# Queue sizes by state
queue_size{queue="notify",state="waiting"}
queue_size{queue="submit",state="active"}
queue_size{queue="documents",state="delayed"}
# Processed job counts
queues_processed{queue="notify"}
queues_processed{queue="submit"}
API Metrics
# API calls by method and status
api_call{method="post",route="/v1/account/:account/submit",statusCode="200"}
api_call{method="get",route="/v1/account/:account",statusCode="200"}
System Metrics
# Worker threads
threads{type="imap"}
threads{type="webhooks"}
# Configuration
emailengine_config{version="v2.58.0"}
emailengine_config{config="workersImap"}
Note: Memory usage, CPU usage, and uptime metrics are available through standard Node.js metrics exporters if needed.
Grafana Dashboard
Setting Up Grafana
Create a comprehensive EmailEngine dashboard in Grafana.
Add Prometheus Data Source
- Go to Configuration → Data Sources
- Add Prometheus
- URL: http://localhost:9090
- Save & Test
Create Dashboard
Create the following panels manually (the same queries can also be assembled into a dashboard JSON for import):
Panel 1: IMAP Connection Status
# Query
sum by (status) (imap_connections)
# Visualization: Pie Chart
# Legend: {{status}}
Panel 2: Webhook Events Rate
# Query
rate(webhooks[5m]) * 60
# Visualization: Graph
# Label: Webhooks per minute
Panel 3: Webhook Success vs Failure
# Queries (use multiple series)
sum(rate(webhooks{status="success"}[5m])) * 60
sum(rate(webhooks{status="failure"}[5m])) * 60
# Visualization: Graph
# Labels: Success, Failure
Panel 4: Queue Health
# Queries (multiple series)
queue_size{queue="notify",state="waiting"}
queue_size{queue="submit",state="waiting"}
# Visualization: Stat or Graph
# Alert if > 100
Panel 5: IMAP Connections by Status
# Query
imap_connections{status="connected"}
imap_connections{status="authenticationError"}
imap_connections{status="connectError"}
# Visualization: Graph (stacked)
# Legend: {{status}}
Panel 6: Webhook Response Time
# Query (99th percentile, result in milliseconds)
histogram_quantile(0.99, rate(webhook_req_bucket[5m]))
# Visualization: Graph
# Label: 99th percentile webhook duration (ms)
Dashboard Variables
Add variables for filtering:
$environment = label_values(imap_connections, environment)
$instance = label_values(imap_connections, instance)
Key Metrics to Monitor
Critical Metrics
Monitor these metrics closely in production:
1. Account Connection Health
# Connected accounts
imap_connections{status="connected"}
# Disconnected or errored accounts
imap_connections{status="authenticationError"}
imap_connections{status="connectError"}
imap_connections{status="disconnected"}
# Alert if too many disconnected
2. Webhook Queue Size
# Alert if queue is backing up
queue_size{queue="notify",state="waiting"} > 100
3. Webhook Failure Rate
# Alert if failure rate > 5%
(sum(rate(webhooks{status="failure"}[5m])) /
sum(rate(webhooks[5m]))) * 100 > 5
4. Webhook Processing Time
# Alert if webhooks are processing slowly (99th percentile > 5 seconds)
# Note: webhook_req buckets are in milliseconds
histogram_quantile(0.99, rate(webhook_req_bucket[5m])) > 5000
5. Queue Processing Rate
# Monitor queue processing rate
rate(queues_processed{queue="notify"}[5m])
rate(queues_processed{queue="submit"}[5m])
Performance Indicators
Track these for performance optimization:
# Webhook events per minute
rate(webhooks[5m]) * 60
# Webhook processing time (median, in milliseconds)
histogram_quantile(0.5, rate(webhook_req_bucket[5m]))
# Queue throughput
rate(queues_processed[5m])
# Active queue jobs
queue_size{state="active"}
# API call rate by endpoint
rate(api_call[5m])
Alerting Setup
Prometheus Alertmanager
Configure alerts in prometheus_rules.yml:
groups:
  - name: emailengine
    interval: 30s
    rules:
      # IMAP connection errors
      - alert: EmailEngineConnectionErrors
        expr: |
          imap_connections{status=~"authenticationError|connectError"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple IMAP connection errors"
          description: "{{ $value }} accounts with {{ $labels.status }}"

      # Webhook queue backing up
      - alert: EmailEngineWebhookQueueHigh
        expr: queue_size{queue="notify",state="waiting"} > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Webhook queue is backing up"
          description: "{{ $value }} webhooks waiting in queue"

      # High webhook failure rate
      - alert: EmailEngineWebhookFailureRate
        expr: |
          (sum(rate(webhooks{status="failure"}[5m])) /
           sum(rate(webhooks[5m]))) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High webhook failure rate"
          description: "{{ $value | humanizePercentage }} webhooks failing"

      # EmailEngine down
      - alert: EmailEngineDown
        expr: up{job="emailengine"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "EmailEngine is down"
          description: "EmailEngine on {{ $labels.instance }} is down"
      # Slow webhook processing (buckets are in milliseconds)
      - alert: EmailEngineSlowWebhooks
        expr: |
          histogram_quantile(0.99, rate(webhook_req_bucket[5m])) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Webhooks processing slowly"
          description: "99th percentile webhook duration: {{ $value }}ms"

      # Queue not processing
      - alert: EmailEngineQueueStalled
        expr: |
          rate(queues_processed[5m]) == 0
            and on(queue) queue_size{state="waiting"} > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Queue processing stalled"
          description: "Queue {{ $labels.queue }} has jobs but no processing"
Alertmanager Configuration
Configure notification channels in alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts'
        auth_password: 'secret'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#emailengine-alerts'
        title: 'EmailEngine Alert'
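If you have amtool installed, you can validate the configuration and fire a synthetic alert to confirm routing works (file path, alert name, and labels are illustrative):
# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml

# Send a test alert through Alertmanager
amtool alert add EmailEngineTestAlert severity=warning instance=test \
  --annotation=summary="Test alert from amtool" \
  --alertmanager.url=http://localhost:9093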
Integration with Observability Platforms
Datadog
Monitor EmailEngine with Datadog. EmailEngine runs as a standalone service, so this kind of instrumentation normally lives in a small companion process that polls the /v1/stats API (or the Prometheus endpoint) and forwards values to the Datadog Agent. A minimal Node.js sketch, assuming the dd-trace and hot-shots npm packages and an agent reachable at datadog-agent:8125:
// Initialize the Datadog tracer for the companion service
const tracer = require('dd-trace').init({
  service: 'emailengine',
  env: process.env.NODE_ENV,
  version: process.env.APP_VERSION
});

// Initialize a StatsD client pointed at the Datadog Agent
const StatsD = require('hot-shots');
const statsd = new StatsD({
  host: 'datadog-agent',
  port: 8125,
  prefix: 'emailengine.'
});

// Track custom metrics (queueSize and durationMs are placeholders for values
// extracted from the /v1/stats response)
statsd.increment('accounts.connected');
statsd.gauge('queue.size', queueSize);
statsd.timing('webhook.duration', durationMs);
New Relic
Monitor with New Relic APM:
# Install agent
npm install newrelic
# Configure newrelic.js
# Start with agent
node -r newrelic server.js
Elastic APM
Monitor with Elasticsearch APM by initializing the APM agent with your language's elastic-apm library, providing service name, server URL, and environment configuration.
Bull Board Dashboard
EmailEngine uses Bull queues. Monitor them visually with Bull Board.
Bull Board is always enabled and available at:
http://localhost:3000/admin/bull-board
You can also access it from the dashboard sidebar under Tools → Bull Board.
See detailed queue monitoring in Webhooks Guide - Debugging Section.
Log-Based Monitoring
Structured Logging
EmailEngine logs in JSON format (Pino). Parse logs for monitoring:
# Count errors per hour
cat emailengine.log | \
grep '"level":50' | \
jq -r '.time' | \
cut -c1-13 | \
uniq -c
# Track webhook failures
cat emailengine.log | \
grep 'webhook.*failed' | \
jq -r '{time: .time, account: .account, error: .err.message}'
ELK Stack Integration
Ship logs to Elasticsearch:
Filebeat configuration:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/emailengine/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      service: emailengine
      environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "emailengine-%{+yyyy.MM.dd}"
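Filebeat ships with built-in checks you can run before starting the service (paths are illustrative):
# Verify the configuration parses and the Elasticsearch output is reachable
filebeat test config -c /etc/filebeat/filebeat.yml
filebeat test output -c /etc/filebeat/filebeat.yml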
Kibana Dashboard Queries:
# Error rate over time
level:50 AND service:emailengine
# Webhook failures
msg:"webhook failed" AND service:emailengine
# Account connection issues
msg:"connection error" AND service:emailengine
Grafana Loki
Ship logs to Loki with Promtail:
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: emailengine
    static_configs:
      - targets:
          - localhost
        labels:
          job: emailengine
          __path__: /var/log/emailengine/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: msg
            account: account
      - labels:
          level:
          account:
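Promtail can be started against this configuration in dry-run mode, which applies the pipeline and prints the resulting entries without pushing anything to Loki (the configuration path is illustrative):
# Dry-run: read the logs, apply the pipeline, print entries to stdout
promtail --config.file=/etc/promtail/promtail.yml --dry-run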
Monitoring Best Practices
1. Set Appropriate Thresholds
Don't alert on noise:
# Bad - too sensitive
queue_size{state="waiting"} > 0
# Good - meaningful threshold, held for a period ("for: 5m" in the alert rule)
queue_size{state="waiting"} > 100
2. Monitor Trends, Not Just Absolutes
# Track rate of change for webhooks
rate(webhooks{status="failure"}[30m])
3. Create Composite Alerts
# Alert only if multiple conditions are met
(queue_size{queue="notify",state="waiting"} > 100)
  and on()
(sum(rate(webhooks{status="failure"}[5m])) > 0.1)
4. Use Alert Grouping
Group related alerts to avoid alarm fatigue:
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
5. Document Runbooks
Include runbook links in alert annotations:
annotations:
  summary: "Webhook queue backing up"
  description: "{{ $value }} webhooks waiting"
  runbook: "https://wiki.example.com/emailengine/webhook-queue-backup"
Health Check Scripts
Simple Uptime Check
#!/bin/bash
# check-emailengine-health.sh
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/health)
if [ "$RESPONSE" != "200" ]; then
echo "EmailEngine health check failed: HTTP $RESPONSE"
exit 1
fi
echo "EmailEngine is healthy"
exit 0
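The script can then run from cron or any uptime monitor; a minimal sketch (schedule, script path, and mail command are illustrative):
# Crontab entry: run the check every minute and e-mail the ops alias on failure
* * * * * /usr/local/bin/check-emailengine-health.sh || echo "EmailEngine health check failed" | mail -s "EmailEngine alert" ops@example.com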
Comprehensive Check
#!/bin/bash
# comprehensive-health-check.sh
TOKEN="$1"
HOST="${2:-localhost:3000}"
# Check basic health endpoint (no auth required)
HEALTH=$(curl -s http://$HOST/health)
SUCCESS=$(echo $HEALTH | jq -r '.success')
if [ "$SUCCESS" != "true" ]; then
echo "CRITICAL: EmailEngine health check failed"
exit 2
fi
# Get detailed stats (requires token)
STATS=$(curl -s -H "Authorization: Bearer $TOKEN" \
"http://$HOST/v1/stats")
# Check connected accounts
CONNECTED=$(echo $STATS | jq -r '.connections.connected // 0')
TOTAL=$(echo $STATS | jq -r '.accounts')
if [ "$TOTAL" -gt 0 ]; then
PERCENT=$(echo "scale=2; $CONNECTED * 100 / $TOTAL" | bc)
if (( $(echo "$PERCENT < 95" | bc -l) )); then
echo "WARNING: Only $PERCENT% accounts connected ($CONNECTED/$TOTAL)"
exit 1
fi
fi
echo "OK: EmailEngine healthy - $CONNECTED/$TOTAL accounts connected"
exit 0
Nagios/Icinga Plugin
#!/bin/bash
# check_emailengine
TOKEN="$1"
HOST="${2:-localhost:3000}"
STATS=$(curl -s -H "Authorization: Bearer $TOKEN" \
"http://$HOST/v1/stats")
# Check webhook queue (notify queue handles webhooks)
QUEUE_WAITING=$(echo $STATS | jq -r '.queues.notify.waiting // 0')
QUEUE_TOTAL=$(echo $STATS | jq -r '.queues.notify.total // 0')
if [ "$QUEUE_WAITING" -gt 100 ]; then
echo "CRITICAL: Webhook queue size $QUEUE_WAITING | queue=$QUEUE_WAITING"
exit 2
fi
# Check queue status
QUEUE_PAUSED=$(echo $STATS | jq -r '.queues.notify.isPaused')
if [ "$QUEUE_PAUSED" = "true" ]; then
echo "WARNING: Webhook queue is paused | paused=1"
exit 1
fi
echo "OK: EmailEngine operational | queue_waiting=$QUEUE_WAITING queue_total=$QUEUE_TOTAL"
exit 0