Health Monitoring

Portoser tracks service health in two layers — per-service health checks declared in registry.yml, and continuous monitoring by the web backend that aggregates them.

Per-service health checks

Every service declares one health check. Four types are supported:

HTTP

health_check:
  type: http
  url: http://localhost:8080/healthz
  expected_status: 200       # default
  timeout: 5                 # seconds
  interval: 10               # seconds between checks

TCP

health_check:
  type: tcp
  host: localhost
  port: 5432
  timeout: 3
  interval: 10

Process

health_check:
  type: process
  pid_file: ~/.portoser/run/worker.pid
  interval: 10

The service is considered healthy if the PID file exists and the process with that PID is alive.
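
A minimal sketch of that logic, purely illustrative (this is not portoser's actual implementation):

import os
from pathlib import Path

# Illustrative only -- "alive" here means signal 0 can be delivered to the PID.
def process_is_alive(pid_file: str) -> bool:
    path = Path(pid_file).expanduser()
    if not path.exists():
        return False
    try:
        pid = int(path.read_text().strip())
    except ValueError:
        return False           # PID file exists but its contents are garbled
    try:
        os.kill(pid, 0)        # signal 0 probes the PID without sending anything
    except ProcessLookupError:
        return False           # no such process
    except PermissionError:
        return True            # process exists but is owned by another user
    return True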

Exec

health_check:
  type: exec
  command: pg_isready -h localhost
  expected_exit: 0
  timeout: 5
  interval: 30

Runs an arbitrary command on the host and compares its exit code against expected_exit. Use sparingly; exec checks are the most expensive type.
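
The equivalent logic looks roughly like this (a sketch assuming the command runs through the shell with the configured timeout, not the real runner):

import subprocess

# Illustrative only; portoser's exec runner may differ.
def exec_check(command: str, expected_exit: int = 0, timeout: int = 5) -> bool:
    try:
        result = subprocess.run(command, shell=True, timeout=timeout,
                                capture_output=True)
    except subprocess.TimeoutExpired:
        return False                       # a hung command counts as a failure
    return result.returncode == expected_exit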

Aggregated health

The web backend aggregates per-service checks into:

  • Service health — green / yellow / red per service
  • Machine health — derived from the services running on each machine
  • Cluster health — derived from machine health

The aggregation lives at web/backend/routers/health.py. The web UI's Health Dashboard (/health route) is the front door.
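
The exact rules live in that router; a plausible worst-of roll-up looks something like this (the dictionary shapes and field names here are assumptions for illustration):

# Hypothetical roll-up: an aggregate's status is the worst status it contains.
ORDER = {"green": 0, "yellow": 1, "red": 2}

def worst(statuses):
    return max(statuses, key=lambda s: ORDER.get(s, 2), default="green")

def machine_health(services_on_machine):
    # services_on_machine: {"api": "green", "worker": "yellow", ...}
    return worst(services_on_machine.values())

def cluster_health(machines):
    # machines: {"m1": {"api": "green"}, "m2": {"db": "red"}, ...}
    return worst(machine_health(svcs) for svcs in machines.values())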

Real-time stream

Two WebSocket endpoints stream live signals:

Endpoint           What it streams
/api/ws/metrics    Per-device CPU, memory, disk samples; subscription-based
/ws                Deployment events: deployment_started, deployment_log, deployment_completed, deployment_failed

Subscribe to the metrics stream from your own client by sending {"type": "subscribe", "device_id": "<id>"} after connecting.
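
For example, a minimal Python client using the websockets package (the base URL, port, and device ID are placeholders; adjust to your deployment):

import asyncio
import json

import websockets  # pip install websockets

async def stream_metrics(device_id: str):
    # Point this at wherever the web backend is listening.
    async with websockets.connect("ws://localhost:8000/api/ws/metrics") as ws:
        await ws.send(json.dumps({"type": "subscribe", "device_id": device_id}))
        async for message in ws:
            sample = json.loads(message)
            print(sample)      # per-device CPU / memory / disk samples

asyncio.run(stream_metrics("<id>"))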

CLI

portoser health                    # all services across all hosts
portoser health <service>          # one service
portoser health --watch            # live, refreshes in place
portoser health --json-output      # machine-parsable

What "healthy" actually means

A service is healthy when:

  1. Its health check has passed at least once
  2. The most recent check passed within the configured interval × 2

A service is degraded when checks have started failing but the failure budget hasn't been exhausted, and unhealthy once failures exceed the budget.

The budget is configurable per service via health_check.failure_threshold (default 3).
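
Putting those rules together, the classification is roughly the following (a sketch; the backend's real logic may differ in edge cases such as stale results):

import time

# Sketch of the healthy / degraded / unhealthy rules described above.
def classify(last_success: float | None, consecutive_failures: int,
             interval: int, failure_threshold: int = 3) -> str:
    if consecutive_failures > failure_threshold:
        return "unhealthy"                     # failure budget exhausted
    recently_passed = (last_success is not None
                       and time.time() - last_success <= interval * 2)
    if recently_passed and consecutive_failures == 0:
        return "healthy"                       # passed at least once, and recently
    return "degraded"                          # failing (or stale) but within budget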

What health monitoring does NOT do

  • It does not auto-restart services — that's the job of the self-healing loop, which runs at deploy time.
  • It does not page on-call. There's no alerting backend baked in. If you need pages, point a Prometheus or Healthchecks.io scrape at /api/health/dashboard and configure your own alerts.
  • It does not store long-term metrics history. Metrics are kept in memory + Redis cache; for retention beyond a few hours, run a real TSDB (Prometheus, VictoriaMetrics).

Wiring external monitoring

To export to Prometheus, point a scrape at:

GET /api/metrics?format=prometheus

For uptime checks, the simplest setup is an external scraper (UptimeRobot, Healthchecks.io, your own cron) hitting /api/health/dashboard and alerting when overall status drops.
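
As a concrete starting point, a cron-friendly checker might look like this (the response shape and the overall_status field are assumptions; adapt to what /api/health/dashboard actually returns):

import json
import sys
import urllib.request

URL = "http://localhost:8000/api/health/dashboard"   # adjust to your deployment

def main() -> int:
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            data = json.load(resp)
    except Exception as exc:
        print(f"dashboard unreachable: {exc}", file=sys.stderr)
        return 2
    # Assumed field: an overall status string such as "green" / "yellow" / "red".
    status = data.get("overall_status", "unknown")
    print(f"overall status: {status}")
    return 0 if status == "green" else 1   # non-zero exit -> your cron alerts

if __name__ == "__main__":
    sys.exit(main())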