# Health Monitoring
Portoser tracks service health in two layers: per-service health checks declared in `registry.yml`, and continuous monitoring by the web backend that aggregates them.
## Per-service health checks
Every service declares one health check. Four types are supported:
### HTTP

```yaml
health_check:
  type: http
  url: http://localhost:8080/healthz
  expected_status: 200  # default
  timeout: 5            # seconds
  interval: 10          # seconds between checks
```
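For intuition, a check of this shape reduces to roughly the following. This is a minimal sketch, not Portoser's actual code; the parameter names simply mirror the config fields above.

```python
import requests

def http_check(url: str, expected_status: int = 200, timeout: int = 5) -> bool:
    """Pass when the endpoint answers with the expected status within the timeout."""
    try:
        resp = requests.get(url, timeout=timeout)
        return resp.status_code == expected_status
    except requests.RequestException:
        # Connection refused, DNS failure, timeout, etc. all count as a failed check.
        return False
```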
### TCP

```yaml
health_check:
  type: tcp
  host: localhost
  port: 5432
  timeout: 3
  interval: 10
```
### Process

```yaml
health_check:
  type: process
  pid_file: ~/.portoser/run/worker.pid
  interval: 10
```
The process is healthy if the PID file exists and the PID is alive.
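That semantics maps onto the classic signal-0 probe. A sketch of the idea (again, not Portoser's actual implementation):

```python
import os
from pathlib import Path

def process_check(pid_file: str) -> bool:
    """Pass when the PID file exists and names a live process."""
    try:
        pid = int(Path(pid_file).expanduser().read_text().strip())
        os.kill(pid, 0)  # signal 0 probes for existence without delivering anything
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False     # no file, garbage contents, or no such process
    except PermissionError:
        pass             # process exists but is owned by another user: still alive
    return True
```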
### Exec

```yaml
health_check:
  type: exec
  command: pg_isready -h localhost
  expected_exit: 0
  timeout: 5
  interval: 30
```
Runs an arbitrary command on the host. Use sparingly — exec checks are the most expensive.
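An exec check spawns a subprocess and compares its exit code. Roughly, and assuming shell-style splitting of `command` (an assumption, not confirmed by the docs):

```python
import shlex
import subprocess

def exec_check(command: str, expected_exit: int = 0, timeout: int = 5) -> bool:
    """Pass when the command exits with the expected code before the timeout."""
    try:
        result = subprocess.run(shlex.split(command), capture_output=True, timeout=timeout)
        return result.returncode == expected_exit
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```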
## Aggregated health
The web backend aggregates per-service checks into:
- Service health — green / yellow / red per service
- Machine health — derived from the services running on each machine
- Cluster health — derived from machine health
The aggregation lives in `web/backend/routers/health.py`. The web UI's Health Dashboard (the `/health` route) is the front door.
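The exact rollup rules live in that router. A plausible reading of "derived from" is a worst-status-wins rollup, sketched below; the `Health` enum and `rollup` function are illustrative assumptions, not Portoser's actual types.

```python
from enum import IntEnum

class Health(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2

def rollup(statuses: list[Health]) -> Health:
    """Worst status wins: one red service turns its machine red, and so on up."""
    return max(statuses, default=Health.GREEN)

# A machine running two green services and one yellow one is yellow;
# a cluster containing that machine is then at least yellow.
machine = rollup([Health.GREEN, Health.YELLOW, Health.GREEN])
assert machine is Health.YELLOW
```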
## Real-time stream
Two WebSocket endpoints stream live signals:
| Endpoint | What it streams |
|---|---|
| `/api/ws/metrics` | Per-device CPU, memory, and disk samples; subscription-based |
| `/ws` | Deployment events: `deployment_started`, `deployment_log`, `deployment_completed`, `deployment_failed` |
Subscribe to the metrics stream from your own client by sending `{"type": "subscribe", "device_id": "<id>"}` after connecting.
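A minimal client using the third-party `websockets` package. The host/port and the shape of incoming messages are assumptions; only the subscribe message above comes from the docs.

```python
import asyncio
import json

import websockets  # pip install websockets

async def watch_metrics(device_id: str) -> None:
    uri = "ws://localhost:8000/api/ws/metrics"  # assumption: backend on localhost:8000
    async with websockets.connect(uri) as ws:
        # Subscribe to one device's sample stream.
        await ws.send(json.dumps({"type": "subscribe", "device_id": device_id}))
        async for raw in ws:
            sample = json.loads(raw)
            print(sample)  # CPU/memory/disk fields; schema not documented here

asyncio.run(watch_metrics("<id>"))
```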
## CLI

```bash
portoser health                # all services across all hosts
portoser health <service>      # one service
portoser health --watch        # live, refreshes in place
portoser health --json-output  # machine-parsable
```
## What "healthy" actually means
A service is healthy when:
- Its health check has passed at least once
- The most recent check passed within the configured interval × 2
A service is degraded when checks have started failing but the failure budget hasn't been exhausted, and unhealthy once failures exceed the budget.
The budget is configurable per service via `health_check.failure_threshold` (default 3).
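Putting the rules together, the per-service state machine is roughly the following. This is a sketch; the field names are illustrative, and whether a never-passed or stale check counts as degraded or unhealthy is an assumption.

```python
import time

def classify(last_pass_at: float | None, consecutive_failures: int,
             interval: int, failure_threshold: int = 3) -> str:
    """Map a check's recent history onto healthy / degraded / unhealthy."""
    if last_pass_at is None:
        return "unhealthy"                        # has never passed
    if consecutive_failures > failure_threshold:
        return "unhealthy"                        # failure budget exhausted
    if consecutive_failures > 0:
        return "degraded"                         # failing, but within budget
    if time.time() - last_pass_at > interval * 2:
        return "degraded"                         # checks have gone stale
    return "healthy"
```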
## What health monitoring does NOT do
- It does not auto-restart services — that's the job of the self-healing loop, which runs at deploy time.
- It does not page on-call. There's no alerting backend baked in. If you need pages, point a Prometheus or Healthchecks.io scrape at `/api/health/dashboard` and configure your own alerts.
- It does not store long-term metrics history. Metrics are kept in memory + Redis cache; for retention beyond a few hours, run a real TSDB (Prometheus, VictoriaMetrics).
## Wiring external monitoring
To export to Prometheus, point a scrape at:
```http
GET /api/metrics?format=prometheus
```
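A matching `prometheus.yml` scrape job might look like this; the target host/port is an assumption about where your web backend listens.

```yaml
scrape_configs:
  - job_name: portoser
    metrics_path: /api/metrics
    params:
      format: ["prometheus"]             # -> /api/metrics?format=prometheus
    static_configs:
      - targets: ["portoser-host:8000"]  # assumption: web backend address
```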
For uptime checks, the simplest setup is an external scraper (UptimeRobot, Healthchecks.io, your own cron) hitting `/api/health/dashboard` and alerting when overall status drops.
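For the cron route, a self-contained probe could look like the sketch below. The response is assumed to be JSON with a top-level `status` field; verify the actual field name against your deployment before relying on it.

```python
import json
import sys
import urllib.request

# Exit nonzero when overall health isn't green, so cron or Healthchecks.io can alert.
url = "http://localhost:8000/api/health/dashboard"  # assumption: backend address
with urllib.request.urlopen(url, timeout=10) as resp:
    dashboard = json.load(resp)

status = dashboard.get("status", "unknown")  # assumption: field name
print(f"overall: {status}")
sys.exit(0 if status == "green" else 1)
```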