The Self-Healing Loop

Every portoser deploy runs through a four-stage loop: observe → analyze → solve → learn. Each stage lives in its own script under lib/. The loop is on by default; opt out for a single deploy with --no-auto-heal.

This is the part of Portoser that earns its keep. Most orchestrators stop at "deploy failed." Portoser tries to work out why the deploy failed and what to do about it, drawing on a growing library of fixes.

A clean portoser deploy: observe finds no problems and the service starts on the first try.

Stage 1 — Observe (lib/observe/observer.sh)

Before deciding anything is wrong, Portoser collects facts about the host and the service:

  • Disk free space and inode pressure
  • Memory and swap usage
  • Process health (PID alive? zombied? OOM-killed?)
  • Port binding (is the expected port held by the expected process?)
  • Container state (running, restarting, exited, paused)
  • Network reachability of declared dependencies

These observations are written as structured events that the next stage can match against.
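As an illustration, one such check could be sketched as follows. The emit_event and check_disk_pct helpers, the tab-separated event format, and the 90% threshold are assumptions made for this example; the actual event format produced by lib/observe/observer.sh may differ.

```shell
# Hypothetical sketch, not the actual observer.sh.
# Each observation is one line: <check>\t<status>\t<detail>

emit_event() {
  # $1 = check name, $2 = status (ok or fail), $3 = free-form detail
  printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Classify a disk-usage percentage against a threshold (default 90).
check_disk_pct() {
  local used_pct="$1" threshold="${2:-90}"
  if [ "$used_pct" -ge "$threshold" ]; then
    emit_event disk_space fail "used=${used_pct}%"
  else
    emit_event disk_space ok "used=${used_pct}%"
  fi
}

# Read the current usage of / from POSIX-format df output.
observe_disk() {
  local used_pct
  used_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  check_disk_pct "$used_pct"
}
```

Calling observe_disk prints a single disk_space event for the root filesystem. Keeping the classification in check_disk_pct separate from the df call makes the threshold logic easy to exercise with fixed inputs.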

Stage 2 — Analyze (lib/diagnose/analyzer.sh)

The analyzer matches observations against a library of problem fingerprints. Examples that ship today:

  • Port already in use — another process binds the service's port
  • Dependency not ready — a declared dependency hasn't passed its health check
  • Docker daemon not running
  • Stale process holding files — leftover from a previous run
  • Disk space exhausted
  • SSH unreachable — the worker host stopped answering
  • Permission denied on bind-mount or socket

When a fingerprint matches, the analyzer produces a structured problem record that carries a confidence value and the observation evidence that triggered it.
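The matching step can be pictured with a minimal sketch like the one below. It assumes observations arrive as tab-separated lines of the form check, status, detail; the check names, fingerprint names, confidence labels, and record layout are all hypothetical and are not taken from lib/diagnose/analyzer.sh.

```shell
# Hypothetical sketch, not the actual analyzer.sh.
# Reads tab-separated events (<check>\t<status>\t<detail>) on stdin and
# prints one problem record per failing observation.

analyze_events() {
  local check status detail fingerprint confidence
  while IFS="$(printf '\t')" read -r check status detail; do
    # Only failing observations can match a fingerprint.
    [ "$status" = "fail" ] || continue
    case "$check" in
      port_binding)  fingerprint=port_conflict;  confidence=high ;;
      disk_space)    fingerprint=disk_exhausted; confidence=high ;;
      docker_daemon) fingerprint=docker_down;    confidence=high ;;
      *)             fingerprint=unmatched;      confidence=none ;;
    esac
    printf 'problem=%s confidence=%s evidence="%s: %s"\n' \
      "$fingerprint" "$confidence" "$check" "$detail"
  done
}
```

Failing observations that match no known fingerprint are still reported, here under the placeholder name unmatched, so that they remain visible for manual triage.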

Stage 3 — Solve (lib/solve/solver.sh and lib/solve/patterns/)

For each known problem fingerprint there is a corresponding solution pattern. The solver runs the matching pattern script. Examples:

  • Port conflict → identify the holder, prompt for kill, retry bind
  • Stale process → terminate, clean PID file, retry start
  • Disk space → run cleanup pattern (prune images, rotate logs), then retry
  • Docker daemon → start it, wait for socket, retry
  • Dependency not ready → wait with bounded backoff, retry health check

Solutions are intentionally conservative — they retry the deploy instead of forcing fixes that could mask real bugs.
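The "wait with bounded backoff" idea used for the dependency pattern can be sketched generically. The retry_with_backoff helper below is a hypothetical illustration, not code from lib/solve/patterns/: it runs a command up to a fixed number of times and doubles the delay between attempts.

```shell
# Hypothetical sketch of bounded exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 once attempts are exhausted.

retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    "$@" && return 0                          # success: stop retrying
    [ "$attempt" -ge "$max_attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))                      # double the wait each round
    attempt=$((attempt + 1))
  done
}
```

A call such as retry_with_backoff 5 1 curl -fsS http://localhost:8080/health (the URL is a placeholder) would try the health check at most five times, waiting 1, 2, 4, and 8 seconds between attempts. The attempt cap is what keeps the wait bounded, so a dependency that never comes up still surfaces as a failure.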

Stage 4 — Learn (lib/standardize/learning.sh)

Every successful resolution is recorded as a playbook in ~/.portoser/knowledge/playbooks/, and the frequency map at ~/.portoser/knowledge/problem_frequency.txt is updated to count it.

Two outcomes:

  1. Per-cluster memory — the next time the same fingerprint appears, the solver tries the recorded playbook first.
  2. Visible patterns — recurring problems surface as high-frequency entries, often pointing at a config bug worth fixing instead of patching repeatedly.
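To make the frequency map concrete, here is a minimal sketch of how such a file could be maintained. It assumes one "<fingerprint> <count>" pair per line, which may not match the real layout of problem_frequency.txt; record_problem and top_problems are hypothetical helper names.

```shell
# Hypothetical sketch, not the actual learning.sh.
# Assumed file format: one "<fingerprint> <count>" pair per line.

# Increment the count for a fingerprint, adding it if it is new.
record_problem() {
  local fingerprint="$1" freq_file="$2" tmp
  tmp=$(mktemp) || return 1
  touch "$freq_file"
  awk -v fp="$fingerprint" '
    $1 == fp { print $1, $2 + 1; seen = 1; next }
    { print }
    END { if (!seen) print fp, 1 }
  ' "$freq_file" > "$tmp" && mv "$tmp" "$freq_file"
}

# List fingerprints, most frequent first.
top_problems() {
  sort -k2,2nr "$1"
}
```

With a file like this, the first line of top_problems output is the most frequent fingerprint, which is the kind of recurring problem that usually deserves a config fix rather than another automated patch.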

The Knowledge Base UI (web/frontend/src/pages/KnowledgeBase.jsx) reads from this knowledge directory.

How it runs

The loop is invoked by lib/intelligent_deploy.sh and is the default path for portoser deploy. To deploy without the auto-fix step (e.g. when you want to see the raw failure), use:

portoser deploy <machine> <service> --no-auto-heal

For investigating without deploying, the analyzer is exposed directly:

portoser diagnose <service>

To see what the loop has learned so far:

portoser learn summary               # one-screen overview
portoser learn stats --json-output   # full stats
portoser learn playbooks             # every recorded playbook
portoser learn insights <service>    # what's failed, what got auto-fixed

What it isn't

  • Not an autonomous agent. The loop runs only as part of a deploy or when you invoke a stage directly. It does not poll your cluster looking for trouble in the background.
  • Not magic. It can only solve problems whose fingerprint is in the catalog. New shapes of failure are recorded as observations and surface as unmatched problems for you to triage.
  • Not a replacement for monitoring. It runs at deploy time. Continuous health is handled by the health monitoring subsystem.

Adding your own pattern

Patterns are shell scripts that follow a small contract. Drop a new file in lib/solve/patterns/ named after the fingerprint, expose a solve() function, and the solver will pick it up. See lib/solve/patterns/port_conflict.sh for the canonical example.
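A pattern file might look roughly like the sketch below. Only the location, the naming rule, and the solve() function come from the description above; the stale_pid_file fingerprint, the service-name argument, the PID_DIR variable, and the return-code convention are assumptions for illustration, so check port_conflict.sh for the real contract.

```shell
# lib/solve/patterns/stale_pid_file.sh  (hypothetical fingerprint and file name)
# Assumed contract: solve() receives the service name as $1 and returns 0
# when the obstacle is cleared, so the solver can retry the deploy.

solve() {
  local service="$1" pid_file pid
  # PID_DIR is an assumption; it defaults to /var/run here.
  pid_file="${PID_DIR:-/var/run}/${service}.pid"

  [ -f "$pid_file" ] || return 0          # nothing to clean up

  pid=$(cat "$pid_file")
  if kill -0 "$pid" 2>/dev/null; then
    # The process is still alive, so the PID file is not stale.
    echo "process $pid still alive; leaving $pid_file in place" >&2
    return 1
  fi

  rm -f "$pid_file"
  echo "removed stale PID file $pid_file"
}
```

The sketch stays conservative in the same spirit as the shipped patterns: it removes the PID file only when no live process owns it and otherwise reports failure rather than killing anything.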