GPU + CPU Split

One host has the GPU. The rest don't. The CPU hosts run APIs, queues, and databases; the GPU host runs inference, training, or anything that benefits from fast tensor math. Portoser decides which service runs where through plain current_host: entries — the GPU isn't a special concept to the orchestrator.

TL;DR

  • Hardware: One GPU host (NVIDIA workstation, M-series Mac for MLX, or AMD with ROCm) + 2–6 CPU hosts (Pis, Mac minis, NUCs)
  • GPU options: NVIDIA (CUDA), Apple Silicon (MLX), AMD (ROCm) — choose by what your inference stack supports
  • Network: At least 1 GbE; the GPU host serves predictions, so latency over the LAN matters
  • Storage: Models can be tens of GB; put them on the GPU host's local SSD, not over NFS
  • Time to first split: ~1–2 hours including model download
  • Biggest gotcha: Sharing a GPU between several services is harder than the registry suggests — you may need to serialize requests through one inference service rather than running multiple GPU services side by side

Why pick this shape

  • One expensive machine, several cheap ones. You don't need a GPU on every node.
  • The GPU host is the only place that needs the heavy CUDA / Metal / ROCm install. A failure on a Pi can't break inference.
  • Inference and training workloads differ from web workloads — they want big batches, not many small requests. Splitting them off lets you tune each host for its real job.

Registry skeleton

hosts:
  gpu-host:
    ip: 192.168.1.20
    arch: amd64-linux
    ssh_user: ml
    path: /home/ml/services
    roles: [inference, training]
  mini1:
    ip: 192.168.1.10
    arch: arm64-apple
    roles: [infrastructure, api]
  pi1:
    arch: arm64-linux
    roles: [queue]
  pi2:
    arch: arm64-linux
    roles: [databases]

services:
  llm-inference:
    hostname: llm.internal
    current_host: gpu-host
    deployment_type: docker
    docker_compose: /llm_inference/docker-compose.yml
    port: 9001
    healthcheck_url: http://llm.internal/health

  llm-router:
    hostname: llm-router.internal
    current_host: mini1
    deployment_type: local
    service_file: /llm_router/service.yml
    port: 8500
    # The router accepts requests, queues them, and forwards to llm-inference.
    # Living on a CPU host means it can be cheap and replicated; the GPU
    # host stays focused on the actual work.

  embeddings:
    hostname: embeddings.internal
    current_host: gpu-host
    deployment_type: docker
    docker_compose: /embeddings/docker-compose.yml
    port: 9101

current_host: gpu-host is the entire orchestration story. There is no "GPU resource" abstraction in Portoser — the registry just says "this service runs there."

Apple Silicon GPU (MLX)

If your GPU host is an M-series Mac mini or Studio, you don't need Docker GPU passthrough — MLX uses Metal directly from the host process. Run inference services as local (uv-managed Python) so they have raw access to the GPU.

The example registry already does this:

services:
  mlx-inference:
    hostname: mlx-inference.internal
    current_host: mini2
    deployment_type: local
    service_file: /mlx_inference/service.yml
    port: 8999

lib/local.sh runs the entrypoint outside any container, so MLX talks to Metal at full speed. Don't try to put MLX inside Docker Desktop on macOS: containers there run in a Linux VM with no Metal access, so anything that appears to work is running on the CPU, not the GPU.
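
What the local entrypoint looks like is up to you. As a rough sketch, assuming mlx-lm as the inference library (the model name, port, and JSON shape below are illustrative, not anything Portoser prescribes), a minimal mlx-inference service could be:

# Sketch of an /mlx_inference entrypoint: runs directly on the Mac, no container.
# Assumes `uv add mlx-lm`; model repo, port, and request format are examples only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from mlx_lm import load, generate

MODEL, TOKENIZER = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Simple health endpoint (useful if you add a healthcheck_url for this service)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        text = generate(MODEL, TOKENIZER, prompt=body["prompt"], max_tokens=256)
        payload = json.dumps({"completion": text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# Port 8999 matches the mlx-inference entry in the registry above.
HTTPServer(("0.0.0.0", 8999), Handler).serve_forever()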

NVIDIA GPU on Linux

For an amd64-linux box with an NVIDIA card, install the NVIDIA Container Toolkit on the GPU host and let deployment_type: docker use it. The compose file is where you opt in:

# In /llm_inference/docker-compose.yml on the GPU host
services:
  llm-inference:
    image: <your-llm-image>
    ports:
      - "9001:9001"   # host port from the registry; adjust the container side to your image
    deploy:
      resources:
        reservations:
          devices:
            # Reserve one GPU via the NVIDIA runtime (needs the Container Toolkit)
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Portoser doesn't manage GPU reservation — Docker does. The orchestrator's job is to make sure the right compose file lands on the right host. From there, NVIDIA's runtime handles the rest.

Sharing one GPU across services

You can have multiple GPU services on the same host. What you can't do is run them concurrently at full throttle without VRAM accounting. Two patterns:

Single inference service, batching upstream

The router on the CPU side queues requests and forwards them in batches. Only one inference service runs on the GPU host. This is the simpler pattern and the one most production setups converge on.
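
A hedged sketch of that router, assuming FastAPI and httpx on the CPU host and a /generate endpoint on llm.internal that accepts a batch — none of this is mandated by Portoser, it's just one way to implement the pattern:

# Sketch of the llm-router service: queue requests, forward them in batches.
# Endpoint path, batch size, and payload shape are assumptions, not Portoser APIs.
# Run with e.g.: uvicorn router:app --host 0.0.0.0 --port 8500
import asyncio

import httpx
from fastapi import FastAPI

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()
INFERENCE_URL = "http://llm.internal/generate"   # the single GPU-backed service
BATCH_SIZE = 8

@app.post("/generate")
async def enqueue(payload: dict):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    return await fut        # resolves once the batch worker hears back

async def batch_worker():
    async with httpx.AsyncClient(timeout=120) as client:
        while True:
            batch = [await queue.get()]
            while len(batch) < BATCH_SIZE and not queue.empty():
                batch.append(queue.get_nowait())
            # One in-flight batch at a time: the GPU host never sees
            # concurrent requests, so VRAM usage stays predictable.
            resp = await client.post(INFERENCE_URL, json=[p for p, _ in batch])
            for (_, fut), result in zip(batch, resp.json()):
                fut.set_result(result)

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(batch_worker())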

Multiple inference services, one workload at a time

Two inference services on the same GPU host, each handling a different model. Use Docker memory limits to keep host RAM in check (Docker can't partition VRAM) and arrange your routing so that requests for model A and model B don't overlap. Portoser's deployment history will show you each deploy clearly; the runtime conflict between the two services is yours to manage.
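
If you go this route, the cheapest guard is a single lock in the router so traffic for the two models never reaches the GPU at the same time. A sketch, reusing the hypothetical router above (service URLs and paths are again illustrative):

# One asyncio.Lock serializes all GPU-bound traffic: model A and model B
# requests alternate instead of competing for VRAM. Paths are assumptions.
import asyncio

import httpx

GPU_LOCK = asyncio.Lock()
MODEL_URLS = {
    "chat": "http://llm.internal/generate",
    "embed": "http://embeddings.internal/embed",
}

async def forward(model: str, payload: dict) -> dict:
    async with GPU_LOCK:     # only one workload touches the GPU at a time
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(MODEL_URLS[model], json=payload)
            return resp.json()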

If you find yourself wanting "Kubernetes-style" GPU-aware scheduling, you've outgrown this shape — see the limits section below.

Bringing it up

# 1. Make sure GPU drivers + container toolkit are installed on gpu-host
ssh gpu-host 'nvidia-smi'   # should print GPU info
ssh gpu-host 'docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi'

# 2. Deploy the inference service
./portoser deploy gpu-host llm-inference

# 3. Deploy the router on a CPU host
./portoser deploy mini1 llm-router

# 4. Watch resource usage
./portoser metrics
ssh gpu-host 'nvidia-smi -l 5'    # live GPU utilization

For MLX on macOS, no toolkit is needed — brew install whatever your inference stack is, then ./portoser deploy mini2 mlx-inference.

Health, metrics, and the GPU

./portoser metrics gives you CPU, RAM, and disk per service. It does not track GPU utilization or VRAM. For that you'll want one of:

  • nvidia-smi from the GPU host
  • A Prometheus exporter (DCGM exporter for NVIDIA, mactop for Apple Silicon)
  • Whatever your inference framework exposes (vLLM has built-in /metrics, MLX inference services typically expose token throughput)

You can register any of these as their own Portoser service so they show up in the dependency graph and benefit from health checks.
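
If the DCGM exporter feels heavy, a few lines wrapping nvidia-smi are enough to register as its own service. A sketch — the metric names and port are made up, and the query flags assume a reasonably recent NVIDIA driver:

# Tiny GPU metrics endpoint built on nvidia-smi; register it as a Portoser
# service on gpu-host if you want it in the dependency graph. Illustrative only.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def gpu_stats():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    # First line only; this sketch assumes a single GPU in the host
    util, used, total = (float(x) for x in out.strip().splitlines()[0].split(", "))
    return util, used, total

class Metrics(BaseHTTPRequestHandler):
    def do_GET(self):
        util, used, total = gpu_stats()
        body = (
            f"gpu_utilization_percent {util}\n"
            f"gpu_memory_used_mib {used}\n"
            f"gpu_memory_total_mib {total}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 9102), Metrics).serve_forever()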

Where this shape falls down

  • A single GPU host is a single point of failure. If you need redundancy, you need two GPU hosts and a router that can fail over — Portoser routes statically via current_host, so you'd implement the failover at the application layer.
  • Multi-tenancy across users / workloads is not a thing Portoser tries to solve. If your team is at the point where two people need different model versions running side by side with quotas, you're in scheduler territory.
  • Heavy training jobs that run for hours change the cluster's character. Health checks tuned for fast inference may flap during training. Either loosen their intervals and timeouts, or run training as a local long-running process the orchestrator just leaves alone (./portoser stop/start for lifecycle).

Next