# GPU + CPU Split

One host has the GPU. The rest don't. The CPU hosts run APIs, queues, and databases; the GPU host runs inference, training, or anything that benefits from fast tensor math. Portoser routes which service runs where via plain `current_host:` entries — the GPU isn't a special concept to the orchestrator.
## TL;DR

| Aspect | Details |
| --- | --- |
| Hardware | One GPU host (NVIDIA workstation, M-series Mac for MLX, or AMD with ROCm) + 2–6 CPU hosts (Pis, Mac minis, NUCs) |
| GPU options | NVIDIA (CUDA), Apple Silicon (MLX), AMD (ROCm) — choose by what your inference stack supports |
| Network | At least 1 GbE; the GPU host serves predictions, so latency over the LAN matters |
| Storage | Models can be tens of GB; put them on the GPU host's local SSD, not over NFS |
| Time to first split | ~1–2 hours including model download |
| Biggest gotcha | Sharing a GPU between several services is harder than the registry suggests — you may need to serialize requests through one inference service rather than running multiple GPU services side by side |
## Why pick this shape
- One expensive machine, several cheap ones. You don't need a GPU on every node.
- The GPU host is the only place that needs the heavy CUDA / Metal / ROCm install. Failure on a Pi can't break inference.
- Inference and training workloads differ from web workloads — they want big batches, not many small requests. Splitting them off lets you tune each host for its real job.
## Registry skeleton

```yaml
hosts:
  gpu-host:
    ip: 192.168.1.20
    arch: amd64-linux
    ssh_user: ml
    path: /home/ml/services
    roles: [inference, training]
  mini1:
    ip: 192.168.1.10
    arch: arm64-apple
    roles: [infrastructure, api]
  pi1:
    arch: arm64-linux
    roles: [queue]
  pi2:
    arch: arm64-linux
    roles: [databases]

services:
  llm-inference:
    hostname: llm.internal
    current_host: gpu-host
    deployment_type: docker
    docker_compose: /llm_inference/docker-compose.yml
    port: 9001
    healthcheck_url: http://llm.internal/health

  llm-router:
    hostname: llm-router.internal
    current_host: mini1
    deployment_type: local
    service_file: /llm_router/service.yml
    port: 8500
    # The router accepts requests, queues them, and forwards to llm-inference.
    # Living on a CPU host means it can be cheap and replicated; the GPU
    # host stays focused on the actual work.

  embeddings:
    hostname: embeddings.internal
    current_host: gpu-host
    deployment_type: docker
    docker_compose: /embeddings/docker-compose.yml
    port: 9101
```
`current_host: gpu-host` is the entire orchestration story. There is no "GPU resource" abstraction in Portoser — the registry just says "this service runs there."
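To make the router's role concrete, here is a minimal sketch of the queue-and-forward pattern in Python. This is hypothetical illustration code, not part of Portoser: the `forward` callable stands in for whatever actually calls llm-inference (e.g. an HTTP POST to `http://llm.internal/`), and is injected so the routing logic is testable without a GPU host.

```python
import queue
import threading

class InferenceRouter:
    """Sketch of the llm-router pattern: accept requests on a CPU host,
    queue them, and forward to the single GPU-backed inference service
    one at a time so the GPU is never oversubscribed."""

    def __init__(self, forward):
        # `forward` is whatever actually reaches llm-inference;
        # injected here so the sketch runs anywhere.
        self._forward = forward
        self._q = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, prompt):
        """Enqueue a request; returns an event plus a result slot the
        caller can wait on."""
        done = threading.Event()
        slot = {}
        self._q.put((prompt, done, slot))
        return done, slot

    def _drain(self):
        # Single worker thread == single in-flight GPU request.
        while True:
            prompt, done, slot = self._q.get()
            slot["result"] = self._forward(prompt)
            done.set()
```

A caller submits a prompt, then waits on the event; because one worker drains the queue, requests hit the GPU strictly in arrival order.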
## Apple Silicon GPU (MLX)

If your GPU host is an M-series Mac mini or Studio, you don't need Docker GPU passthrough — MLX uses Metal directly from the host process. Run inference services as `local` (uv-managed Python) so they have direct access to the GPU.
The example registry already does this:
```yaml
services:
  mlx-inference:
    hostname: mlx-inference.internal
    current_host: mini2
    deployment_type: local
    service_file: /mlx_inference/service.yml
    port: 8999
```
`lib/local.sh` runs the entrypoint outside any container, so MLX talks to Metal at full speed. Don't try to put MLX inside Docker Desktop on macOS — containers there run in a Linux VM with no access to the Metal GPU.
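What that `local` entrypoint might look like, minus the model code, is a plain HTTP server exposing a `/health` endpoint for Portoser's health checks. This is a hypothetical stub, not Portoser code; the MLX model load and inference would live where the comment indicates, and the port matches the `mlx-inference` registry entry above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    # Model load and inference (e.g. via MLX) would live in this
    # process; running outside a container keeps Metal reachable.
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the sketch quiet

def serve(port=8999):
    # 8999 matches the mlx-inference registry entry (assumption:
    # Portoser health-checks hit this port's /health path).
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Because the process owns the port directly, the orchestrator's `healthcheck_url` probe and the Metal-backed inference share one address space with no passthrough layer.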
## NVIDIA GPU on Linux

For an `amd64-linux` box with an NVIDIA card, install the NVIDIA Container Toolkit on the GPU host and let `deployment_type: docker` use it. The compose file is where you opt in:
```yaml
# In /llm_inference/docker-compose.yml on the GPU host
services:
  llm-inference:
    image: <your-llm-image>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
Portoser doesn't manage GPU reservation — Docker does. The orchestrator's job is to make sure the right compose file lands on the right host. From there, NVIDIA's runtime handles the rest.
## Sharing one GPU across services
You can have multiple GPU services on the same host. What you can't do is run them concurrently at full throttle without VRAM accounting. Two patterns:
### Single inference service, batching upstream

The router on the CPU side queues requests and forwards them in batches. Only one inference service runs on the GPU host. This is the simpler pattern and what most production setups end up with.
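The batching half of that pattern can be sketched as follows. This is illustrative code under the assumption that the inference service accepts a list of prompts in one call; `run_batch` stands in for that single call to llm-inference.

```python
import threading

class MicroBatcher:
    """Sketch of upstream batching: the CPU-side router holds requests
    briefly and hands the GPU one batch instead of many single calls."""

    def __init__(self, run_batch, max_batch=8):
        self._run_batch = run_batch   # the one call that hits llm-inference
        self._max_batch = max_batch
        self._pending = []
        self._lock = threading.Lock()

    def add(self, prompt):
        """Queue a prompt; returns the batch result when this add fills
        the batch, else None."""
        with self._lock:
            self._pending.append(prompt)
            if len(self._pending) >= self._max_batch:
                batch, self._pending = self._pending, []
                return self._run_batch(batch)
            return None

    def flush(self):
        """Force out a partial batch (e.g. on a timer tick), so slow
        periods don't strand requests waiting for a full batch."""
        with self._lock:
            if not self._pending:
                return None
            batch, self._pending = self._pending, []
            return self._run_batch(batch)
```

Real routers add a latency deadline per request; the size trigger plus a periodic `flush()` is the minimum that keeps the GPU fed with big batches, which is what L13's "big batches, not many small requests" point is about.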
### Multiple inference services, one workload at a time
Two inference services on the same GPU host, each handling a different model. Use Docker memory limits and arrange your routing so that requests for model A and model B don't overlap. Portoser's deployment history will show you each deploy clearly; the runtime conflict between them is yours to manage.
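Since that runtime conflict is yours to manage, one simple application-level discipline is an advisory file lock both services acquire before touching the GPU. This is a sketch of one possible approach, not a Portoser feature: the lock path is a hypothetical convention the two services would have to agree on, and it only serializes workloads, it doesn't account for VRAM.

```python
import fcntl
from contextlib import contextmanager

# Hypothetical shared path; both inference services on gpu-host would
# point at the same file (a convention you set up, not orchestrated).
GPU_LOCK_PATH = "/tmp/gpu-host.lock"

@contextmanager
def gpu_turn(lock_path=GPU_LOCK_PATH):
    """Serialize GPU use across processes with an advisory file lock:
    whichever service holds the lock owns the GPU for that workload."""
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until the other model is done
        try:
            yield   # load model / run inference while holding the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Because `flock` works across processes, the two services stay independent deploys in the registry while still taking strict turns on the hardware.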
If you find yourself wanting "Kubernetes-style" GPU-aware scheduling, you've outgrown this shape — see the limits section below.
## Bringing it up

```sh
# 1. Make sure GPU drivers + container toolkit are installed on gpu-host
ssh gpu-host 'nvidia-smi'   # should print GPU info
ssh gpu-host 'docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi'

# 2. Deploy the inference service
./portoser deploy gpu-host llm-inference

# 3. Deploy the router on a CPU host
./portoser deploy mini1 llm-router

# 4. Watch resource usage
./portoser metrics
ssh gpu-host 'nvidia-smi -l 5'   # live GPU utilization
```
For MLX on macOS, no toolkit is needed — `brew install` whatever your inference stack is, then `./portoser deploy mini2 mlx-inference`.
## Health, metrics, and the GPU

`./portoser metrics` gives you CPU, RAM, and disk per service. It does not track GPU utilization or VRAM. For that you'll want one of:

- `nvidia-smi` from the GPU host
- A Prometheus exporter (DCGM exporter for NVIDIA, `mactop` for Apple Silicon)
- Whatever your inference framework exposes (vLLM has built-in `/metrics`; MLX inference services typically expose token throughput)
You can register any of these as their own Portoser service so they show up in the dependency graph and benefit from health checks.
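A tiny exporter of that kind can be little more than a wrapper around `nvidia-smi`'s machine-readable output. This sketch (hypothetical, not part of Portoser) uses nvidia-smi's CSV query mode; the parsing is split out so it runs without a GPU.

```python
import subprocess

# nvidia-smi's CSV query mode; runs on the GPU host itself.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_stats(csv_line):
    """Parse one CSV line like '87, 10240, 24576' into a stats dict
    a small exporter service could serve over HTTP."""
    util, used, total = (float(x) for x in csv_line.split(","))
    return {"util_pct": util, "vram_used_mib": used, "vram_total_mib": total}

def read_gpu_stats():
    # Raises if no NVIDIA driver is present, which doubles as a
    # health signal when the exporter runs as its own service.
    out = subprocess.check_output(QUERY, text=True).strip()
    return parse_gpu_stats(out.splitlines()[0])
```

Registered as a `local` Portoser service on gpu-host, a loop around `read_gpu_stats()` fills the GPU-shaped gap in `./portoser metrics`.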
## Where this shape falls down

- A single GPU host is a single point of failure. If you need redundancy, you need two GPU hosts and a router that can fail over — Portoser routes statically via `current_host`, so you'd implement the failover at the application layer.
- Multi-tenancy across users / workloads is not a thing Portoser tries to solve. If your team is at the point where two people need different model versions running side by side with quotas, you're in scheduler territory.
- Heavy training jobs that run for hours change the cluster's character. Health checks tuned for fast inference may flap during training. Either tune them up, or run training as a `local` long-running process the orchestrator just leaves alone (`./portoser stop`/`start` for lifecycle).
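Application-layer failover, since Portoser won't do it for you, amounts to an ordered list of backends tried in turn. A hedged sketch: the backend URLs are hypothetical (a second GPU host is not in the registry above), and `call` stands in for your real HTTP client.

```python
def infer_with_failover(prompt, backends, call):
    """Try each inference backend in order. Portoser's routing is
    static, so falling over between two GPU hosts has to happen here,
    in the client. `backends` might be
    ["http://llm.internal", "http://llm-b.internal"] (second host
    hypothetical)."""
    last_err = None
    for base_url in backends:
        try:
            return call(base_url, prompt)
        except Exception as err:   # in practice: timeouts, connection errors
            last_err = err         # remember why, then try the next host
    raise RuntimeError("all inference backends failed") from last_err
```

Each registry entry still names exactly one `current_host`; the redundancy lives entirely in the caller's backend list.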
## Next

- Mixed Architecture Cluster — covers the cross-OS / cross-arch wiring this shape sits inside
- Operations: Health Monitoring — what the orchestrator checks, what it doesn't
- Deployment Types — `local` for MLX, `docker` for NVIDIA