Start from scratch.
For operators who have a GPU but haven't set up any inference software yet. Three commands; ~10 minutes; ~2 GB download.
- Step 01
Check your hardware
The default model (Llama-3.2-3B Q4_K_M) runs on:
- NVIDIA: any GPU with 4+ GB VRAM (RTX 3060+, A4000+, T4+)
- Apple Silicon: M1+ Mac, ~50-100 tok/s
- CPU only: works on modern x86_64, ~5-10 tok/s
More VRAM unlocks larger models you can configure later.
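To see which of the three tiers above a machine falls into, a rough check can be scripted. This is a minimal sketch, not part of the Use Pod tooling: the function name `detect_accelerator` and the `MIN_VRAM_MIB` threshold (reading "4+ GB" as 4096 MiB) are assumptions made for illustration, and it relies only on `nvidia-smi` and `uname`.

```shell
#!/bin/sh
# Rough hardware check mirroring the NVIDIA / Apple Silicon / CPU tiers.
# MIN_VRAM_MIB is an assumption: "4+ GB" interpreted as 4096 MiB.
MIN_VRAM_MIB=4096

detect_accelerator() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # memory.total is reported in MiB; keep the largest GPU's value.
    vram="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null \
      | sort -nr | head -n 1 | tr -d ' ')"
    case "$vram" in
      ''|*[!0-9]*) ;;  # no usable reading; fall through to the other checks
      *)
        if [ "$vram" -ge "$MIN_VRAM_MIB" ]; then
          echo "nvidia"
          return 0
        fi
        ;;
    esac
  fi
  # A shell running under Rosetta reports x86_64 here and lands in "cpu".
  if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "apple-silicon"
    return 0
  fi
  echo "cpu"
}

echo "Detected: $(detect_accelerator)"
```

An NVIDIA card below the threshold is reported as `cpu` on purpose, since the default model may not fit in its VRAM.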
- Step 02
Install llama.cpp + a model + the agent
Run this on the machine that has the GPU. It downloads Llama-3.2-3B (~2 GB), builds llama-server, and installs the Use Pod agent.
bash <(curl -fsSL https://usepod.ai/start-from-scratch.sh)
- Step 03
Start the model server
Run llama-server in a long-lived shell. (For production, wrap it in a systemd unit; the install script prints the exact command at the end.)
llama-server -m ~/.usepod-agent/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080
- Step 04
Pair
In a second terminal, generate a pair code:
usepod-agent setup
Type the printed code into the pair page.
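If pairing does not complete, one thing worth checking is whether the server from Step 03 is actually answering. The sketch below polls llama-server's `/health` endpoint on the port 8080 used above. The helper name `wait_for_llama` and its retry defaults are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Poll llama-server until it reports healthy, or give up.
# Assumes the --port 8080 from Step 03 and llama-server's GET /health endpoint.
wait_for_llama() {
  url="${1:-http://127.0.0.1:8080/health}"
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors, e.g. while the model is still loading.
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "server answered after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "no healthy response from $url" >&2
  return 1
}

# Example: wait_for_llama                               (defaults above)
# Example: wait_for_llama http://127.0.0.1:8080/health 5
```

Once the check passes, a request to the OpenAI-compatible `/v1/chat/completions` route on the same port should return a short completion, assuming your llama-server build exposes it.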
Already have a backend?
If you already run vLLM, Ollama, LM Studio, or llama.cpp, skip this walkthrough. Install the agent and pair:
curl -fsSL https://usepod.ai/install.sh | sh && usepod-agent setup
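Before pairing against an existing backend, it can help to confirm that it is reachable. The following is a sketch under stated assumptions: the ports are each project's usual default (vLLM 8000, Ollama 11434, LM Studio 1234, llama.cpp 8080) and may differ on your machine, the probe uses the OpenAI-compatible `/v1/models` route, and `probe_backends` is a hypothetical helper name.

```shell
#!/bin/sh
# Probe the usual default ports of the four backends named above.
# The port numbers are assumed defaults; edit the list if yours differ.
probe_backends() {
  host="${1:-127.0.0.1}"
  for entry in "vLLM:8000" "Ollama:11434" "LM Studio:1234" "llama.cpp:8080"; do
    name="${entry%%:*}"   # text before the colon
    port="${entry##*:}"   # text after the colon
    if curl -fsS --max-time 2 "http://${host}:${port}/v1/models" >/dev/null 2>&1; then
      echo "${name}: answering on port ${port}"
    else
      echo "${name}: no response on port ${port}"
    fi
  done
}

# Example: probe_backends 127.0.0.1
```

A "no response" line only means nothing answered on the assumed default port; a backend started on a custom port will not be found by this probe.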