Start from scratch.

For operators who have a GPU but haven't set up any inference software yet. Three commands; ~10 minutes; ~2 GB download.

  1. Step 01

    Check your hardware

    The default model (Llama-3.2-3B Q4_K_M) runs on:

    • NVIDIA: any GPU with 4+ GB VRAM (RTX 3060+, A4000+, T4+)
    • Apple Silicon: M1+ Mac, ~50-100 tok/s
    • CPU only: works on modern x86_64, ~5-10 tok/s

    More VRAM unlocks larger models you can configure later.

  2. Step 02

    Install llama.cpp + a model + the agent

    Run this on the machine that has the GPU. It downloads Llama-3.2-3B (~2 GB), builds llama-server, and installs the Use Pod agent.

    bash <(curl -fsSL https://usepod.ai/start-from-scratch.sh)
  3. Step 03

    Start the model server

    Run llama-server in a long-lived shell. (For production, wrap it in a systemd unit; the install script prints the exact command at the end.)

    llama-server -m ~/.usepod-agent/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080
  4. Step 04

    Pair

    In a second terminal, generate a pair code:

    usepod-agent setup

    Type the printed code into the pair page.

Already have a backend?

If you already run vLLM, Ollama, LM Studio, or llama.cpp, skip this walkthrough — just install the agent and pair:

curl -fsSL https://usepod.ai/install.sh | sh && usepod-agent setup