Forkable Notes
New guideSign in
GuidesSo you want to run local AI?

So you want to run local AI?

Your own AI on your own machine: private, offline, free. Fork it for your hardware and pick the right model.

T@toast· 16 notes· 6 forks
Apple Silicon · by @toast
MLX (native) · by @toast
NVIDIA or AMD GPU · by @toast
Strix Halo · by @toast
No GPU · by @toast
Ollama (easy mode) · by @toast
1
STEP 01·Original
Pick your hardware

Everything downstream depends on this. Fork below for your setup.

+ more · click to read
2
STEP 02·Original
Install llama.cpp

The engine itself: lean, fast, full control. llama.cpp runs everywhere.

+ more · click to read
3
STEP 03·Original
Grab a GGUF and run it

Download a GGUF from Hugging Face, then run `llama-cli -m model.gguf`.

+ more · click to read
4
STEP 04·Original
Give it a real interface

A terminal chat gets old fast. Get a harness

+ more · click to read
5
STEP 05·Original
Level up

Bigger models, faster serving, your own fine-tunes. vLLM for multi-user serving.

+ more · click to read
STEP 2a·Apple Silicon
On a Mac (M-series)

Your unified memory IS your VRAM, that is the magic. More RAM, bigger models.

+ more · click to read
STEP 2b·Apple Silicon
Models for your Mac

Match the model to your unified RAM. Qwen 3.6 is the best all-rounder.

+ more · click to read
STEP 2b.a·MLX (native)
Or run it with MLX, faster on Mac

Apple's own framework, usually quicker than llama.cpp here. `pip install mlx-lm`.

+ more · click to read
STEP 2b.b·MLX (native)
Serve an MLX model

`mlx_lm.server` exposes an OpenAI-compatible API, ready for any interface.

+ more · click to read
STEP 2a·NVIDIA or AMD GPU
Your GPU

VRAM is the whole game. NVIDIA uses CUDA, AMD Radeon uses ROCm or Vulkan.

+ more · click to read
STEP 2b·NVIDIA or AMD GPU
Models for your VRAM

Pick by card size. Q4_K_M quantization stretches what fits. Qwen 3.6 scales well.

+ more · click to read
STEP 2a·Strix Halo
AMD Strix Halo (unified memory)

Like a Mac: one big memory pool, up to 128GB shared. Fits huge models in a quiet box.

+ more · click to read
STEP 2b·Strix Halo
Models for a unified-memory box

With ~96GB you are not VRAM-limited. Run the big ones. Qwen 3.6 shines.

+ more · click to read
STEP 2a·No GPU
No GPU? Still works

CPU-only is slower, not impossible. Small models run fine on a normal laptop.

+ more · click to read
STEP 2b·No GPU
Tiny models that run anywhere

Stick to 4B and under. Gemma 4 E4B is the CPU champ.

+ more · click to read
STEP 3a·Ollama (easy mode)
Prefer Ollama? One command

The friendly wrapper over llama.cpp. ollama.com, every OS.

+ more · click to read