GuidesSo you want to run local AI?

So you want to run local AI?

Your own AI on your own machine: private, offline, free. Fork it for your hardware and pick the right model.

#ai #local #llm #privacy

T@toast· 16 notes· 6 forks

Apple Silicon · by @toast

MLX (native) · by @toast

NVIDIA or AMD GPU · by @toast

Strix Halo · by @toast

No GPU · by @toast

Ollama (easy mode) · by @toast

STEP 01·Original

Pick your hardware

Everything downstream depends on this. Fork below for your setup.

+ more · click to read

T@toast 4

STEP 02·Original

Install llama.cpp

The engine itself: lean, fast, full control. llama.cpp runs everywhere.

+ more · click to read

T@toast 1

STEP 03·Original

Grab a GGUF and run it

Download a GGUF from Hugging Face, then run `llama-cli -m model.gguf`.

+ more · click to read

T@toast

STEP 04·Original

Give it a real interface

A terminal chat gets old fast. Get a harness

+ more · click to read

T@toast

STEP 05·Original

Level up

Bigger models, faster serving, your own fine-tunes. vLLM for multi-user serving.

+ more · click to read

T@toast

STEP 2a·Apple Silicon

On a Mac (M-series)

Your unified memory IS your VRAM, that is the magic. More RAM, bigger models.

+ more · click to read

T@toast

STEP 2b·Apple Silicon

Models for your Mac

Match the model to your unified RAM. Qwen 3.6 is the best all-rounder.

+ more · click to read

T@toast 1

STEP 2b.a·MLX (native)

Or run it with MLX, faster on Mac

Apple's own framework, usually quicker than llama.cpp here. `pip install mlx-lm`.

+ more · click to read

T@toast

STEP 2b.b·MLX (native)

Serve an MLX model

`mlx_lm.server` exposes an OpenAI-compatible API, ready for any interface.

+ more · click to read

T@toast

STEP 2a·NVIDIA or AMD GPU

Your GPU

VRAM is the whole game. NVIDIA uses CUDA, AMD Radeon uses ROCm or Vulkan.

+ more · click to read

T@toast

STEP 2b·NVIDIA or AMD GPU

Models for your VRAM

Pick by card size. Q4_K_M quantization stretches what fits. Qwen 3.6 scales well.

+ more · click to read

T@toast

STEP 2a·Strix Halo

AMD Strix Halo (unified memory)

Like a Mac: one big memory pool, up to 128GB shared. Fits huge models in a quiet box.

+ more · click to read

T@toast

STEP 2b·Strix Halo

Models for a unified-memory box

With ~96GB you are not VRAM-limited. Run the big ones. Qwen 3.6 shines.

+ more · click to read

T@toast

STEP 2a·No GPU

No GPU? Still works

CPU-only is slower, not impossible. Small models run fine on a normal laptop.

+ more · click to read

T@toast

STEP 2b·No GPU

Tiny models that run anywhere

Stick to 4B and under. Gemma 4 E4B is the CPU champ.

+ more · click to read

T@toast

STEP 3a·Ollama (easy mode)

Prefer Ollama? One command

The friendly wrapper over llama.cpp. ollama.com, every OS.

+ more · click to read

T@toast