So you want to run local AI?
Your own AI on your own machine: private, offline, free. Fork it for your hardware and pick the right model.

Everything downstream depends on this. Fork below for your setup.
Download a GGUF from Hugging Face, then run `llama-cli -m model.gguf`.
A terminal chat gets old fast. Get a harness
Your unified memory IS your VRAM, that is the magic. More RAM, bigger models.
Apple's own framework, usually quicker than llama.cpp here. `pip install mlx-lm`.
`mlx_lm.server` exposes an OpenAI-compatible API, ready for any interface.
VRAM is the whole game. NVIDIA uses CUDA, AMD Radeon uses ROCm or Vulkan.
Like a Mac: one big memory pool, up to 128GB shared. Fits huge models in a quiet box.
CPU-only is slower, not impossible. Small models run fine on a normal laptop.
Stick to 4B and under. Gemma 4 E4B is the CPU champ.
The friendly wrapper over llama.cpp. ollama.com, every OS.