
The most popular performant desktop LLM runtimes are pure GPU (ExLlama) or GPU + CPU (llama.cpp). People stuff the biggest models that will fit into the collective RAM + VRAM pool, up to ~48GB for Llama 70B. Sometimes users will split models across two 24GB CUDA GPUs.
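The "biggest model that fits" arithmetic above can be sketched roughly. The footprint formula and the 10% overhead factor for KV cache and buffers are my assumptions, not exact GGUF accounting:

```python
def model_bytes(n_params: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough footprint: quantized weights plus ~10% for KV cache/buffers (overhead is a guess)."""
    return n_params * bits_per_weight / 8 * overhead

# Hypothetical 48 GB pool: two 24 GB GPUs, or 24 GB VRAM + 24 GB system RAM.
pool = 48e9
for bits in (4, 5, 8):
    size = model_bytes(70e9, bits)
    print(f"70B @ {bits}-bit: ~{size / 1e9:.0f} GB -> fits in 48 GB pool: {size <= pool}")
```

By this estimate a 4-bit 70B quant squeezes in with a little headroom, while 5-bit and above spills out of the pool, which matches why ~48GB is the practical ceiling people aim at.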

Inference, for me, is bottlenecked by GPU and CPU RAM bandwidth. TBH the biggest frustration is that OEMs like y'all can't double up VRAM like you could in the old days, or sell platforms with a beefy IGP, and that quad-channel+ CPUs are too expensive.

Vulkan is a popular target that runtimes seem to be converging on. IGPs with access to lots of memory capacity + bandwidth will be very desirable, and I hear Intel/AMD are cooking up quad-channel IGPs.

On the server side, everyone is running Nvidia boxes, I guess. But I had a dream about an affordable llama.cpp host: The cheapest Sapphire Rapids HBM SKUs, with no DIMM slots, on a tiny, dirt cheap motherboard you can pack into a rack like sardines. Llama.cpp is bottlenecked by bandwidth, and ~64GB is perfect.
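The bandwidth bottleneck has a simple ceiling: at batch size 1, every weight is read once per generated token, so tokens/sec can't exceed memory bandwidth divided by model size. The bandwidth figures below are rough published specs I'm assuming, not measurements:

```python
def tokens_per_sec_ceiling(weight_bytes: float, bandwidth_bps: float) -> float:
    """Upper bound for batch-1 decode: every weight is streamed once per token."""
    return bandwidth_bps / weight_bytes

# ~40 GB of weights (a 4-bit 70B quant) on different memory systems.
weights = 40e9
systems = [
    ("dual-channel DDR5 (~80 GB/s)", 80e9),
    ("24 GB GDDR6X GPU (~1 TB/s)", 1000e9),
    ("Sapphire Rapids HBM (~1 TB/s)", 1000e9),
]
for name, bw in systems:
    print(f"{name}: ~{tokens_per_sec_ceiling(weights, bw):.1f} tok/s ceiling")
```

That's the whole appeal of the HBM SKU: CPU-class capacity with GPU-class bandwidth, so the batch-1 ceiling lands near GPU territory instead of the ~2 tok/s a dual-channel desktop tops out at.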


