
Not an expert, so excuse me if this is obvious, but would these integrated graphics be any good for NLP? A GPU with 24 GB of video memory costs $2000, but you can put one of these in a system with 128 GB or 256 GB of DDR4 or DDR5 and give your neural network training software over 100 GB of video memory if you want.

You only have 12 CUs, 768 shading units, 48 texture mapping units, and 32 ROPs, but huge amounts of cheap memory. I'm not sure where the bottleneck is, but at least it won't crash and burn if you ask it to start a neural network training run that requires 100 GB of RAM, and you don't have to take out a second mortgage for a video card with the requisite amount of graphics memory.
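
For a sense of scale (my own back-of-envelope numbers, nothing from the article): the weights alone for a 70B model at fp16 are ~140 GB, which fits in 256 GB of system RAM but on no consumer GPU. Training needs several times that again for gradients, optimizer state, and activations.

    # Rough weight-memory sketch (illustrative only; assumes dense weights, inference only)
    def weight_gb(params_billion, bytes_per_param):
        # 1e9 params * bytes-per-param / 1e9 bytes-per-GB
        return params_billion * bytes_per_param

    for params in (7, 13, 70):
        for name, b in (("fp16", 2), ("int8", 1), ("int4", 0.5)):
            print(f"{params}B @ {name}: ~{weight_gb(params, b):.0f} GB of weights")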



They’re good from the “do it at home” perspective, not from the business or enterprise performance perspective.

One of the ways folks do this now is to use the Mac M* chips, since they have so much unified memory. The raw performance isn't as high as discrete GPUs, but they can fit substantially larger models in memory.


The bottleneck would almost certainly be memory: without careful optimization, you'll quickly overwhelm the on-die cache.

That said, I think AMD's chiplet strategy might come into play. I could see AMD releasing a 4-core/8-thread processor with increased on-die cache and the other chiplets being neural compute units.


People keep reiterating this, but in practice one needs compute and bandwidth, especially outside of tiny-context test prompts. On my 4900HS, mlc-llm's Vulkan backend is far faster than CPU inference on the same memory bus, even though the iGPU has less cache, which wouldn't be the case if inference were purely bandwidth/cache bound (since the CPU has far more cache as well).

My 7800X3D has 96 MB of L3 and a golden-bin DDR5 overclock, but it's absolutely dreadful for inference.
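
A rough roofline with round numbers (mine, not measurements from either machine): batch-1 decode has to stream the full weight set for every token, so bandwidth sets a hard ceiling, but prompt processing is GEMM-heavy and compute bound, which is why an iGPU can beat a CPU sitting on the identical memory bus.

    # Back-of-envelope decode ceiling: tokens/s <= bandwidth / weight bytes.
    # Bandwidth figures are approximate spec-sheet numbers, not benchmarks.
    def decode_ceiling(bandwidth_gb_s, params_billion, bytes_per_param):
        return bandwidth_gb_s / (params_billion * bytes_per_param)

    for name, bw in [("dual-channel DDR5-6000 (~96 GB/s)", 96),
                     ("Apple M2 Max (~400 GB/s)", 400),
                     ("RTX 4090 (~1000 GB/s)", 1000)]:
        print(f"{name}: ~{decode_ceiling(bw, 7, 2):.0f} tok/s ceiling for a 7B fp16 model")

    # Actually reaching the ceiling still takes enough compute; fall short on
    # FLOPS (as CPUs do during prefill) and you never get near it.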


I don't disagree, but the new chips here have dedicated neural compute units, and he's specifically talking about models larger than 24 GB.


They're slow, but OK for inference.

In practice no one uses AMD/Intel IGPs because no one knows about the mlc-llm Vulkan backend. llama.cpp, which is in vogue on the desktop, does not support IGPs outside of Apple's, and otherwise people use backends targeted at server GPUs.
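
For anyone who wants to try it, a minimal sketch of the mlc-llm Vulkan path. The module, model string, and method names below are from memory of the older mlc_chat Python package and may not match current releases, so treat them as placeholders and check the MLC LLM docs.

    # Sketch only: API names assumed from the older mlc_chat package.
    from mlc_chat import ChatModule

    # device="vulkan" is the part that matters: inference runs on the IGP
    # instead of the CPU, while the weights still sit in ordinary system RAM.
    cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", device="vulkan")
    print(cm.generate(prompt="Explain memory bandwidth in one sentence."))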



