CXL is that, but a compromise. Practically, that means an APU. Intel/AMD are rep...

bick_nyers · on Aug 9, 2023

At the end of the day it's limited by the PCIe lanes it is attached to, no different from a second GPU.

I just saw one yesterday: 128 GB on PCIe 5.0 x8, which tops out at about 32 GB/s in each direction, whereas the 4090 has about 1 TB/s of memory bandwidth.
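For anyone who wants to check that arithmetic, here is a rough sketch. It assumes the standard PCIe 5.0 figures (32 GT/s per lane, 128b/130b encoding) and ignores protocol overhead, so real throughput would be somewhat lower. The function and variable names are just mine for illustration.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int,
                        encoding: float = 128 / 130) -> float:
    """Approximate one-direction PCIe bandwidth in GB/s.

    gt_per_s: raw transfer rate per lane (32 GT/s for PCIe 5.0).
    encoding: line-code efficiency (128b/130b for PCIe 3.0 and later).
    Packet/protocol overhead is ignored, so this is an upper bound.
    """
    bits_per_second = gt_per_s * lanes * encoding  # usable Gbit/s
    return bits_per_second / 8                     # bits -> bytes


pcie5_x8 = pcie_bandwidth_gb_s(32.0, 8)

# "1 TB/s" for the 4090, taken from the comment above as a round number.
gpu_vram_gb_s = 1000.0
ratio = gpu_vram_gb_s / pcie5_x8

print(f"PCIe 5.0 x8: ~{pcie5_x8:.1f} GB/s per direction")
print(f"GPU memory is roughly {ratio:.0f}x faster")
```

So the card's link is on the order of 30x slower than the 4090's local memory, which is the whole point about it being no different from a second GPU.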

IMO for inference, dual-socket EPYC (24-channel DDR5) is the way to go, and CXL can theoretically let you bump up the bandwidth in that situation (assuming you can optimize the software properly). llama.cpp already seems to have some NUMA-related issues on 2P servers.
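To put a number on the 24-channel claim, here is a back-of-the-envelope sketch. The DDR5-4800 speed is my assumption (the comment doesn't name a speed grade), and this is a theoretical peak summed across both sockets; NUMA-unaware software will see much less. The CXL figure reuses the ~32 GB/s per x8 device quoted above, and the names are hypothetical.

```python
def ddr_bandwidth_gb_s(mt_per_s: int, channels: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    mt_per_s: transfer rate in MT/s (4800 for DDR5-4800, an assumption).
    bus_bytes: data width per channel (64 bits = 8 bytes).
    """
    return mt_per_s * bus_bytes * channels / 1000  # MB/s -> GB/s


# Dual socket, 12 channels each = 24 channels in total.
dram_peak = ddr_bandwidth_gb_s(4800, 24)

# Each CXL device adds at most its PCIe link bandwidth (approximate).
cxl_per_device = 32.0
gain_per_device = cxl_per_device / dram_peak

print(f"DDR5 peak: {dram_peak:.1f} GB/s")
print(f"One x8 CXL device adds at most {gain_per_device:.1%}")
```

On paper that's roughly 4090 territory from DRAM alone, with each x8 CXL device adding only a few percent on top, and only if the software actually spreads its reads across both sockets and the CXL memory.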