FWIW (I'd need to remeasure, but IIRC) my system with a 4090 only draws ~500 W (maybe up to 600 W) during LLM inference. LLMs have a much harder time saturating the compute than Stable Diffusion, I'm assuming because of VRAM bandwidth (and this is all on-card, nothing swapping from system memory). The 4090 itself only drew 300–400 W most of the time because of this.
If you assume 600 W for the entire system, that's only about 6 kWh per 1M tokens; for me, 6 kWh at $0.20/kWh comes to $1.20 per 1M tokens.
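A back-of-envelope check of those figures. The ~28 tokens/s generation rate is my assumption: it's what a 6 kWh per 1M token figure implies at a 600 W sustained draw, and it's in the right ballpark for a 4090 running a model entirely in VRAM.

```python
SYSTEM_WATTS = 600          # assumed whole-system draw during inference
TOKENS_PER_SECOND = 28      # assumed single-stream generation speed
PRICE_PER_KWH = 0.20        # USD, the residential rate used above

seconds_per_million = 1_000_000 / TOKENS_PER_SECOND
kwh_per_million = SYSTEM_WATTS * seconds_per_million / 3_600_000
cost_per_million = kwh_per_million * PRICE_PER_KWH

print(f"{kwh_per_million:.1f} kWh per 1M tokens")   # ~6.0 kWh
print(f"${cost_per_million:.2f} per 1M tokens")     # ~$1.19
```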
And that's without the power-efficiency improvements an H100 has over the 4090, so I think $2/1M tokens should be achievable once you combine H100 efficiency, batching, etc. Since LLM generation time generally dwarfs the network delay anyway, you could host somewhere like Washington State for dirt cheap (their residential electricity prices are almost half of what I used in the calculation above).
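Roughly how the electricity-only cost scales with the rate. The $0.10/kWh figure below is just a stand-in for "almost half" of the $0.20 used above, not an exact quote of Washington residential pricing.

```python
KWH_PER_MILLION_TOKENS = 6.0     # from the estimate above

for price in (0.20, 0.10):
    cost = KWH_PER_MILLION_TOKENS * price
    print(f"${cost:.2f} per 1M tokens at ${price:.2f}/kWh")
# -> $1.20 per 1M tokens at $0.20/kWh
# -> $0.60 per 1M tokens at $0.10/kWh
```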