
FWIW - I need to re-measure, but IIRC my system with a 4090 only draws ~500 W (maybe up to 600 W) during LLM inference. LLMs have a much harder time saturating the compute than Stable Diffusion does, I'm assuming because of VRAM bandwidth (and this is all on-card, nothing swapping from system memory). Because of that, the 4090 itself only really drew 300-400 W most of the time.

If you assume 600 W for the entire system, that works out to roughly 6 kWh per 1M tokens; at my rate of $0.20/kWh that's $1.20 per 1M tokens.
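A minimal sketch of that arithmetic (the ~28 tokens/s throughput is my assumption, not stated in the comment; it's the rate that makes 1M tokens take the ~10 hours implied by 6 kWh at 600 W):

    # Back-of-the-envelope cost of 1M tokens on a local rig.
    # Assumed: 600 W whole-system draw, ~28 tokens/s throughput (assumption),
    # $0.20/kWh electricity.
    system_power_w = 600
    tokens_per_second = 27.8
    price_per_kwh_usd = 0.20

    hours = 1_000_000 / tokens_per_second / 3600
    energy_kwh = system_power_w / 1000 * hours
    cost_usd = energy_kwh * price_per_kwh_usd

    print(f"{hours:.1f} h, {energy_kwh:.1f} kWh, ${cost_usd:.2f} per 1M tokens")
    # -> roughly 10.0 h, 6.0 kWh, $1.20 per 1M tokens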

And that's without the power-efficiency improvements an H100 has over the 4090, so I think $2/1M tokens should be achievable once you combine H100 efficiency with batching, etc. Since LLM generation time generally dwarfs the network delay anyway, you could host somewhere like Washington state for dirt-cheap power (their residential prices are almost half of what I used in the calculation above).



Are you using batch size 1 with LLMs? Larger batch sizes get much higher utilization.
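To make the batching point concrete, here's a hedged sketch of batch-size > 1 generation. The original comments don't say which inference stack is used; this assumes Hugging Face transformers on the 4090, with gpt2 as a placeholder model name:

    # Sketch: batched generation with Hugging Face transformers.
    # Model name and stack are assumptions for illustration only.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; swap in whatever fits in VRAM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token      # gpt2 has no pad token
    tokenizer.padding_side = "left"                # usual for decoder-only models
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    prompts = [
        "Explain GPU power draw during LLM inference.",
        "Why is token-by-token decoding memory-bandwidth bound?",
        "Estimate the cost of generating one million tokens.",
        "How does batching improve GPU utilization?",
    ]

    # Tokenize and generate the whole batch in one pass.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64,
                             pad_token_id=tokenizer.eos_token_id)

    for text in tokenizer.batch_decode(out, skip_special_tokens=True):
        print(text, "\n---")

The intuition: single-stream decoding is limited by reading the weights from VRAM for every token, so serving several sequences per weight read raises tokens/s (and tokens per joule) without much extra power draw.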


Well, with those numbers, I pay $0.10/kWh, so theoretically $0.60 per 1M tokens.



