FWIW - I need to remeasure, but IIRC my system with a 4090 only uses ~500 W (maybe up to 600 W) during LLM inference. LLMs have a much harder time saturating the compute than Stable Diffusion, I'm assuming because of VRAM bandwidth (and this is all on-card, nothing swapping from system memory). The 4090 itself only really drew 300-400 W most of the time because of this.
If you assume 600 W for the entire system, that works out to only ~6 kWh per 1M tokens; at $0.20/kWh that's $1.20 per 1M tokens for me.
And that's without the power-efficiency improvements an H100 has over the 4090. So I think $2/1M tokens should be achievable once you combine H100 efficiency, batching, etc. Since LLM generation time generally dwarfs network delay anyway, you could host in places like Washington for dirt-cheap prices (their residential rates are almost half of what I used for the calculation).
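For anyone wanting to sanity-check that arithmetic: the 6 kWh/1M-token figure implicitly assumes a generation rate of roughly 28 tok/s at a 600 W wall draw. A throwaway sketch (the inputs are assumptions consistent with the numbers above, not measurements):

```python
def cost_per_million_tokens(system_watts, tokens_per_sec, usd_per_kwh):
    """Energy and cost to generate 1M tokens at a steady wall draw."""
    seconds = 1_000_000 / tokens_per_sec
    kwh = system_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh, kwh * usd_per_kwh

# 600 W whole-system draw, ~28 tok/s, $0.20/kWh (all assumed)
kwh, usd = cost_per_million_tokens(600, 27.8, 0.20)
# kwh ≈ 6.0, usd ≈ 1.20 — matching the figures above
```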
Depends how and where you source your energy. If you invest in your own solar panels and batteries, all that energy is essentially fixed price (cost of the infrastructure) amortized over the lifetime of the setup (1-2 decades or so). Maybe you have some variable pricing on top for grid connectivity and use the grid as a fallback. But there's also the notion of selling excess energy back to the grid that offsets that.
So, 10 kWh could cost a lot less than what you cite. That's also how grid operators make money: they generate cheaply and sell with a nice margin. In some markets, prices are set by the most expensive source on the grid (coal, nuclear, etc.), so that pricing doesn't reflect the actual cost of renewables, which is typically much lower. Anyone consuming large amounts of energy will be looking to cut their costs; for data centers that typically means investing in energy generation, storage, and efficient hardware and cooling.
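To make the amortization point concrete, a rough sketch (every number here is an illustrative assumption, not a quote):

```python
# Amortized cost of self-generated solar over the system's lifetime
capex_usd = 20_000        # panels + batteries + inverter + install (assumed)
lifetime_years = 20
kwh_per_year = 15_000     # depends heavily on location and array size (assumed)

cost_per_kwh = capex_usd / (lifetime_years * kwh_per_year)
# ≈ $0.067/kWh — roughly a third of the $0.20/kWh grid rate used upthread
```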
During the crypto boom there were crypto miners in China who got really cheap electricity from hydroelectric dams built in rural areas. Shipping electricity long distance is expensive (both in terms of infrastructure and losses - unless you pay even more for HVDC infrastructure), so they were able to get great prices as local consumers of "surplus" energy.
That might be a great opportunity for cheap LLMs too.
Batching changes that equation a fair bit. Also, these cards won't draw full power, since LLMs are mostly limited by memory bandwidth and the compute units get some idle time.
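The memory-bandwidth point can be put in rough numbers. For a dense model, decoding one token requires streaming essentially all the weights through the GPU once, so single-stream throughput is bounded by bandwidth divided by model size, and batching amortizes that same weight read across many sequences (model size and batch here are assumptions):

```python
bandwidth_gb_s = 1008   # RTX 4090 peak memory bandwidth, per spec sheet
model_gb = 13           # e.g. a 13B model at roughly 1 byte/weight (assumed)
batch_size = 32

# Upper bound on single-stream decode speed: one full weight read per token
single_stream_tok_s = bandwidth_gb_s / model_gb   # ~78 tok/s
# Batching reuses the same weight read for every sequence in the batch,
# so aggregate throughput scales until compute becomes the bottleneck
aggregate_tok_s = single_stream_tok_s * batch_size
```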
Is $0.2-0.4/kWh a good estimate for the price paid in a data center? That's pretty expensive for energy, and I think vPPA prices at big data centers are much lower ($0.1 is probably a decent upper bound in the US, though I could see the EU being up to 2x more expensive).
If comparing apples to apples, the 4090 needs to clock up and consume about 450 W to match the A100 at 350W. Part of that is due to being able to run larger batches on the A100, which gives it an additional performance edge, but yes in general the A100 is more power efficient.
Mistral-small explicitly has the inference cost of a 12.9B model, and more than that, it's probably run with a batch size of 32 or higher. They'll worry more about offsetting training costs than about this.
No, I am referring to the 7Bx8 MoE model. The MoE layers apparently can be sparsified (or equivalently, quantized down to a single bit per weight) with minimal loss of quality.
Inference on a quantized model is faster, not slower.
However, I have no idea how practical it is to run a LLM on a phone. I think it would run hot and waste the battery.
Really? Well that's very exciting. I don't care about wasting my battery if it can do my menial tasks for me, battery is a currency I'd gladly pay for this use case.
I think we're at least a couple of generations away from this being feasible for these models, unless it's for a limited number of background tasks at fairly slow inference speed. SoC power-draw limits will probably cap inference at about 2-5 tok/s (the lower end for Mixtral, which has the processing requirements of a 14B) and would suck an iPhone Max dry in about an hour.
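Back-of-envelope behind the battery claim (capacity, draw, and speed are all rough assumptions, not measurements):

```python
battery_wh = 16.0       # iPhone Max-class battery capacity, approximate
soc_draw_w = 14.0       # sustained SoC draw during inference (assumed)
tok_per_sec = 3.5       # midpoint of the 2-5 tok/s range above

hours_to_empty = battery_wh / soc_draw_w                 # ~1.1 h of continuous decoding
tokens_per_charge = tok_per_sec * hours_to_empty * 3600  # ~14k tokens on one charge
```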
Maybe this way, we'll not just get user-replaceable batteries in smartphones back - maybe we'll get hot-swappable batteries for phones, as everyone will be happy to carry a bag of extra batteries if it means using advanced AI capabilities for the whole day, instead of 15 minutes.
Not true. Not everyone is building a chatbot or a similar interface that requires latency low enough for a live user. While your example is of course incredibly slow, there are still many interesting things that could be done if it were a little bit quicker.
Not LLMs, but locally running facial- and object-recognition models on your phone's gallery, to build up a database for face/object search in the gallery app? I'm half-convinced this is how Samsung does it, but I can't really be sure of much, because all the photo AI stuff works weirdly and in an unobservable way, probably because of some EU ruling.
(That one is a curious case. I once spent some time trying to figure out why no major photo app seems to support manually tagging faces, which is a mind-numbingly obvious feature, and something software supported a decade or so ago. I couldn't find anything definitive; there's this eerie conspiracy of silence on the topic that made me doubt my own sanity at times. Eventually, I dug up hints that some EU ruling/regs related to facial recognition led everyone to remove or geolock the feature. Still nothing specific, though.)
I don’t think it’s safe to assume any of this. It’s still limited release which reads as invite only. Once it hits some kind of GA then we can test and verify.
I understand how Mistral could end up being the most popular open source LLM model for the foreseeable future. What I cannot understand is who they expect to convince to pay for their API. As long as you are shipping your data to a third-party, whether they are running an open or closed source model is inconsequential.
I pay for hosted databases all the time. It’s more convenient. But those same databases are popular because they are open source.
I also know that because it’s open source, if I ever have a need to, I can host it on my own servers. Currently I don’t have that need, but it’s nice to know that it’s in the cards.
Open source databases are SOTA or very close to it, though. Here the value proposition is to pay 10-50% less for an inferior product. Portability is definitely an advantage, but that's another aspect which I think detracts from their value: if I can run this anywhere, I will either host it myself or pay whoever can make it happen very cheap. Even OpenAI could host an API for Mistral.
> Here the value proposition is to pay 10-50% less for an inferior product.
OpenAI just went through an existential crisis where the company almost collapsed. They are also quite unreliable. For some use cases, I'll take a service that does slightly worse on outputs but much better on reliability. For example, if I'm building a customer service chatbot, it's a pretty big deal if the LLM backend goes down. With an open-source model, I can build it using the cloud provider. If they are a reliable host, I'll probably stick with them as I grow. If not, I always have the option of running the model myself. This alleviates a lot of the risk.
You may be fine with shipping your data to OpenAI or Mistral, but worry about what happens if they change terms or if their future models change in a way that causes problems for you, or if they go bankrupt. In any of those cases, knowing you can take the model and run it yourself (or hire someone else to run it for you) mitigates risk. Whether those risks matter enough will of course differ wildly.
If I'm happy with my infrastructure being built on top of the potential energy of a load-bearing rugpull, I'd probably stick with OpenAI in the average use case.
Pricing has been released too.
Per 1 million output tokens:
Mistral-medium $8
Mistral-small $1.94
gpt-3.5-turbo-1106 $2
gpt-4-1106-preview $30
gpt-4 $60
gpt-4-32k $120
This suggests that they’re reasonably confident the mistral-medium model is substantially better than GPT-3.5.
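The implied positioning is easy to read off the ratios (prices copied from the list above):

```python
usd_per_m_output_tokens = {
    "mistral-medium": 8.00,
    "mistral-small": 1.94,
    "gpt-3.5-turbo-1106": 2.00,
    "gpt-4-1106-preview": 30.00,
}

# mistral-medium is priced at 4x gpt-3.5-turbo but ~27% of gpt-4-turbo,
# i.e. it is being slotted between the two tiers
medium_vs_35 = usd_per_m_output_tokens["mistral-medium"] / usd_per_m_output_tokens["gpt-3.5-turbo-1106"]
medium_vs_4t = usd_per_m_output_tokens["mistral-medium"] / usd_per_m_output_tokens["gpt-4-1106-preview"]
```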