
If you take input tokens into consideration, it's more like 5.25 EUR vs. 1.5 EUR per million tokens overall.

Mistral-small seems to be the most direct competitor to GPT-3.5, and it's cheaper (1.2 EUR / million tokens).

Note: I’m assuming equal weight for input and output tokens, and cannot see the prices in USD :/
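
For anyone checking the arithmetic, the blended figure is just a weighted average of the input and output prices. A tiny sketch (the prices in the example are placeholders, not actual list prices; the 50/50 split is the equal-weight assumption above):

    def blended_price(input_eur_per_m, output_eur_per_m, input_share=0.5):
        # Weighted average price per million tokens, given the share of
        # traffic that is input vs. output.
        return input_share * input_eur_per_m + (1 - input_share) * output_eur_per_m

    # Placeholder prices per million tokens, for illustration only:
    print(blended_price(0.6, 1.8))                    # -> 1.2 with an equal split
    print(blended_price(0.6, 1.8, input_share=0.75))  # -> 0.9 if traffic is input-heavy

In practice the blend depends heavily on how input-heavy your workload is, so the ranking can shift either way.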



Does the 8x7B model really perform at a GPT-3.5 level? That means we might see GPT-3.5 models running locally on our phones in a few years.


That might be happening in a few weeks. There is a credible claim that this model might be compressible to as little as a 4GB memory footprint.


You mean the 7B one? That's exciting if true, but if compression means it can only do 0.1 token/sec, it doesn't do much for anyone.


No, I am referring to the 7Bx8 MoE model. The MoE layers apparently can be sparsified (or equivalently, quantized down to a single bit per weight) with minimal loss of quality.

Inference on a quantized model is faster, not slower.

However, I have no idea how practical it is to run an LLM on a phone. I think it would run hot and drain the battery.
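
As a sanity check on that 4GB figure, here is the back-of-envelope arithmetic. The ~46.7B total / ~12.9B active parameter counts are the published Mixtral numbers; the 4GB target is just the claim above, and KV cache / runtime overhead are ignored:

    # What average bits-per-weight does a 4 GiB footprint imply
    # for a ~46.7B-parameter model? (Weights only.)
    TOTAL_PARAMS = 46.7e9          # Mixtral 8x7B total parameters
    TARGET_BYTES = 4 * 2**30       # the claimed 4GB footprint, read as 4 GiB

    print(f"{TARGET_BYTES * 8 / TOTAL_PARAMS:.2f} bits/weight on average")  # ~0.74

    # Footprints at common quantization levels, for comparison:
    for bits in (16, 8, 4, 2, 1):
        print(f"{bits:>2} bits/weight -> {TOTAL_PARAMS * bits / 8 / 2**30:.1f} GiB")
    # 16 -> ~87 GiB, 4 -> ~21.7 GiB, 1 -> ~5.4 GiB

So 4GB implies an average well under one bit per weight: the 1-bit treatment would have to cover essentially all of the expert weights, with the shared layers kept very compact on top.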


Really? Well, that's very exciting. I don't care about wasting my battery if it can do my menial tasks for me; battery is a currency I'd gladly spend for this use case.


I think we're at least a couple of generations away from this being feasible for these models, unless, say, it's for performing a limited number of background tasks at fairly slow inference speed. SoC power draw limits will probably cap inference speed at about 2-5 tok/sec (the lower end for Mixtral, which has the processing requirements of a ~14B model) and would suck an iPhone Max dry in about an hour.
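
To put rough numbers on the battery point (all of these are placeholder assumptions, not measurements):

    # Rough battery math for continuous on-device inference.
    battery_wh  = 17.0   # assumed ~17 Wh pack for a large flagship phone
    soc_power_w = 8.0    # assumed sustained SoC draw while decoding
    tok_per_sec = 3.0    # middle of the 2-5 tok/s estimate above

    runtime_h = battery_wh / soc_power_w
    tokens    = runtime_h * 3600 * tok_per_sec
    print(f"~{runtime_h:.1f} h of inference, ~{tokens:,.0f} tokens")
    # ~2.1 h and ~23,000 tokens under these assumptions; heavier sustained
    # draw or thermal throttling pushes it toward the one-hour figure above.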


Maybe this way, we'll not just get user-replaceable batteries in smartphones back - maybe we'll get hot-swappable batteries for phones, as everyone will be happy to carry a bag of extra batteries if it means using advanced AI capabilities for the whole day, instead of 15 minutes.


Or maybe we'll finally get better batteries!

Though I guess that's not for lack of trying.


Not true. Not everyone is building a chatbot or similar interface that requires output latency low enough for an interactive user. While your example is of course incredibly slow, there are still many interesting things that could be done if it were a little bit quicker.


What kind of use cases run in an environment where latency isn't important (some kind of batch process?) but don't have more than 4GB of RAM?


Price sensitive ones, or cases where you want the new capability but can't get any new infrastructure.


Not LLMs, but locally running facial and object recognition models on your phone's gallery, to build up a database for face/object search in the gallery app? I'm half-convinced this is how Samsung does it, but I can't really be sure of much, because all the photo AI stuff works weirdly and in an unobservable way, probably because of some EU ruling.

(That one is a curious case. I once spent some time trying to figure out why no major photo app seems to support manually tagging faces, which is a mind-numbingly obvious feature and one that software supported a decade or so ago. I couldn't find anything definitive; there's an eerie conspiracy of silence on the topic that made me doubt my own sanity at times. Eventually, I dug up hints that some EU ruling/regs related to facial recognition led everyone to remove or geolock the feature. Still nothing specific, though.)
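
For what it's worth, the on-device pipeline being described is roughly "embed every photo locally, then nearest-neighbour search". A minimal sketch, with embed() as a stand-in for whatever face/object model the gallery app actually ships:

    import numpy as np

    def embed(image_path: str) -> np.ndarray:
        # Placeholder: pretend we ran an on-device face/object embedding model.
        rng = np.random.default_rng(abs(hash(image_path)) % 2**32)
        v = rng.normal(size=128)
        return v / np.linalg.norm(v)

    def build_index(image_paths):
        # One embedding per photo, computed and stored locally.
        return {p: embed(p) for p in image_paths}

    def search(index, query_path, top_k=5):
        # Cosine similarity (vectors are unit length, so a dot product suffices).
        q = embed(query_path)
        scored = sorted(index.items(), key=lambda kv: -float(kv[1] @ q))
        return [path for path, _ in scored[:top_k]]

    gallery = [f"gallery/img_{i}.jpg" for i in range(100)]
    index = build_index(gallery)
    print(search(index, "gallery/img_42.jpg"))

None of that needs low latency: the indexing can trickle along overnight on the charger, which is exactly the kind of batch workload being discussed.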


How/where do you stay up to date with this stuff?


https://www.reddit.com/r/LocalLLaMA/ is pretty good. It's a bit fanboy-ey, but those kinds of communities are where you get the good news first.



