No, I am referring to the 7Bx8 MoE model. The MoE layers apparently can be sparsified (or equivalently, quantized down to a single bit per weight) with minimal loss of quality.
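For concreteness, here's a toy sketch of what "a single bit per weight" means in practice: each weight collapses to its sign, plus a per-row float scale (roughly the BitNet-style scheme; the variable names and shapes here are illustrative, not from Mixtral's actual code):

```python
import numpy as np

# Toy 1-bit (sign) quantization of a weight matrix, with a per-row scale.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)  # stand-in expert weight matrix

scale = np.abs(W).mean(axis=1, keepdims=True)  # one float per output row
W_1bit = np.sign(W)                            # every weight becomes -1 or +1
W_hat = scale * W_1bit                         # dequantized approximation

rel_err = np.abs(W - W_hat).mean() / np.abs(W).mean()
print(f"relative L1 error: {rel_err:.2f}")
```

The storage win is what matters: 16 bits per weight down to 1 bit plus a handful of scales, which is why the claim is "minimal loss of quality" rather than "no loss."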
Inference on a quantized model is faster, not slower.
However, I have no idea how practical it is to run an LLM on a phone. I think it would run hot and waste the battery.
Really? Well, that's very exciting. I don't care about wasting my battery if it can do my menial tasks for me; battery is a currency I'd gladly pay for this use case.
I think we're at least a couple of generations away from this being feasible for these models, unless it's limited to a small number of background tasks at fairly slow inference speed. SoC power-draw limits will probably cap inference at about 2-5 tok/sec (the lower end for Mixtral, which has the compute requirements of a ~14B dense model) and would suck an iPhone Max dry in about an hour.
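A quick back-of-the-envelope check on the "dry in about an hour" figure. The battery capacity and sustained power draw below are rough assumptions on my part, not measurements:

```python
# Assumed numbers, not measured: treat both as ballpark figures.
battery_wh = 16.7        # roughly an iPhone Pro Max class battery
sustained_draw_w = 15.0  # aggressive sustained SoC draw during inference

hours = battery_wh / sustained_draw_w
tokens_low = 2 * 3600 * hours   # at 2 tok/sec
tokens_high = 5 * 3600 * hours  # at 5 tok/sec

print(f"full drain in ~{hours:.1f} h")
print(f"~{tokens_low:.0f}-{tokens_high:.0f} tokens per full charge")
```

So under those assumptions you'd get on the order of 8k-20k tokens out of a full charge, which is why this only makes sense for a limited set of background tasks.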
Maybe this way, we'll not just get user-replaceable batteries in smartphones back - maybe we'll get hot-swappable batteries for phones, as everyone will be happy to carry a bag of extra batteries if it means using advanced AI capabilities for the whole day, instead of 15 minutes.
Not true. Not everyone is building a chatbot or similar interface that needs output latency low enough for an interactive user. While your example is of course incredibly slow, there are still many interesting things that could be done if it were a bit quicker.
Not LLMs, but what about locally running facial- and object-recognition models over your phone's gallery, building up a database for face/object search in the gallery app? I'm half-convinced this is how Samsung does it, but I can't really be sure of much, because all the photo AI stuff works weirdly and in an unobservable way, probably because of some EU ruling.
(That one is a curious case. I once spent some time trying to figure out why no major photo app seems to support manually tagging faces, which is a mind-numbingly obvious feature to support, and which software shipped a decade or so ago. I couldn't find anything definitive; there's this eerie conspiracy of silence on the topic that made me doubt my own sanity at times. Eventually, I dug up hints that some EU ruling/regs related to facial recognition led everyone to remove or geolock this feature. Still nothing specific, though.)
Mistral-small seems to be the most direct competitor to GPT-3.5, and it’s cheaper (€1.20 / million tokens, blended).
Note: I’m assuming equal weight for input and output tokens, and cannot see the prices in USD :/
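The blend itself is just an equal-weight average of the input and output prices. The €0.60/€1.80 pair below is an illustrative example that reproduces the €1.20 figure, not a quoted price list:

```python
# Blended per-million-token price, assuming equal input and output token counts.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Average cost per million tokens for a 50/50 input/output mix."""
    return (input_per_m + output_per_m) / 2

# e.g. an input price of 0.60 and output price of 1.80 blend to 1.20
print(blended_price(0.60, 1.80))
```

In practice most workloads aren't 50/50 (prompts are usually much longer than completions), so the real blended cost depends on your input/output ratio.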