No, I am referring to the 7Bx8 MoE model. The MoE layers apparently can be sparsified (or equivalently, quantized down to a single bit per weight) with minimal loss of quality.
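For concreteness, here's a toy sketch of what "a single bit per weight" means in practice: each weight collapses to its sign, plus a per-row float scale (roughly the BitNet-style scheme; the variable names and shapes here are illustrative, not from Mixtral's actual code):

```python
import numpy as np

# Toy 1-bit (sign) quantization of a weight matrix, with a per-row scale.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)  # stand-in expert weight matrix

scale = np.abs(W).mean(axis=1, keepdims=True)  # one float per output row
W_1bit = np.sign(W)                            # every weight becomes -1 or +1
W_hat = scale * W_1bit                         # dequantized approximation

rel_err = np.abs(W - W_hat).mean() / np.abs(W).mean()
print(f"relative L1 error: {rel_err:.2f}")
```

The storage win is what matters: 16 bits per weight down to 1 bit plus a handful of scales, which is why the claim is "minimal loss of quality" rather than "no loss."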
Inference on a quantized model is faster, not slower.
However, I have no idea how practical it is to run an LLM on a phone. I think it would run hot and waste the battery.
Really? Well, that's very exciting. I don't care about wasting my battery if it can do my menial tasks for me; battery is a currency I'd gladly pay for this use case.
I think we're at least a couple of generations away from this being feasible for these models, unless it's limited to a small number of background tasks at fairly slow inference speed. SoC power-draw limits will probably cap inference at about 2-5 tok/sec (the lower end for Mixtral, which has the compute requirements of a ~14B dense model) and would suck an iPhone Max dry in about an hour.
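A quick back-of-the-envelope check on the "dry in about an hour" figure. The battery capacity and sustained power draw below are rough assumptions on my part, not measurements:

```python
# Assumed numbers, not measured: treat both as ballpark figures.
battery_wh = 16.7        # roughly an iPhone Pro Max class battery
sustained_draw_w = 15.0  # aggressive sustained SoC draw during inference

hours = battery_wh / sustained_draw_w
tokens_low = 2 * 3600 * hours   # at 2 tok/sec
tokens_high = 5 * 3600 * hours  # at 5 tok/sec

print(f"full drain in ~{hours:.1f} h")
print(f"~{tokens_low:.0f}-{tokens_high:.0f} tokens per full charge")
```

So under those assumptions you'd get on the order of 8k-20k tokens out of a full charge, which is why this only makes sense for a limited set of background tasks.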
Maybe this way, we'll not just get user-replaceable batteries in smartphones back - maybe we'll get hot-swappable batteries for phones, as everyone will be happy to carry a bag of extra batteries if it means using advanced AI capabilities for the whole day, instead of 15 minutes.
Not true. Not everyone is building a chatbot or similar interface that needs output latency low enough for an interactive user. While your example is of course incredibly slow, there are still many interesting things that could be done if it were a bit quicker.
Not LLMs, but what about locally running facial- and object-recognition models over your phone's gallery, building up a database for face/object search in the gallery app? I'm half-convinced this is how Samsung does it, but I can't really be sure of much, because all the photo AI stuff works weirdly and in an unobservable way, probably because of some EU ruling.
(That one is a curious case. I once spent some time trying to figure out why no major photo app seems to support manually tagging faces, which is a mind-numbingly obvious feature to support, and which software shipped a decade or so ago. I couldn't find anything definitive; there's this eerie conspiracy of silence on the topic that made me doubt my own sanity at times. Eventually, I dug up hints that some EU ruling/regs related to facial recognition led everyone to remove or geolock this feature. Still nothing specific, though.)
Mistral-small seems to be the most direct competitor to GPT-3.5, and it’s cheaper (€1.20 / million tokens, blended).
Note: I’m assuming equal weight for input and output tokens, and cannot see the prices in USD :/
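The blend itself is just an equal-weight average of the input and output prices. The €0.60/€1.80 pair below is an illustrative example that reproduces the €1.20 figure, not a quoted price list:

```python
# Blended per-million-token price, assuming equal input and output token counts.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Average cost per million tokens for a 50/50 input/output mix."""
    return (input_per_m + output_per_m) / 2

# e.g. an input price of 0.60 and output price of 1.80 blend to 1.20
print(blended_price(0.60, 1.80))
```

In practice most workloads aren't 50/50 (prompts are usually much longer than completions), so the real blended cost depends on your input/output ratio.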