Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Some teams have researched ways to do this.

For instance, you can have a smaller model generate ten tokens in sequence, and then ask the larger mode "given these N tokens, what is the token N+1" ten times in parallel.

If the large and small model agree on, say, the first 7 tokens, then you keep these and throw the next 3 away and start over. So you still have to run the large model for each token, but you can at least do batch calculations (which is a lot more efficient, because loading layer weights is the bottleneck, not matrix ops).



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: