Hacker News

This is great. Now, how do we run inference on these models economically? There appears to be some kind of competition to train larger and larger models, but the inference side of the story seems to be neglected.


Model inference is actually comparatively very cheap. If you have the resources to train a model, you most definitely have the resources to run it.


Not necessarily. You train once, but you may run inference billions of times. The total compute required could be beyond your resources.
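A rough back-of-the-envelope sketch of that trade-off, assuming the common rules of thumb of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per token at inference time. The 300B training-token figure is a made-up placeholder, not a number from the article, and the function names are purely illustrative.

```python
def training_flops(n_params: int, n_train_tokens: int) -> int:
    # Rule-of-thumb assumption: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_train_tokens


def inference_flops(n_params: int, n_tokens: int) -> int:
    # Rule-of-thumb assumption: ~2 FLOPs per parameter per processed token.
    return 2 * n_params * n_tokens


def break_even_tokens(n_params: int, n_train_tokens: int) -> int:
    # Number of inference tokens at which cumulative inference compute
    # equals the one-off training compute.
    return training_flops(n_params, n_train_tokens) // (2 * n_params)


if __name__ == "__main__":
    n_params = 530 * 10**9        # 530B parameters, as in the thread
    n_train_tokens = 300 * 10**9  # hypothetical placeholder
    print(break_even_tokens(n_params, n_train_tokens))
```

Under these assumptions the break-even point is simply three times the number of training tokens, regardless of model size, so a service that processes enough tokens will eventually spend more on inference than was spent on training.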


Does that hold as the workload scales up? E.g. could this or similar models be used as part of a general-purpose search engine whereby (at least) one inference is completed per unique search? Aside from computation, I know these models consume an intense amount of memory -- would that scale horizontally easily / economically? Would it need to?


Google is using BERT for most/all search queries [1]. BERT is far smaller than Megatron (340M < 530B), but still "big" in a traditional sense (in the blog they say they are using TPUs for inference).

[1] https://blog.google/products/search/search-language-understa...
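To put the 340M vs. 530B gap in memory terms, here is a minimal sketch that only counts the weights (no activations or caches) and assumes 2 bytes per parameter (fp16). The 80 GB device size is an illustrative assumption, not something from the blog post.

```python
import math


def weight_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    # Memory for the weights alone, in GB (10**9 bytes).
    return n_params * bytes_per_param / 1e9


def min_devices(n_params: int, device_memory_gb: float, bytes_per_param: int = 2) -> int:
    # Lower bound on devices needed just to hold the weights.
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / device_memory_gb)


if __name__ == "__main__":
    for name, n_params in [("BERT-large", 340 * 10**6), ("530B model", 530 * 10**9)]:
        gb = weight_memory_gb(n_params)
        # 80 GB per accelerator is a hypothetical device size.
        print(f"{name}: {gb:.2f} GB of weights, at least {min_devices(n_params, 80)} device(s)")
```

With these assumptions the smaller model fits on a single device many times over, while the 530B model has to be split across more than a dozen devices before a single query can be served.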


Don't forget that Google is not stopping at BERT. I dunno if they used any of the T5s or Switch Transformers in production, but they've said that MUM (O(100b)?) is going to run for production search queries, and no one knows what 'Pathways', which Jeff Dean has been enthusing about, actually is (multimodal O(1000b) MoE?).


Interesting, though as you noted it's less than 1/1000th the size of the 530B model and according to the article is only used in about one in ten U.S.-based English-language searches, at least when that was written.


When you say "inference", do you mean "interface", or is "inference" an ML term I'm not familiar with?


It's an ML term. Inference basically means using the probability model you learned to draw "inferences" about a piece of data. In this context, it means giving the language model some context and using some decoding method (either greedy arg max decoding or something more sophisticated like beam search) to do what amounts to statistical autocompletion on it. As you might imagine, doing this with 530 billion parameters at speed is quite energy intensive, even though there are things you can do to compress the model (distillation, pruning, compression/quantization) and there is specialised inference hardware.

Technically there is a very specific distinction between inference and prediction, but the term has been heavily overloaded by now.
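For the curious, a minimal sketch of the arg max (greedy) decoding mentioned above. The toy next-token table stands in for a real language model and is entirely made up, as are the function names. Beam search would differ by keeping several candidate sequences at each step instead of one.

```python
from typing import Dict, List

# Hypothetical toy "language model": maps the last token to a
# probability distribution over the next token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "cat": 0.3},
    "a": {"model": 0.5, "cat": 0.5},
    "model": {"runs": 0.8, "</s>": 0.2},
    "cat": {"runs": 0.4, "</s>": 0.6},
    "runs": {"</s>": 1.0},
}


def next_token_probs(context: List[str]) -> Dict[str, float]:
    # A real model would condition on the whole context; this toy
    # version only looks at the last token.
    return TOY_MODEL[context[-1]]


def greedy_decode(context: List[str], max_new_tokens: int = 10) -> List[str]:
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # arg max over next tokens
        tokens.append(best)
        if best == "</s>":  # stop at the end-of-sequence token
            break
    return tokens


if __name__ == "__main__":
    print(" ".join(greedy_decode(["<s>"])))
```

Each loop iteration is one forward pass of the model, which is why generating long outputs from a very large model adds up quickly.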


First you train a model, then you use it. "Inference" is a fancy word for using the model.



