This is great. Now, how do we inference these models economically? It appears th...

dwohnitmok · on Oct 11, 2021

Model inference is actually comparatively very cheap. If you have the resources to train a model, you most definitely have the resources to run it.

manquer · on Oct 12, 2021

Not necessarily. You train once, but you may run inference billions of times. The compute required could be beyond your resources.

shock-value · on Oct 12, 2021

Does that hold as the workload scales up? E.g. could this or similar models be used as part of a general-purpose search engine whereby (at least) one inference is completed per unique search? Aside from computation, I know these models consume an intense amount of memory -- would that scale horizontally easily / economically? Would it need to?

niklasd · on Oct 12, 2021

Google is using BERT for most/all search queries [1]. BERT is far smaller than Megatron (340M < 530B), but still "big" in a traditional sense (in the blog they say they are using TPUs for inference).

[1] https://blog.google/products/search/search-language-understa...
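For a rough sense of that size gap, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes 2 bytes per parameter (fp16), which is my assumption and not something stated in the thread or the blog post, and it ignores activations and any other runtime overhead. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (fp16). Activations, caches and
    framework overhead are deliberately ignored.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the comment above.
BERT_LARGE_PARAMS = 340e6
MEGATRON_PARAMS = 530e9

bert_large_gb = weight_memory_gb(BERT_LARGE_PARAMS)
megatron_gb = weight_memory_gb(MEGATRON_PARAMS)
size_ratio = MEGATRON_PARAMS / BERT_LARGE_PARAMS

print(f"BERT-large weights: ~{bert_large_gb:.2f} GB")
print(f"530B model weights: ~{megatron_gb:.0f} GB")
print(f"Size ratio: ~{size_ratio:.0f}x")
```

Under that assumption the smaller model fits comfortably on one accelerator, while the 530B weights alone are on the order of a terabyte and have to be split across many devices.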

gwern · on Oct 12, 2021

Don't forget that Google is not stopping at BERT. I dunno if they used any of the T5s or Switch Transformers in production, but they've said that MUM (O(100b)?) is going to run for production search queries, and no one knows what 'Pathways' is (multimodal O(1000b) MoE?) that Jeff Dean has been enthusing about.

shock-value · on Oct 12, 2021

Interesting, though as you noted it's less than 1/1000th the size of the 530B model and according to the article is only used in about one in ten U.S.-based English-language searches, at least when that was written.

buffington · on Oct 11, 2021

When you say "inference", do you mean "interface", or is "inference" an ML term I'm not familiar with?

igorkraw · on Oct 11, 2021

It's an ML term. Inference basically means using the probability model you learned to draw "inferences" about a piece of data. In this context, it means giving the language model some context and using some method (either greedy arg max decoding or something more sophisticated like beam search) to do what amounts to statistical autocompletion on it. As you might imagine, doing this with 530 billion parameters at speed is quite energy intensive, even though there are things you can do to compress the model (distillation, pruning, quantization/discretization) and there is specialised inference hardware.

Technically there is a very specific meaning to inference vs. prediction, but the term has been heavily overloaded by now.
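To make the "arg max" option concrete, here is a minimal sketch of greedy decoding. The toy lookup table `TOY_MODEL` and the function names are hypothetical stand-ins for a real language model, which would compute next-token probabilities from the whole context with a neural network. Beam search would differ by keeping several candidate sequences at each step instead of one.

```python
from typing import Dict, List

# Hypothetical toy "language model": next-token probabilities that depend
# only on the previous token. A real model conditions on the full context.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "cat": 0.3},
    "a": {"model": 0.6, "cat": 0.4},
    "model": {"runs": 0.8, "</s>": 0.2},
    "cat": {"runs": 0.4, "</s>": 0.6},
    "runs": {"</s>": 1.0},
}


def next_token_probs(context: List[str]) -> Dict[str, float]:
    """Stand-in for one forward pass of the model (one inference step)."""
    return TOY_MODEL[context[-1]]


def greedy_decode(context: List[str], max_new_tokens: int = 10) -> List[str]:
    """Repeatedly append the single most probable next token (arg max)."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # arg max over the vocabulary
        if best == "</s>":  # end-of-sequence token: stop generating
            break
        tokens.append(best)
    return tokens


completion = greedy_decode(["<s>"])
print(" ".join(completion[1:]))
```

Each loop iteration is one pass through the model, which is why generating long outputs from a very large model gets expensive.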

sanity · on Oct 11, 2021

First you train a model, then you use it. "Inference" is a fancy word for using the model.