This is great. Now, how do we run inference on these models economically? There appears to be some kind of competition to train larger and larger models, but the inference side of the story seems to be neglected.
Does that hold as the workload scales up? E.g. could this or similar models be used as part of a general-purpose search engine whereby (at least) one inference is completed per unique search? Aside from computation, I know these models consume an intense amount of memory -- would that scale horizontally easily / economically? Would it need to?
Google is using BERT for most/all search queries [1]. BERT is far smaller than Megatron (340M < 530B), but still "big" in a traditional sense (in the blog they say they are using TPUs for inference).
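To put the size gap in memory terms, here is a rough back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter) and a hypothetical 80 GB accelerator, and it counts the weights only, ignoring activations and any other runtime overhead. The function names are made up for illustration.

```python
import math

def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_devices(num_params, device_memory_gb, bytes_per_param=2):
    """Lower bound on devices needed to fit the weights alone."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / device_memory_gb)

BERT_LARGE = 340e6   # parameters, figure from the comment above
MEGATRON = 530e9     # parameters, figure from the comment above
DEVICE_GB = 80       # assumption: one hypothetical 80 GB accelerator

for name, n in [("BERT-large", BERT_LARGE), ("530B model", MEGATRON)]:
    print(f"{name}: {weight_memory_gb(n):.2f} GB of weights, "
          f">= {min_devices(n, DEVICE_GB)} device(s)")
```

Under those assumptions BERT-large fits on a single device many times over, while the 530B model has to be split across more than a dozen devices before it can serve a single query.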
Don't forget that Google is not stopping at BERT. I dunno if they used any of the T5s or Switch Transformers in production, but they've said that MUM (O(100b)?) is going to run for production search queries, and no one knows what 'Pathways' is (multimodal O(1000b) MoE?) that Jeff Dean has been enthusing about.
Interesting, though as you noted it's less than 1/1000th the size of the 530B model and according to the article is only used in about one in ten U.S.-based English-language searches, at least when that was written.
It's an ML term: inference basically means using the probability model you learned to draw "inferences" about a piece of data. In this context, it means giving the language model some context and using some decoding method (either greedy arg max decoding or something more sophisticated like beam search) to do what amounts to statistical autocompletion on it. As you might imagine, doing this with 530B parameters (roughly 1 TB of weights at 16-bit precision) at speed is quite energy intensive, even though there are things you can do to shrink the model (distillation, pruning, quantization) and there is specialised inference hardware.
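For a concrete feel of the two decoding methods, here is a minimal sketch on a made-up toy model. `TOY_MODEL` is a hypothetical bigram table invented for this example and stands in for a real language model. Greedy decoding takes the arg max token at every step, while beam search keeps the `beam_width` best partial sequences and can end up with a more probable sentence overall.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.6, "</s>": 0.4},
    "dog": {"sat": 0.2, "</s>": 0.8},
    "sat": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", end="</s>", max_len=10):
    """Pick the single most probable next token at every step."""
    tokens = [start]
    while tokens[-1] != end and len(tokens) < max_len:
        dist = model[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
    return tokens

def beam_search(model, start="<s>", end="</s>", beam_width=2, max_len=10):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(0.0, [start])]  # (log-probability, tokens so far)
    for _ in range(max_len - 1):
        candidates = []
        for logp, toks in beams:
            if toks[-1] == end:
                candidates.append((logp, toks))  # finished, carry over unchanged
                continue
            for tok, p in model[toks[-1]].items():
                candidates.append((logp + math.log(p), toks + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(toks[-1] == end for _, toks in beams):
            break
    return beams[0][1]

print(greedy_decode(TOY_MODEL))
print(beam_search(TOY_MODEL))
```

On this toy table the greedy path commits to "the" early and ends with a sequence of probability 0.18, while the beam keeps "a" alive and finds a sequence of probability 0.224. In a real model each step is a full forward pass through the network, which is where the inference cost comes from.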
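Technically, inference and prediction have distinct meanings (in statistics, inference is about estimating a model's parameters from data, while prediction is about applying the fitted model to new inputs), but the term has been heavily overloaded by now.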