Hacker News
Megatron-Turing NLG 530B, the World’s Largest Generative Language Model (nvidia.com)
116 points by selimonder on Oct 11, 2021 | hide | past | favorite | 99 comments


So we now have models with 0.5 trillion parameters, each the weight of a connection in a neural network.

Trillion-parameter models are surely within reach in the near term -- and that's only within two orders of magnitude of the number of synapses in the human brain, which is in the hundreds of trillions, give or take. To paraphrase the popular saying, a trillion here, a trillion there, and pretty soon you're talking really big numbers.

I know the figures are not comparable apples-to-apples, but still, I find myself in awe looking at how far we've come in just the last few years, to the point that we're realistically contemplating the possibility of seeing dense neural networks with hundreds of trillions of parameters used for real-world applications in our lifetime.

We sure live in interesting times.


I don't understand this kind of comment. To my mind what it amounts to is "look at how big it is". Alright. So it's big. So what? Is this an elephant pageant?

Suppose a friend comes over and says "I went for dinner at a restaurant. Oh my god the portions were sooo big!". Wouldn't you want to know more information about the food and the restaurant, before you decided whether you're interested in it?

I appreciate that "big" is associated in people's minds with "strong", but most of the work in making language models bigger and bigger goes against the normal trend in computer science [1], and also in neural network research in general, where the trend is to constantly reduce the size of models and improve their data efficiency.

What's worse, the trend to supersize language models is never justified, either theoretically (ha ha) or empirically in the relevant literature - and when rival teams run the obvious experiments, the evidence is that size is not required to achieve good performance. For example:

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

https://aclanthology.org/2021.naacl-main.185/

____________

[1] Imagine someone bragging that their mergesort implementation has a million LOC! People brag about implementations in few lines of code, not many.
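(To be fair to the joke: the textbook algorithm really does fit in a dozen or so lines of Python. A minimal sketch:)

```python
def mergesort(xs):
    """Classic mergesort: split, recurse, merge sorted halves."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```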


> What's worse, the trend to supersize language models is never justified, either theoretically (ha ha) or empirically in the relevant literature

What? That's absurd. Large language models are motivated by empirical scaling law. It is actually better justified than other ML research.

Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
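The paper's claim, roughly, is that test loss falls as a power law in model size, L(N) ∝ N^(-α). A sketch of how such an exponent is estimated by a log-log linear fit, using made-up (model size, loss) pairs rather than the paper's actual data:

```python
import math

# Hypothetical (parameter count, test loss) pairs, roughly power-law shaped.
data = [(1e6, 5.0), (1e7, 3.5), (1e8, 2.45), (1e9, 1.7)]

# Fit log(loss) = c - alpha * log(N) by closed-form least squares.
xs = [math.log(n) for n, _ in data]
ys = [math.log(l) for _, l in data]
k = len(data)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
alpha = -slope  # estimated scaling exponent
```

The point of the empirical scaling laws is that fits like this hold remarkably cleanly across many orders of magnitude of model size, which is the justification for building ever-larger models.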


> I appreciate that "big" is in peoples' minds associated with "strong"

In my mind, this is now called Pakled reasoning, from the scene in Star Trek: Lower Decks.

  Pakled rebel turned leader:
  "I am now Pakled leader. Behold my giant helmet!"

  Other Pakled:
  "He is strong!"
https://www.youtube.com/watch?v=lv1uhAa_M_U&t=193s


> I don't understand this kind of comment. To my mind what it amounts to is "look at how big it is". Alright. So it's big. So what?

It's a good question. A while ago, Rich Sutton wrote a good answer for it.: http://incompleteideas.net/IncIdeas/BitterLesson.html -- I recommend reading the whole essay. Quoting him (emphasis mine):

> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.

> We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

A key related question -- to which no one has the answer today -- is whether we must scale computation to match or exceed that of the human brain to be able to replicate or surpass its cognitive abilities. (Note that this question is independent of whether doing so would require future theoretical breakthroughs -- another question to which no one knows the answer today.)

PS. See also sanxiyn's response: https://news.ycombinator.com/item?id=28838745


Yes, I've read "The Bitter Lesson". Have you read "A better lesson", by Rodney Brooks?

Edit:

>> A key related question -- to which no one has the answer today -- is whether we must scale computation to match or exceed that of the human brain to be able to replicate or surpass its cognitive abilities.

What "computation" is that? Are you talking about scaling up neural networks, which is more in the context of the conversation, but requires some very big assumptions about (artificial) neural networks? Do you mean a different kind of computation?

(Note: my comment, plus the above edit, is a series of questions, and I recognise that comments like that can come across as standoffish. This is not my intention, so please accept the questions above as having been asked in the most neutral tone possible and in the interest of promoting conversation rather than confrontation.)


> Are you talking about scaling up neural networks, which is more in the context of the conversation, but requires some very big assumptions about (artificial) neural networks?

Yes. But note that under the rubric of "deep neural networks" or "deep learning," I would include a lot of things, including combinations of methods like "deep reinforcement learning," learning by self-play via gradual evolution of surviving models, models that use "dense associative memories," of which transformers are only one special case, and future deep learning methods that have not yet been discovered.

And yes, some very big assumptions are required!

FWIW, your comments did not come across as standoffish to me :-)


New AI fallacy: appeal to size


Except that every synapse is not a dumb weight but a highly complex system connected to an even more complex system (aka neuron) which might each be a (super)computer on its own.

Given how extremely bad we are at computing, there is hope (for ai) that the neurons or their circuits are not _that_ powerful after all.


Emulating a neuron != taking a comparable part in a computation. Probably the former is a lot more complex. For instance, an artificial net can take advantage of backpropagation in a separated training phase -- that's a lot of complexity that's factored out of the runtime phase.


Wonder how this architecture is limiting the space, though - all biological brains train continuously. Our DNNs are more like a brain upload snapshot that's always run for one cycle and then rebooted.


Yeah, that's worth exploring more. It's just that it's not safe to depend on "real neurons are complicated, and therefore artificial nets of simple units won't have transformative capabilities".


Last I heard (and I believe this could be wrong), my professor said that we basically understand how a single neuron works: that, roughly, if we give it input X we get output Y, up to some accuracy. He used this to motivate the idea behind neural networks -- that each neuron is simple enough to model, so all we need to worry about is the weights and the dynamics of the network as a whole.

How much of a simplification is that? And how much does the accuracy of such a model matter, in the grand scheme of things?


oh finally something that I learned a lot about :)

Such research is the domain of computational neuroscience - one thing researchers there do is try to model parts of the brain (or even just a single neuron) with computers.

A neuron (= nerve cell in the brain) is a very complex beast. In rough terms they work like this: they collect signals (electrical impulses) via their small appendages called dendrites. When the sum of the signals reaches a certain threshold, a large electrical impulse is generated at the cell body, which travels through its "output" appendage (called the axon) that connects to another neuron's cell body or dendrites.
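The basic "sum inputs, fire past a threshold" behaviour described above can be sketched with a toy leaky integrate-and-fire model (parameter names and values here are purely illustrative, nothing like the detailed multi-compartment models used in real simulators):

```python
def lif_step(v, inputs, leak=0.9, threshold=1.0):
    """One time step of a toy leaky integrate-and-fire neuron.

    v: current membrane potential; inputs: synaptic inputs arriving
    this step. Returns (new_potential, spiked).
    """
    v = leak * v + sum(inputs)  # integrate inputs, with decay toward rest
    if v >= threshold:
        return 0.0, True        # fire a spike and reset
    return v, False

# Sub-threshold input decays away rather than accumulating forever.
v = 0.0
for _ in range(3):
    v, spiked = lif_step(v, [0.3])
```

Real neurons add all the complications listed next (morphology, input location, bursting, neuromodulation); this toy ignores every one of them.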

Neurons display a dazzling variety in all these parameters:

- In morphology, e.g. they can look like a pine tree http://www.scholarpedia.org/article/Pyramidal_neuron (I really recommend scholarpedia, also this article has a nice animation on how electrical impulses propagate) or like a sea urchin.

- It really matters where the cell gets its impulse from: a neuron stimulated near its cell body will be much more sensitive to the input than one stimulated far away.

- Their response characteristics are wildly varied too. Some give off one large impulse, some a quick burst of impulses. Some prevent others from firing when stimulated (inhibitory neurons).

- This whole mess can be modulated with chemical compounds that are released by the body -- some make some neurons more sensitive, some less.

- Also, every year we still discover some new mechanism that modulates how they function.

The issue is that this results in such a complex system that a modern PC can't even simulate one detailed neuron model in real time (these tools are open source, try them out! For example https://neuron.yale.edu/ ). Now, we know that we're simulating things that likely do not matter (e.g. we don't need a neuron model that consists of 10,000+ segments), but we do not know which parts we can remove and still have a faithful simulation. We might also simply be simulating some parts wrong because our knowledge of the subject is insufficient.

But on the upside, we've achieved some great things already; for example, we know how the brain computes, from our head and eye position, the orientation of the things we're looking at.


> All we need to worry about is the weights and the dynamics of the network as a whole. How much of a simplification is that?

A lot. Parallel optimization is an art form. These models are trained on static datasets, they can't intervene in the environment to infer causal relations, so they need legs and hands.



I would say quite a bit. Adding even a third body makes it impossible to solve the physics exactly. A complex system with any number of individual components is hard to understand with certainty, and the calculations can become exponentially more complex.


Citing the fact that the 3 body problem doesn't always have an exact solution is a straw man argument.


The intent was to illustrate that the complexity of a system of simple components can be pretty formidable. Automata theory has more appropriate examples, perhaps.


Your professor lied.


That's a correct understanding as of 1943, when the "artificial neural network" model your professor is teaching was developed.

There is a whole lot of new knowledge on how live neurons and networks of neurons work that has been collected in the last 75 years in the neuroscience domain, but it's mostly ignored by computer scientists.


I’d be really interested in learning more about this. Can you point me to some easily grokable literature?



Won't somebody think of the exosomes and telocytes?


My issue with this kind of reasoning is the comparison and reference to the human brain. The potential and reach of AI transcends the brain. We never had to master the "mystery" of how birds fly to invent aviation. It was never necessary to compare the number of turbine revolutions of early airplanes to the number of an eagle's feathers. Maybe birds were an inspiration or a metaphor, but thankfully aviation has not been limited to the means of propulsion of the beautiful yet humble pigeon. The potential of aviation has taken us into space exploration and massive international travel. I don't know where AI will take us, but I don't think it will be constrained by this temporary organ called 'human brain'.


Not all weights are born equal, different paradigms allow more parameters while being less parameter-efficient, e.g. https://openreview.net/forum?id=TXqemS7XEH


If retrieval-based NLP [0] becomes a thing, then trillion-plus-parameter models will likely be less of a thing, as most of these tens to hundreds of billions of parameters are very likely over-fitting on (better word: memorizing) the training data [text corpus], as seen in the case of GPT-3.

[0] https://ai.stanford.edu/blog/retrieval-based-NLP/


Yes, self-attention mechanisms are dense associative memories, so it might be possible to replace them in many cases with simpler storage mechanisms. Still, I would count the required storage space as part of a model's parameter size -- e.g., a model consisting of 1 trillion values in RAM and 99 trillion values in storage consists of... 100 trillion values.


The switch transformer has already achieved a trillion parameters.

https://arxiv.org/abs/2101.03961


Unless we have misunderstood neurons and microtubules are the fundamental computational unit, in which case we are out by an order of magnitude.


There was a result recently of modeling an organic neuron with 1000 digital neurons.

And even if that result was perfect modeling of the neuron, that assumes perfect and exhaustive data readings on the organic neuron, which is, frankly, unlikely. (Not that I know how to estimate how much it's missing, but I don't think we fully understand a single neuron yet.)


CPU in kilohertz then megahertz then gigahertz then it stopped.

RAM in kilobytes then megabytes then gigabytes then it stopped.


Yes for CPU, no for RAM. You can buy a computer with terabytes of RAM just fine. It's just expensive.


Expensive is a really relative term right about now... 64 GB DDR4 LRDIMMs could be had for about $250 each on eBay before the chip shortage. While that price is a "good-ish deal", it really wasn't unheard of. A search I just did returns more than a few hits...

Now compare to single GPU prices...


Those are material science and physical limitations.

The number of parameters in a neural network is not really limited that way; doing useful compute with it is a different matter.


China's Wu Dao 2.0 has 1.75 trillion parameters. https://towardsdatascience.com/gpt-3-scared-you-meet-wu-dao-...


A 10 trillion parameter model was mentioned here: https://mobile.twitter.com/ethancaballero/status/14458268620...


That's MoE.


Mixture of Experts, aka not all 10 trillion parameters are used at the same time, just a subset that is an "expert" on the "task at hand".
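A toy sketch of that routing idea, using a hypothetical top-1 softmax gate over a handful of expert sub-networks (all names and sizes here are made up; real MoE layers like the Switch Transformer route per token inside a much larger network):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=1):
    """Toy top-k mixture-of-experts layer.

    Only the k experts with the highest gate scores actually run,
    so most parameters sit idle for any given input.
    """
    scores = gate_w @ x                       # one gate score per expert
    top = np.argsort(scores)[-k:]             # indices of selected experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                  # softmax over the chosen few
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.normal(size=8)
gate_w = rng.normal(size=(4, 8))              # gate for 4 experts
# Each "expert" is just a random linear map here.
experts = [lambda v, W=rng.normal(size=(8, 8)): W @ v for _ in range(4)]
y = moe_forward(x, gate_w, experts)
```

With k=1, only a quarter of the expert parameters touch any given input, which is how MoE models reach huge parameter counts without a proportional compute cost.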


That would also describe an organic brain though.


What's really interesting is that these models are using some non-trivial portion of all easily accessible human writing -- yet humans learn language really well with significantly less input data. What's missing in the field to replicate human performance in learning?


Imagine you lived in a black room and all you can see is a buffer of text scrolling in front of you. Nobody explains what the symbols mean. You don't remember anything medium term, you can only access a short snippet of text at a time and form long term memories gradually. You just look and predict what will come next. Who could become an intelligent person being raised in these conditions?

So they are missing 2 years worth of visual, auditory, tactile and other modalities (grounding), having direct access to change their environment (embodiment) and being part of our society or an AI society (social).


Humans use language to accomplish tasks in their environment - establishing relationships, making deals, coaxing others, etc. By contrast, all neural language models do is predict the next word as a function of the previous words. So far, these language models have nothing at all to do with language learning. They're only valuable insofar as they advance downstream engineering tasks like machine translation.


One dimension that isn't being considered is that humans have had billions of years of evolution, while neural networks are essentially a blank slate.


https://arxiv.org/pdf/1802.10217.pdf

This is the paper I love to link in response to these sort of objections.


Human priors are a feature, not a bug. The reason deep neural networks need so much data to train (and still do not handle language nearly as well as humans) is precisely that humans have "background knowledge": things that we already know and don't have to learn all over again from scratch. We bring this knowledge to bear in our ability to learn and understand language. Deep neural networks, on the other hand, have only a very, very limited ability to represent and therefore use background knowledge - and so they have to make up for it with astounding amounts of data.

The OP's criticism is valid. Being forced to learn everything end-to-end, from scratch, is a severe limitation.


> Human priors are a feature, not a bug. ... Being forced to learn everything end-to-end, from scratch, is a severe limitation.

They're neither. When performing very human-adjacent tasks, it will certainly put the ML algorithm at a disadvantage compared to us.

But for non-human adjacent tasks, say interpreting what a sequence of amino acids actually means, we can expect the computer to absolutely crush us because our stupid human heuristics take us absolutely nowhere, cause us to see patterns that aren't there, etc. etc.

Regardless, this is irrelevant to the original point that I was making, which is that comparing the performance of DL on human adjacent tasks to the amount of time it takes a human to learn the same task is misguided because you are ignoring the million-year long optimization process to get there.


I think what's very interesting is that most of the answers to my question sort of boil down (if you squint a little at the answers) to "this is hard, but general AI will make it easy".


My apologies if the wording of my comment was confusing but what you say is not at all what I meant.


We sure don't read all of Wikipedia, but our speech/text consumption is pretty high, and we take a long time to learn. At 150 words/minute, I would say babies probably consume about 10 million tokens before they start to speak. A baby's training time is also much longer than the few days a GPU cluster needs. And a baby's vocabulary is very small; it can do very limited things compared to these large models (for example, it can't answer who is the king of England or what is the capital of Bangladesh).

This is not to say that language models are efficient, of course. That's not even remotely true. But we seem to under-estimate how much time and resources we need to learn something.


Training data has 0.339T tokens, less than the number of training parameters. A model like that could store all of the training text with 100B+ parameters left for computation.
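A back-of-the-envelope version of that claim, taking the figures in the comment at face value and crudely counting one parameter per memorized token:

```python
params = 530e9      # Megatron-Turing NLG parameter count
tokens = 0.339e12   # reported training tokens

# Even if one parameter were "spent" per training token,
# well over 100B parameters would remain uncommitted.
spare = params - tokens
```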


Parameters have been sufficient to memorize the training data for a while now. The fact that neural networks still generalize in this setting is a big mystery that is under active investigation.


For some reason, having insane numbers of weights relative to a small amount of training data is just not an issue for modern NNs.



But then you try to predict the next token on a completely unseen piece of the corpus and fail miserably if all you do is store the training data.


A single weight can't encode an individual word, but the ratio looks close to overfitting to me too.


I've often wondered if a lighter reinforcement learning based model on top of a full text index might do as well or better than these putatively overfit language models. Curious if anyone knows of ongoing or recent work on this approach.


If 16-bit floating-point numbers are used, it can presumably encode all tokens. In theory. It would not be very easy to work with.


Maybe that’s what it’s doing under the hood.


This reminds me a little bit of the early 2000's where search engines would list the number of indexed pages on their homepage. For language models, does large = good? I'm guessing the quality of the corpus matters as much.


>For language models, does large = good?

The short answer is yes, the long answer is it's complicated.

You could actually think of these models as a type of indexer because, at their heart, what they are doing is memorizing the training data and storing it in such a way that incomplete samples can be used as keys to extract complete samples. The magic happens because the models themselves (even the 100+ billion parameter ones) are nowhere near complex enough to actually store all of these possible key value pairs. Instead, the model has to compress its representation of the data which leads to generalization. Larger models can model more complexities which leads to better performance as long as your training dataset is sufficiently large and varied.


Maybe? The Scaling Hypothesis [1] suggests that greater capabilities of intelligence may emerge from scaling up 'scalable architectures' to large sizes. GPT-3 exhibits 'meta-learning' capabilities that GPT-2 did not (like learning how to sum numbers) -- probably just because it's a 100x larger version of GPT-2.

[1] https://www.gwern.net/Scaling-hypothesis


I'm not sure it moves the needle on NLU/classification tasks very much, compared to models with many fewer parameters. But it does seem to make the NLG better, which is what Microsoft seems obsessed with lately.


This is great. Now, how do we inference these models economically? It appears there's some kind of competition to train larger and larger models, but the inferencing side of the story seems to be neglected?


Model inference is actually comparatively very cheap. If you have the resources to train a model, you most definitely have the resources to run it.


Not necessarily: you train once, but you may run inference billions of times. The compute required could be beyond your resources.


Does that hold as the workload scales up? E.g. could this or similar models be used as part of a general-purpose search engine whereby (at least) one inference is completed per unique search? Aside from computation, I know these models consume an intense amount of memory -- would that scale horizontally easily / economically? Would it need to?


Google is using BERT for most/all search queries [1]. BERT is far smaller than Megatron (340M < 530B), but still "big" in a traditional sense (in the blog they say they are using TPUs for inference).

[1] https://blog.google/products/search/search-language-understa...


Don't forget that Google is not stopping at BERT. I dunno if they used any of the T5s or Switch Transformers in production, but they've said that MUM (O(100b)?) is going to run for production search queries, and no one knows what 'Pathways' is (multimodal O(1000b) MoE?) that Jeff Dean has been enthusing about.


Interesting, though as you noted it's less than 1/1000th the size of the 530B model and according to the article is only used in about one in ten U.S.-based English-language searches, at least when that was written.


When you say "inference", do you mean "interface", or is "inference" an ML term I'm not familiar with?


It's an ML term; inference basically means using the probability model you learned to draw "inferences" about a piece of data. In this context, it means giving the language model some context and using some method (either arg-max sampling or something more sophisticated like beam search) to do what amounts to statistical autocompletion on it. As you might imagine, doing this with a 530B-parameter model at speed is quite energy intensive, even though there are things you can do to compress the model (distillation, pruning, compression/quantization) and specialised inference hardware.

Technically there is some very specific meaning to inference vs. prediction, but it's been heavily overloaded with meaning by now.
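For concreteness, a minimal sketch of the arg-max ("greedy") decoding mentioned above, assuming a hypothetical `next_token_probs` function that maps a token sequence to a {token: probability} dict:

```python
def greedy_decode(next_token_probs, context, steps, eos=None):
    """Greedy (arg-max) decoding: repeatedly append the single most
    probable next token, stopping at an optional end-of-sequence token.
    """
    out = list(context)
    for _ in range(steps):
        probs = next_token_probs(out)        # model call: {token: prob}
        tok = max(probs, key=probs.get)      # arg-max over the vocabulary
        if tok == eos:
            break
        out.append(tok)
    return out
```

Beam search generalizes this by keeping the top-k partial sequences at each step instead of just one; either way, each step is a full forward pass through the model, which is where the inference cost comes from.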


First you train a model then you use it, "inference" is a fancy word for using the model.



I guess I'm interested to see if this performs qualitatively better than GPT-3, given how many more parameters it has.

However, I think this is really a dead-end: throwing more hardware at this is just going to generate better-sounding nonsense. Yes, we are learning the "model" of the English language - which words go with which others, but successively larger transformer models don't really expose much more about the nature of intelligent conversation.

I think we need a better algorithm now.


I also agree with this; it's almost like a marketing ploy - especially from OpenAI. They produce awesome stuff, but things with GPT can get silly sometimes, like when they wouldn't release the larger versions because 'they were too powerful', and in the end you ask it how many eyes your foot has and it says seven...

It produces interesting results, but doesn't really progress the field - although who knows, maybe Skynet is actually a 100T-parameter transformer.


Listening to conversations and learning how they flow is only one aspect of language learning. What these things are missing is the interactive part. Next generation systems need to be able to form hypotheses about what appropriate responses should be, try out various responses and then see what the results are. They can only learn so much from passive consumption of training sets, so I agree this approach is going to hit a wall of diminishing returns.


China's WuDao model had 1.75T parameters and Google's Switch transformer had 1.6T. How is this the world's largest then?


Interesting that books3 and The Pile are among the largest corpora used for training - both with copyright concerns.


Do you want to be the reason we can't have nice things? Please don't post things like this.


They’re using and citing it, better than whatever GPT-3 did.


Really cool!

I'd love to see a table comparing the results against the other gigantic models (I know I could Google the other results and merge them together, but no thanks).


Has there been any update on the legality of using this kind of model? Is it ok to just crawl the web, take any content you want, train a model and sell access to the model like OpenAI/GPT-3/GitHub Copilot?


For the most part, everything that's not barred by law is "legal." Does this use constitute copyright infringement (if it were trained on copyrighted material)? IMO no, but it depends very much on the use of the model. Copilot is especially interesting because instead of being used for simple inference the model is being used to author new works that might aspire to also be copyrighted. Are those new works derivative works? Perhaps. We consider art and science produced by humans to be inspired in part by that which they've been exposed to before. If the model hasn't been overfitted, it should generalize its 'knowledge' sufficiently that it's 'similar' to our intelligence. Humans can commit copyright infringement when they recall and author content so specifically as to be a derived work.

In any case: my opinion matters for naught. The only 'update' you'd get that matters is from a court producing a ruling. Legal journals might chime in but their opinion isn't binding. Theoretically there could be legislation to clarify but that's probably a really, really, really long way off.

Certainly some of the training looks to be content that's not copyrighted or no longer copyrighted, btw.


Any idea if they’ll release an API similar to GPT-3's? It’s great that larger and larger models are trained, but without access to the trained models, developers are left out of the progress…


I hope they don’t release an API the way they released an API for GPT-3.


Why? What do you not like about the GPT-3 API?


Can't fine-tune it. Can't use it on private data. Pay per token. High-latency access. Inconsistent performance. No internet-disconnected use. Can't guarantee repeatable results. Can't (easily) replace the sampling with alternatives that have different behavior (e.g. for using it for compression).


So, uh, what do the seed-to-seed variance studies look like on a network of this size? Surely someone trained 100 to see the distribution. ducks


How has the previous largest model, GPT-3, generated value? How much better is this model at those tasks?


GPT-3 powers GitHub Copilot, so it generated some value.


Wonder how much compute it would cost to train this thing, if you weren't Nvidia..


Will anyone outside of Nvidia be able to access it? GPT-3 at least has an API.


(Team member of this project) Just a clarification: both Microsoft and Nvidia have ownership of this model. Here is the Microsoft version of the same announcement.

https://www.microsoft.com/en-us/research/blog/using-deepspee...


hoping so!


Same!


What's the perplexity?


all these models are over-hyped. we are nowhere close to AGI until we can come up with a reasonable definition for consciousness


It's possible consciousness isn't a real thing. Humans might just be big neural networks that predict the actions most likely to result in survival.



