I don't think your comment is really true. LLM providers and researchers have been a bit too eager to claim their software is mystically complex. Anthropic's research is shedding light on interpretability, there has been good work done on the computational complexity side, and I am quite confident that the issue is LLMs' newness and complexity, not that the problem is actually intractable (or specifically "more intractable" than other hopelessly complex software like Facebook or Windows).
To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous number of things. But for a given specific problem the training data can tell you a lot:
- whether there is test contamination with respect to LLM benchmarks or other assessments of performance
- whether there's any CSAM, racist rants, or other things you don't want
- whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue
- whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")
- investigating oddness like that SolidGoldMagikarp glitch in ChatGPT
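
To make the first bullet concrete, here is a rough sketch of a contamination check based on word n-gram overlap. It is only an illustration: the function names, the whitespace tokenization, and the default of 8-word n-grams are my own assumptions, not any provider's actual pipeline, and a real corpus would need something more scalable than an in-memory set.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text (naive whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Flag benchmark items that share at least one n-gram with the training docs.

    Returns (fraction of items flagged, list of flagged items).
    Hypothetical helper for illustration only.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Toy data, with a short n so the overlap is visible.
training_docs = ["the quick brown fox jumps over the lazy dog"]
benchmark_items = ["A quick brown fox appears", "completely unrelated question here"]
rate, flagged = contamination_rate(benchmark_items, training_docs, n=3)
```

On the toy data, only the first benchmark item shares a trigram with the training text, so half the items get flagged. The point is just that this kind of question is answerable by inspection if you have the data.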