- The lookup table theory of LLMs:
An LLM will succeed at a task if and only if its training data contains sufficiently similar tasks.
The theory says LLMs are glorified lookup tables. More precisely, we can say:
- LLMs have more training data than humans.
- LLMs are worse at generalization than humans.
- Implication: LLMs are bad at rare tasks.
- We can classify tasks on two dimensions: (i) how difficult they are; (ii) how rare they are (i.e. how rarely they are faced in human society, and thus in the training data). The lookup-table theory says rarity will be a relatively stronger predictor of LLM success than of human success. A classic example compares chess to chmess, a version of chess with the rules slightly tweaked. A typical human might be somewhat better at chess than at chmess; a typical LLM will be far better at chess than at chmess, because its skill comes so much from imitation.
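The chess/chmess prediction can be made concrete with a toy success model. All coefficients below are illustrative assumptions, not estimates: human success is driven mostly by difficulty, LLM success mostly by rarity.

```python
# Toy model of the lookup-table theory. The coefficients are assumed,
# purely for illustration.

def human_success(difficulty, rarity):
    # Humans generalize well, so rarity matters only a little.
    return max(0.0, 1.0 - 0.8 * difficulty - 0.1 * rarity)

def llm_success(difficulty, rarity):
    # LLMs imitate, so rarity dominates.
    return max(0.0, 1.0 - 0.2 * difficulty - 0.8 * rarity)

# Chess vs. chmess: same difficulty, very different rarity.
chess = dict(difficulty=0.6, rarity=0.1)
chmess = dict(difficulty=0.6, rarity=0.9)

print(human_success(**chess), human_success(**chmess))  # modest gap
print(llm_success(**chess), llm_success(**chmess))      # large gap
```

Under these assumed weights, the human's chess/chmess gap is small while the LLM's is large, which is exactly the pattern the theory predicts.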
- The lookup table theory unifies many observations about capabilities.
- I think this is a pretty orthodox understanding of LLM strengths & weaknesses, but I haven’t seen it written down in this way. I think the theory helps explain many observations about task properties that are predictive of LLM success: time horizon, verifiability, separability, out-of-distributionness, hill-climbability, messiness.
Implications
- LLMs succeed at a rare task if it can be decomposed into common tasks.
- Recent LLMs have the ability to do basic decompositions of a task into subtasks. As a consequence, they can do those rare tasks which can be easily decomposed into common tasks. An example: “multiply 132 and 832 and tell me the capital of Bolivia.” This is rare (there are few similar examples in the training data), but it can easily be decomposed into two tasks that are common.
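A toy sketch of the idea, assuming the request has already been split into subtasks (the subtask tuples and the `KNOWN_CAPITALS` lookup are invented for illustration, not a real model's internals):

```python
# A rare composite request handled by solving its common subtasks.

KNOWN_CAPITALS = {"Bolivia": "Sucre"}  # constitutional capital

def solve_subtask(subtask):
    kind, payload = subtask
    if kind == "multiply":           # common subtask: arithmetic
        a, b = payload
        return a * b
    if kind == "capital_of":         # common subtask: factual recall
        return KNOWN_CAPITALS[payload]
    raise ValueError(f"unknown subtask kind: {kind!r}")

def solve(subtasks):
    # "multiply 132 and 832 and tell me the capital of Bolivia",
    # decomposed into two common subtasks:
    return [solve_subtask(s) for s in subtasks]

print(solve([("multiply", (132, 832)), ("capital_of", "Bolivia")]))
# → [109824, 'Sucre']
```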
- LLMs fail at tasks that take humans a long time.
- METR finds that, among software tasks, LLM success rate is strongly predicted by the human time required to complete the task (longer tasks, lower success; the reported correlation is around 85%). The lookup-table theory predicts this if long tasks tend to contain subtasks that are rare, i.e. farther from the training data. Suppose tasks contain varying numbers of subtasks: tasks with many subtasks both (a) take humans longer, and (b) are more likely to contain at least one rare subtask, which is out of the reach of an LLM.
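A minimal simulation of this argument, with wholly assumed numbers (the per-subtask rarity `R` is made up; this is not METR's data or model):

```python
# Toy version of the time-horizon argument: a task is a chain of
# subtasks; the LLM fails whenever it hits a rare subtask, so success
# decays with task length -- and longer chains are exactly the tasks
# that take humans more time. R is an assumed rarity rate.
import random

random.seed(0)
R = 0.05  # assumed probability that any given subtask is rare

def llm_succeeds(n_subtasks):
    # Succeed only if no subtask in the chain is rare.
    return all(random.random() > R for _ in range(n_subtasks))

for n in (1, 5, 20, 50):
    rate = sum(llm_succeeds(n) for _ in range(10_000)) / 10_000
    print(n, round(rate, 2))  # decays roughly like (1 - R) ** n
```

The success rate falls off geometrically in the number of subtasks, so in this toy world, human completion time (proportional to chain length) predicts LLM success almost mechanically.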
- LLMs are good at hill-climbable tasks.
- (…)
- LLMs successes are highly correlated.
- Many recent papers find a high correlation in agent success rates across benchmarks, e.g. Ruan, Maddison, and Hashimoto (2024). If one LLM beats another on benchmark A, it probably also beats it on benchmark B. One interpretation is that all LLMs are trained on pretty similar sets of training data – i.e. the big AI labs all try to cover roughly the same ground – and so the metric of “closeness to the training data” is pretty similar across models.
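This interpretation can be illustrated with a toy latent-variable simulation (all numbers assumed): give each model a single latent "coverage" score, treat each benchmark score as a noisy readout of it, and the resulting leaderboards largely agree.

```python
# Toy simulation: one latent coverage score per model; each benchmark
# is coverage plus independent noise, so rankings correlate across
# benchmarks. Scores and noise scale are assumed, not measured.
import random

random.seed(0)
coverage = {f"model_{i}": i / 9 for i in range(10)}  # latent closeness to training data

def benchmark_score(model, noise=0.05):
    return coverage[model] + random.gauss(0, noise)

bench_a = sorted(coverage, key=benchmark_score, reverse=True)  # leaderboard A
bench_b = sorted(coverage, key=benchmark_score, reverse=True)  # leaderboard B
print(bench_a)
print(bench_b)  # noisy, but a very similar ordering to leaderboard A
```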
- LLMs are good at separable tasks.
- (…)
References
Ruan, Yangjun, Chris J. Maddison, and Tatsunori Hashimoto. 2024. “Observational Scaling Laws and the Predictability of Language Model Performance.” https://arxiv.org/pdf/2405.10938.pdf.