Hacker News

They work in a token space whose metric structure is given by proxies for concepts. So at a point in this space I can "walk towards" points which cluster around the token "dog".
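A minimal sketch of what "walking towards the dog cluster" means geometrically, using made-up toy vectors (no real model's embeddings), where nearness is cosine similarity:

```python
import numpy as np

# Hypothetical 3-d "token space"; vectors are invented for illustration.
embeddings = {
    "dog":        np.array([0.9, 0.8, 0.1]),
    "cat":        np.array([0.85, 0.75, 0.2]),
    "bone":       np.array([0.7, 0.6, 0.3]),
    "carburetor": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: the usual proxy for 'nearness' in token space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Walking towards" the dog cluster amounts to ranking tokens by
# similarity to "dog": associated tokens rank high, unrelated ones low.
ranked = sorted(embeddings,
                key=lambda t: cosine(embeddings[t], embeddings["dog"]),
                reverse=True)
print(ranked)  # ['dog', 'cat', 'bone', 'carburetor']
```

This is exactly the weak "association" structure described below: "cat" lands near "dog" purely by geometry, with nothing said about composition or counterfactuals.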

This is a weak model of some features of concepts, e.g. association: "dog" is associated with "cat", etc. But it does not model, e.g., composition, nor intension, nor the role of the term in counterfactuals. (See my comment elsewhere in this thread on this issue.)

However, you can always brute-force your way to apparent performance in some apparently conceptual skill if the kinds of questions you ask are similar to the training data. So, e.g., if someone has asked "if dogs played on Mars, would they be happy?" or a similar-enough family of questions, then that allows you to have a "dog" cluster around literal facts and a "dog" cluster around some subset of pre-known counterfactuals.

If you want to see the difference between this and genuine mental capability, note that there are infinite combinations of concepts of arbitrary depth, which can be framed in an infinite number of counterfactuals, and so on. And a child armed with only those basic components, and the capacity for imagination, can evaluate this infinite variety.

This is why we see LLMs being used most in narrow fields (esp. software engineering), where the kinds of "conceptual work" they need have been extremely well documented and are sufficiently stable to provide some utility.



As always with definitive assertions regarding LLMs' incapacities, I would be more convinced if one could demonstrate those assertions with an illustrative example on a real LLM.

So far, the ability of LLMs to manipulate concepts has been indistinguishable in practice from "true" human-level concept manipulation. And not just for scientific, "narrow" fields.


The problem with mental capacities is that they are not measured by tests. We have no valid and reliable way of determining them, which is why "metrics psychology" is a pseudoscience.

If I give a child a physics exam and they score 100%, it could be either because they're genuinely a genius (possessing all relevant capabilities and knowledge) or because they cheated. Suppose we don't know how they're cheating, but they are. Now, how would you find out? Certainly not by feeding them more physics exams; at the least, it's easy enough to suppose they can cheat on those too.

The issue here is that the LLM has compressed basically everything written in human history, and the question before us is "to what degree is a 'complex search' operation expressing a genuine capability, vs. cheating?"

And there is no general methodological answer to that question. I cannot give you a "test", not least because I'm required to give it to you in token-in/token-out form (i.e., written), and this dramatically narrows the scope of capability-testing methods.

E.g., I could ask the cheating child to come to a physics lab and perform an experiment -- but I can ask no such thing of an LLM. One thing we could do with an LLM is have a physics-ignorant person act as an intermediary with it, and see if they, with the LLM, can find the charge on the electron in a physics lab. That's highly likely to fail with current LLMs, in my view -- because much of the illusion of their capability lies in the expertise of the prompter.

> has been indistinguishable in practice from "true" human-level concept manipulation

This claim indicates you're begging the question. We do not use the written output of animals' mental capabilities to establish their existence -- that would be gross pseudoscience; so to say that LLMs are indistinguishable from anything relevant indicates you're not aware of what the claim of "human-level concept manipulation" even amounts to. It has nothing to do with emitting tokens.

When designing a test to see whether an animal possesses a relevant concept, can apply it to a relevant situation, can compose it with other concepts, and so on -- we would never look to linguistic competence, which even in humans is an unreliable proxy: hence the need for decades of education and the high fallibility of exams.

Rather, if I were assessing "does this person understand 'dog'?", I would be looking for contextual competence in applying the concept across a very broad role in reasoning processes: identification in the environment, counterfactual reasoning, composition with other known concepts in complex reasoning, and the like.

All LLMs do is emit text as if they have these capacities, which makes a general solution to exposing their lack of them basically methodologically impossible. Training LLMs is an anti-inductive process: the more tests we provide, the more they are trained on them, so the tests become useless.

Consider the following challenge: there are two glass panels; one is a window, and the other is a very high-def TV showing a video-game simulation of the world outside the window. You are fixed at a distance of 20 meters, and can only test each glass pane by taking a photograph of it and studying the photograph. Can you tell which pane shows the real outside? In general, no.

This is the grossly pseudoscientific experimental restriction people who hype LLMs impose: the only tests are tokens-in, tokens-out -- "photographs taken at a distance". If you were about to be thrown against one of these glass panels, which would you choose?

If an LLM were, based on token-in/token-out analysis alone, put in charge of a power plant: would you live nearby?

It matters whether these capabilities exist, because if they are real, the system will behave as expected according to those capabilities. If it's cheating, then when you're thrown against the wrong window, you fall out.

LLMs are, in practice, incredibly fragile systems whose apparent capabilities quickly disappear when the kinds of apparent reasoning they need to engage in are poorly represented in their training data.

Consider one way of measuring the capability to imagine that isn't token-in/token-out: energy use and time-to-compute.

Here, we can say for certain that LLMs do not engage in counterfactual reasoning. E.g., we can give a series of prompts (p1, p2, p3, ...) which require increasing complexity in the imagined scenario -- e.g., exponentially more diverse stipulations -- and we do not find O(answering) to follow O(prompt-complexity). Rather, the search strategy is always the same for a single-shot prompt, so no trace through an LLM involves simulation. We can get mildly-above-linear (apparent) reasoning complexity with chain-of-thought, but this likewise does not follow the target O().

The kinds of time-to-compute we observe from LLM systems are entirely consistent with a "search and synthesis over token space" algorithm that only appears to simulate if the search space contains prior exemplars of simulation. There is no genuine capability.
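The compute argument can be made concrete with a back-of-the-envelope sketch. The 2 * params * tokens FLOPs figure is a standard rough approximation for a decoder-only transformer forward pass; the model size and prompt lengths below are hypothetical:

```python
# Sketch: single-forward-pass compute scales with parameter count and
# token count only -- it cannot depend on how hard the imagined
# scenario is, because the same fixed computation runs either way.

def forward_flops(n_params: int, n_tokens: int) -> int:
    """Rough FLOPs estimate for one forward pass over n_tokens."""
    return 2 * n_params * n_tokens

n_params = 7_000_000_000  # hypothetical 7B-parameter model

# Two prompts of equal token length: one trivial, one stipulating a
# far more constrained counterfactual scenario.
trivial_tokens = 50
counterfactual_tokens = 50

print(forward_flops(n_params, trivial_tokens) ==
      forward_flops(n_params, counterfactual_tokens))  # True
```

Under this estimate, answering cost tracks token counts, not scenario complexity -- which is the O() mismatch the paragraph above points at.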


"…we would never look to linguistic competence".

On the contrary, I strongly believe that what LLMs have proved is what linguists have always told us: that language provides a structure on top of which we build our experience of concepts (the Sapir-Whorf hypothesis).

I don't think one can conceptualize much without the use of a language.


> I don't think one can conceptualize much without the use of a language.

Well, a great swath of the animal kingdom stands against you.

LLMs have invited yet more of this pseudoscience. It's a nonsense position in the empirical study of mental capabilities across the animal kingdom -- something previously believed only by idealist philosophers of the early 20th century and prior. Now it is brought back so people can maintain their self-image in the face of their apparent self-deception: better to opt for gross pseudoscience than to admit we're fooled by a text-generation machine.


I would agree with this if the LLM never really modified the initial linear embeddings, but the non-linearity in the MLP layers and the position/correlation mixing in the attention layers mean that things are not so simple. I’m pretty sure there are papers showing compositionality and so on being represented by transformers.
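The non-linearity point is easy to demonstrate. A minimal sketch of a transformer-style feed-forward sublayer, with tiny hand-picked toy weights (not from any real model) so the result is checkable by hand:

```python
import numpy as np

# Toy ReLU feed-forward sublayer: mlp(x) = W2 @ relu(W1 @ x).
# With these weights, mlp(x) computes |x[0]|.
W1 = np.array([[ 1.0, 0.0],
               [-1.0, 0.0]])
W2 = np.array([[ 1.0, 1.0]])

def mlp(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

a = np.array([ 1.0, 0.0])
b = np.array([-1.0, 0.0])

# A purely linear map would satisfy mlp(a + b) == mlp(a) + mlp(b).
print(mlp(a + b)[0])          # 0.0
print((mlp(a) + mlp(b))[0])   # 2.0 -> superposition fails: non-linear
```

So the layer does not merely re-shuffle the initial linear embedding geometry; whether that non-linearity buys genuine compositionality is the open question in the thread.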



