Evaluating the world model implicit in a generative model (arxiv.org)
158 points by dsubburam 7 days ago | 46 comments
mistercow 7 days ago | root | parent | next |
> It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).
I think that’s too strong a statement. I would say that it’s very constrained in its ability to model that, but not having access to the same inputs doesn’t mean you can’t model a process.
For example, we model hurricanes based on measurements taken from satellites. Those aren’t the actual inputs to the hurricane itself, but abstracted correlates of those inputs. An LLM does have access to correlates of the inputs to human writing, i.e. textual descriptions of sensory inputs.
HarHarVeryFunny 7 days ago | root | parent | next |
You can model a generative process, but it's necessarily an auto-regressive generative process, not the same as the originating generative process which is based on the external world.
Human language, and other actions, exist on a range from almost auto-regressive (generating a stock/practiced phrase such as "have a nice day") to highly interactive ones. An auto-regressive model is obviously going to have more success modelling an auto-regressive generative process.
Weather prediction is really a good case of the limitation of auto-regressive models, as well as models that don't accurately reflect the inputs to the process you are attempting to predict. "There's a low pressure front coming in, so the weather will be X, same as last time", works some of the time. A crude physical weather model based on limited data points, such as weather balloon inputs, or satellite observation of hurricanes, also works some of the time. But of course these models are sometimes hopelessly wrong too.
My real point wasn't about the lack of sensory data, even though this does force a purely auto-regressive (i.e. wrong) model, but rather about the difference between a passive model (such as weather prediction), and an interactive one.
nerdponx 6 days ago | root | parent |
The whole innovation of GPT and LLMs in general is that an autoregressive model can make alarmingly good next-token predictions with the right inductive bias, a large number of parameters, a long context window, and a huge training set.
It turns out that human communication is quite a lot more "autoregressive" than people assumed it was up until now. And that includes some level of reasoning capability, arising out of a kind of brute force pattern matching. It has limits, of course, but it's amazing that it works as well as it does.
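For anyone unfamiliar with the term, the autoregressive loop itself is trivial; all of the interesting machinery sits inside the learned next-token distribution. A toy sketch in PyTorch (the embedding-plus-linear "model" here is a stand-in, not a real LLM):

    import torch

    vocab_size, dim = 100, 32
    # Stand-in "language model": embed the last token and project to vocab logits.
    # A real LLM replaces this with a deep transformer over the whole context.
    embed = torch.nn.Embedding(vocab_size, dim)
    head = torch.nn.Linear(dim, vocab_size)

    def next_token_logits(context):      # context: 1-D LongTensor of token ids
        return head(embed(context[-1]))  # toy: condition on the last token only

    tokens = torch.tensor([1, 5, 7])     # the "prompt"
    for _ in range(10):                  # autoregressive generation
        logits = next_token_logits(tokens)
        nxt = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.cat([tokens, nxt.unsqueeze(0)])  # feed the prediction back in
    print(tokens.tolist())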
HarHarVeryFunny 4 days ago | root | parent |
It is amazing, and interesting.
Although I used the word myself, I'm not sure that "autoregressive" is quite the right word to describe how LLMs work, or our brains. Maybe better to just call both "predictive". In both cases the predictive inputs include the sequence itself (or selected parts of it, at varying depths of representation), but also global knowledge, both factual and procedural (HOW to represent the sequence). In the case of our brain there are also many more inputs that may be used such as sensory ones (passive observations, or action feedback), emotional state, etc.
Regardless of what predictive inputs are available to LLMs vs brains, it does seem that in a lot of cases the more constrained inputs of an LLM don't prevent it from sounding very human like (not surprising at some level given the training goal), and an LLM chat window does create a "level playing field" (i.e. impoverished input setting for the human) where each side only sees the other as a stream of text. Maybe in this setting, the human, when not reasoning, really isn't bringing much more predictive machinery to the table than the LLM/transformer!
Notwithstanding the predictive nature of LLMs, I can't help but also see them just as expert systems of sorts, albeit ones that have derived their own rules (much pertaining to language) rather than being given them. This view better matches their nature as fixed repositories of knowledge, brittle where rules are missing, as opposed to something more brain-like and intelligent, capable of continual learning.
shanusmagnus 7 days ago | root | parent | prev | next |
Brilliant analogy.
And we can imagine that, in a sci-fi world where some super-being could act on a scale that would allow it to perturb the world in a fashion amenable to causing hurricanes, the hurricane model could be substantially augmented, for the same reason motor babbling in an infant leads to fluid motion as a child.
What has been a revelation to me is how, even peering through this dark glass, titanic amounts of data allow quite useful world models to emerge, even if they're super limited -- a type of "bitter lesson" that suggests we're only at the beginning of what's possible.
I expect robotics + LLM to drive the next big breakthroughs, perhaps w/ virtual worlds [1] as an intermediate step.
slashdave 6 days ago | root | parent | prev |
Indeed. If you provided a talented individual with a sufficient quantity and variety of video streams of travels in a city (like New York), that person would be able to draw you a map.
madaxe_again 7 days ago | root | parent | prev | next |
You say this, yet cases such as Helen Keller's suggest that a full sensorium is not necessary to be a full human. She had some grasp of the idea of colour, of sound, and could use the words around them appropriately - yet had no firsthand experience of either. Is it really so different?
I think “we” each comprise a number of models, language being just one of them - however an extremely powerful one, as it allows the transmission of thought across time and space. It’s therefore understandable that much of what we recognise as conscious thought, of a model of the world, emerges from such an information dense system. It’s literally developed to describe the world, efficiently and completely, and so that symbol map an LLM carries possibly isn’t that different to our own.
HarHarVeryFunny 7 days ago | root | parent |
It's not about the necessity of specific sensory inputs, but rather about the difference in type of model that will be built when the goal is passive, and auto-regressive, as opposed to when the goal is interactive.
In the passive/auto-regressive case you just need to model predictive contexts.
In the interactive case you need to model dynamical behaviors.
madaxe_again 6 days ago | root | parent |
I don't know that I see the difference - but I suppose we're getting into Brains In Vats territory. In my view (well, Baudrillard's view, but who's counting?) a perfect description of a thing is as good as the thing itself, and we in fact interact with our semantic description of reality rather than with raw reality itself - the latter, when it manifests in humans, results in vast cognitive dysfunction. Sacks wrote somewhat on the topic of unfiltered sensorium and its impact on the ability to operate in the world.
So yeah. I think what these models do and what we do is more similar than we might realise.
comfysocks 5 days ago | root | parent | prev | next |
It seems to me that the human authors of the training text are the ones who have created the "world model", and have encoded it into written language. The LLM transcodes this model into word-embedding vector space. I think most people can recognize a high-dimensional vector space as a reasonable foundation for a mathematical "model". The humans are the ones who have interacted with the world and have perceived its workings. The LLM only interacts with the humans' language model. Some credit must be given to the human modellers for the unreasonable effectiveness of the LLM.
machiaweliczny 6 days ago | root | parent | prev | next |
But if you squint then sensory actions and reactions are also sequential tokens. Even reactions can be encoded alongside the input, as action tokens in a single token stream. Anyone tried something like this?
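For reference, this is roughly what Decision Transformer / Gato-style setups do: serialize observations and actions into one interleaved token stream and train the usual next-token objective on it. A rough sketch (the mode-token names and ids are made up):

    # Interleave observation and action tokens into one flat stream.
    # OBS/ACT are hypothetical "mode" tokens marking what the following ids encode.
    OBS, ACT = 1000, 1001

    def interleave(episode):
        """episode: list of (obs_tokens, act_tokens) pairs -> flat token stream."""
        stream = []
        for obs_tokens, act_tokens in episode:
            stream += [OBS] + obs_tokens + [ACT] + act_tokens
        return stream

    episode = [([3, 7, 2], [42]), ([3, 8, 2], [17])]  # toy discretized observations + actions
    print(interleave(episode))
    # A sequence model is then trained to predict the next token of this stream; at
    # inference time, action tokens are sampled and executed, new observations appended.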
RaftPeople 5 days ago | root | parent |
> But if you squint then sensory actions and reactions are also sequential tokens
I'm not sure you could model it that way.
Animal brains don't necessarily just react to sensory input; they frequently have already predicted the next state based on previous state and learning/experience, and not just in a simple sequential manner but at many different levels of pattern simultaneously (local immediate actions vs. actions that are part of a larger structure of behavior), etc.
Sensory input is compared to predicted state and differences are incorporated into the flow.
The key thing is our brains are modeling and simulating the world around us and its future state (modeling the physical world as well as the abstract world of what other animals are thinking). It's not clear that LLMs are doing that (my assumption is that they are not doing any of that, and until we build systems that do, we won't be moving towards the kind of flexible and adaptable control our brains have).
Edit: I just read the rest of the parent post that said basically the same thing, was skimming so missed it.
dsubburam 6 days ago | root | parent | prev | next |
> The "world model" of a human, or any other animal, is built pursuant to predicting the environment
What do you make of Immanuel Kant's claim that all thinking has as a basis the presumption of the "Categories"--fundamental concepts like quantity, quality and causality[1]. Do LLMs need to develop a deep understanding of these?
westurner 5 days ago | root | parent |
Embodied cognition implies that we understand our world in terms of embodied metaphor "categories".
LLMs don't reason, they emulate. RLHF could cause an LLM to discard text that doesn't look like reasoning according to the words in the response, but that's still not reasoning or inference.
"LLMs cannot find reasoning errors, but can correct them" https://news.ycombinator.com/item?id=38353285
Conceptual metaphor: https://en.wikipedia.org/wiki/Conceptual_metaphor
Embodied cognition: https://en.wikipedia.org/wiki/Embodied_cognition
Clean language: https://en.wikipedia.org/wiki/Clean_language
Given human embodied cognition as the basis for LLM training data, there are bound to be weird outputs about bodies from robot LLMs.
lxgr 7 days ago | root | parent | prev | next |
But isn't the distinction between a "passive" and an "active" model ultimately a metaphysical (freedom of will vs. determinism) question, under the (possibly practically infeasible) assumption that the passive model gets to witness all possible actions an agent might take?
Practically, I could definitely imagine interesting outcomes from e.g. hooking up a model to a high-fidelity physics simulator during training.
stonemetal12 7 days ago | root | parent | prev | next |
People around here like to say "The map isn't the territory". If we are talking about the physical world, then language is a map, not the territory, and not a detailed one either; an LLM trained on it is a second-order map.
If we consider the territory to be human intelligence, then language is still a map, but a much more detailed one. Thus an LLM trained on it becomes a more interesting second-order map.
seydor 7 days ago | root | parent | prev | next |
Animals could well use an autoregressive model to predict the outcomes of their actions on their perceptions. It's not like we run math in our everyday actions (it would take too long).
Perhaps that's why we can easily communicate those predictions as words.
ElevenLathe 5 days ago | root | parent | prev |
We can't see neutrons either, but we have built various models of them based on indirect observations.
zxexz 7 days ago | prev | next |
I've seen some very impressive results just embedding a pre-trained KGE model into a transformer model and letting it "learn" to query it (I've just used heterogeneous loss functions during training, with "classifier dimensions" that determine whether to greedily sample from the KGE sidecar; I'm sure there are much better ways of doing this). This is just a subjective viewpoint, obviously, but I've played around quite a lot with this idea, and it's very easy to get an "interactive" small LLM with stable results doing such a thing. The only problem I've found is _updating_ the knowledge cheaply without partially retraining the LLM itself. For small, domain-specific models this isn't really an issue though - for personal projects I just use a couple of 3090s.
I think this stuff will become a lot more fascinating after transformers have bottomed out on their hype curve and become a tool when building specific types of models.
aix1 7 days ago | root | parent |
> embedding a pre-trained KGE model into a transformer model
Do you have any good pointers (literature, code etc) on the mechanics of this?
zxexz 7 days ago | root | parent | next |
Check out PyKEEN [0] and go wild. I like to train a bunch of random models and "overfit" them to the extreme (in my mind, overfitting them is the point for this task; you want dense, compressed knowledge). Resize the input and output embeddings of an existing pretrained (but small) LLM (input only necessary if you're adding extra metadata on input, but make sure you untie input/output weights). You can add a linear-layer extension to the transformer blocks, pass it up as some sort of residual, etc. - honestly, just find a way to shove it in: detach the KGE from the computation graph and add something learnable between it and wherever you're connecting it, like just a couple of linear layers and a ReLU. The output side is more important: you can have some indicator logit(s) to determine whether to "read" from the detached graph or sample the outputs of the LLM. Or just always do both and interpret it.
(like tinyllama or smaller, or just use whatever karpathy repo is most fun at the moment and train some gpt2 equivalent)
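A minimal sketch of the "shove it in" part, assuming PyTorch; the frozen embedding table stands in for a PyKEEN-trained KGE, and the sidecar/adapter names are made up:

    import torch
    import torch.nn as nn

    class KGESidecar(nn.Module):
        """Frozen KGE entity embeddings plus a small learnable adapter into the LM's hidden size."""
        def __init__(self, kge_embeddings, d_model):
            super().__init__()
            # keep the pretrained KGE detached from training; only the adapter learns
            self.kge = nn.Embedding.from_pretrained(kge_embeddings, freeze=True)
            self.adapter = nn.Sequential(
                nn.Linear(kge_embeddings.shape[1], d_model),
                nn.ReLU(),
                nn.Linear(d_model, d_model),
            )

        def forward(self, entity_ids):
            return self.adapter(self.kge(entity_ids))

    # Inside a transformer block you might add the adapted KGE vector as an extra residual,
    # and train an indicator logit on the output side to decide whether to copy from the
    # graph or sample from the usual LM head.
    kge = torch.randn(50_000, 200)   # stand-in for a pretrained entity-embedding table
    sidecar = KGESidecar(kge, d_model=768)
    print(sidecar(torch.tensor([1, 2, 3])).shape)  # torch.Size([3, 768])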
zxexz 7 days ago | root | parent |
Sorry if that was ridiculously vague. I don't know a ton about the state of the art, and I'm really not sure there is one - the papers just seem to get more terminology-dense, and the research mostly just seems to end up developing new terminology. My grug-brained philosophy is to make models small enough that you can just shove things in and iterate quickly in Colab or a locally hosted notebook, with access to a couple of 3090s or even just modern Ryzen/EPYC cores. I like to "evaluate" the raw model using pyro-ppl to do MCMC or SVI on the raw logits over a known holdout dataset.
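One way to read "SVI on the raw logits" - a hedged sketch, with a single temperature latent as my own simplification rather than anything canonical:

    import torch
    import pyro
    import pyro.distributions as dist
    from pyro.distributions import constraints
    from pyro.infer import SVI, Trace_ELBO
    from pyro.optim import Adam

    def model(logits, labels):
        # one latent temperature rescaling the model's holdout logits
        temp = pyro.sample("temp", dist.LogNormal(torch.tensor(0.), torch.tensor(1.)))
        with pyro.plate("data", logits.shape[0]):
            pyro.sample("obs", dist.Categorical(logits=logits / temp), obs=labels)

    def guide(logits, labels):
        loc = pyro.param("temp_loc", torch.tensor(0.))
        scale = pyro.param("temp_scale", torch.tensor(0.1), constraint=constraints.positive)
        pyro.sample("temp", dist.LogNormal(loc, scale))

    logits = torch.randn(512, 32000)           # stand-in for holdout next-token logits
    labels = torch.randint(0, 32000, (512,))   # stand-in for the true next tokens
    svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
    for step in range(200):
        svi.step(logits, labels)
    print(pyro.param("temp_loc").exp().item())  # rough posterior median of the temperature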
Really always happy to chat about this stuff, with anybody. Would love to explore ideas here, it's a fun hobby, and we're living in a golden age of open-source structured datasets. I haven't actually found a community interested specifically in static knowledge injection. Email in profile, in (ebg_13 encoded).
Jerrrrrrry 7 days ago | root | parent |
Thank you for your comments (good further reading terms), and your open invitation for continued inquiry.
The "fomo" / deja vu / impending doom / incipient shift in the Overton window regarding meta-architecture for AI/ML capabilities and risks is now so glaringly obvious an elephant in the room that it is nearly catatonic to some.
napsternxg 7 days ago | root | parent | prev |
We also did something similar in our NTULM paper at Twitter https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX
Used in non-generative language models like BERT, but it should help with generative models as well.
zxexz 7 days ago | root | parent |
Thanks for sharing! I'll give it a read tomorrow - I do not appear to have read this. I really do wish there were good places for randos like me to discuss this stuff casually. I'm in so many slack, discord, etc. channels but none of them have the same intensity and hyperfocus as certain IRC channels of yore.
UniverseHacker 7 days ago | prev | next |
Really glad to see some academic research on this - it was quite obvious from interacting with LLMs that they form a world model and can, e.g., correctly simulate simple physics experiments that are not in the training set. I found it very frustrating to see people repeating the idea that "it can never do x" because it lacks a world model. Predicting text that represents events in the world requires modeling that world. Just because you can find examples where the predictions of a certain model are bad does not imply there is no model at all. At the limit of prediction becoming as good as theoretically possible given the input data and model size restrictions, the model also becomes as accurate and complete as possible. This process is formally described by Solomonoff induction.
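For reference, the ideal predictor Solomonoff induction describes assigns any prefix x the universal prior probability and predicts by conditioning on it (standard textbook form, nothing LLM-specific):

    M(x) = \sum_{p : U(p) = x*} 2^{-\ell(p)},        M(a \mid x) = M(xa) / M(x)

where U is a universal prefix machine, the sum runs over programs p whose output begins with x, and \ell(p) is the program length in bits. This is uncomputable, so the claim is only that better and better sequence predictors can be viewed as better and better approximations of it.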
slashdave 6 days ago | root | parent |
> At the limit of prediction becoming as good as theoretically possible given the input data and model size restrictions
You are treading on delicate ground here. Why do you believe that sequence models are capable of reaching theoretical maximums?
UniverseHacker 6 days ago | root | parent |
I do not think any real systems can ever achieve theoretically perfect Solomonoff Induction- only that increasingly good AI systems can be thought of as increasingly good approximations of this process. I do not know if any particular modeling approach has a fundamental dead end that limits its potential or not. However, my main point is that people claiming that they are certain of a particular fundamental limitation are mistaken. Current LLMs aren’t very intelligent, yet can already do specific things that people like Noam Chomsky have argued are fundamentally theoretically impossible for them to ever do.
slashdave 6 days ago | root | parent |
> However, my main point is that people claiming that they are certain of a particular fundamental limitation are mistaken.
No, they are correct. The architecture, by design and construction, is limited. This is simple math.
UniverseHacker 5 days ago | root | parent |
Limited how exactly? What limitation are you talking about, and what math proves it?
isaacfrond 7 days ago | prev | next |
I think there is a philosophical angle to this. I mean, my world map was constructed by chance interactions with the real world. Does this mean that my world map is as close to the real world as their NN's map is to Manhattan? Is my world map full of non-existent streets, exits in the wrong place, etc.? The NN map of Manhattan works almost 100% correctly when used for normal navigation but breaks down badly when it has to plan a detour. How brittle is my world map?
gwern 6 days ago | root | parent | next |
One of the things about offline imitation learning like OP or LLMs in general is that the more important the error in their world model, the faster it'll correct itself. If you think you can teleport across a river, you'll make & execute plans which exploit that fact first thing to save a lot of time - and then immediately hit the large errors in that plan and observe a new trajectory which refutes an entire set of errors in your world model. And then you retrain and now the world model is that much more accurate. The new world model still contains errors, and then you may try to exploit those too right away, and then you'll fix those too. So the errors get corrected when you're able to execute online with on-policy actions. The errors which never turn out to be relevant won't get fixed quickly, but then, why do you care?
cen4 7 days ago | root | parent | prev |
Also things are not static in the real world.
narush 7 days ago | prev | next |
I’ve replicated the OthelloGPT results mentioned in this paper personally - and it def felt like the next-move-only accuracy metric was not everything. Indeed, the authors of the original paper knew this, and so further validated the world model by intervening in a model’s forward pass to directly manipulate the world model (and check the resulting change in valid move predictions).
I’d also recommend checking out Neel Nanda’s work on OthelloGPT, where he demonstrated the world model was actually linear: https://arxiv.org/abs/2309.00941
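The probing-plus-intervention recipe is conceptually simple once you have activations: fit a (linear) probe from the residual stream to the board state, then nudge the activation along the probe direction for a chosen square and check whether the legal-move predictions change accordingly. A hedged sketch with made-up shapes and names, assuming you can hook a forward pass in PyTorch:

    import torch
    import torch.nn as nn

    d_model, n_squares, n_states = 512, 64, 3         # 64 board squares x {empty, mine, yours}
    probe = nn.Linear(d_model, n_squares * n_states)   # linear probe over residual-stream activations

    def probe_loss(hidden, board_labels):
        # hidden: [batch, d_model] activations at some layer/position
        # board_labels: [batch, n_squares] long tensor with values in {0, 1, 2}
        logits = probe(hidden).view(-1, n_squares, n_states)
        return nn.functional.cross_entropy(logits.reshape(-1, n_states), board_labels.reshape(-1))

    def intervene(hidden, square, target_state, alpha=3.0):
        # crude intervention: push the activation along the probe direction for (square, target_state)
        w = probe.weight.view(n_squares, n_states, d_model)[square, target_state]
        return hidden + alpha * w / w.norm()

    # Patch the hooked activation with intervene(...), re-run the rest of the forward pass,
    # and check whether the model's valid-move predictions now reflect the edited square.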
fragmede 7 days ago | prev | next |
Wrong as it is, I'm impressed they were able to get any maps out of their LLM that look vaguely cohesive. The shortest path map has bits of streets downtown and around Central Park that aren't totally red, and Central Park itself is clear on all 3 maps.
They used eight A100s, but don't say how long it took to train their LLM. It would be interesting to know the wall clock time they spent. Their dataset is, relatively speaking, tiny which means it should take fewer resources to replicate from scratch.
What's interesting is that the smaller model performed better, though they don't speculate why that is.
zxexz 7 days ago | root | parent | next |
I can't imagine training took more than a day with eight A100s, even with that vocab size [0] (does lightning do implicit vocab extension maybe?) and a batch size of 1 [1] or 64 [2] or 4096 [3] (I have not trawled through the repo and other work enough to see what they are actually using in the paper, and let's be real - we've all copied random min/nano/whatever GPT forks and not bothered renaming stuff). They mentioned their dataset is 120 million tokens, which is minuscule by transformer standards. Even with a more graph-based model making it 10x+ longer to train, 1.2 billion tokens per epoch equivalent shouldn't take more than a couple of hours with no optimization (rough back-of-envelope below).
[0] https://github.com/keyonvafa/world-model-evaluation/blob/949... [1] https://github.com/keyonvafa/world-model-evaluation/blob/949... [2] https://github.com/keyonvafa/world-model-evaluation/blob/949... [3] https://github.com/keyonvafa/world-model-evaluation/blob/mai...
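The back-of-envelope, using the standard FLOPs ≈ 6 * params * tokens approximation; every number here is an assumption (model size, epochs, utilization), not something taken from the paper:

    # Rough wall-clock estimate for training on 8 A100s.
    params = 100e6            # assume a ~100M-parameter GPT-2-ish model
    tokens = 120e6 * 10       # 120M-token dataset, assume ~10 epochs' worth
    flops = 6 * params * tokens

    a100_bf16 = 312e12        # peak bf16 FLOP/s per A100
    utilization = 0.3         # guessed MFU for a small, untuned model
    gpus = 8

    seconds = flops / (gpus * a100_bf16 * utilization)
    print(f"~{seconds / 60:.0f} minutes")   # prints on the order of tens of minutes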
IshKebab 7 days ago | root | parent | prev |
It's a bit unclear to me what the map visualisations are showing, but I don't think your interpretation is correct. They even say:
> Our evaluation methods reveal they are very far from recovering the true street map of New York City. As a visualization, we use graph reconstruction techniques to recover each model’s implicit street map of New York City. The resulting map bears little resemblance to the actual streets of Manhattan, containing streets with impossible physical orientations and flyovers above other streets.
fragmede 6 days ago | root | parent |
My read of
> Edges exit nodes in their specified cardinal direction. In the zoomed-in images, edges belonging to the true graph are black and false edges added by the reconstruction algorithm are red.
is that the model output edges; valid ones were then colored black and bad ones red. But it's a bit unclear, so you could be right.
slashdave 6 days ago | prev | next |
Most of you probably know someone with a poor sense of direction (or maybe that's you). From my experience, such people navigate primarily (or solely) by landmarks. This makes me wonder if the damaged maps shown in the paper are similar to the "world model" belonging to a directionally challenged person.
plra 7 days ago | prev | next |
Really cool results. I'd love to see some human baselines for, say, NYC cabbies or regular Manhattanites, though. I'm sure my world model is "incoherent" vis-a-vis these metrics as well, but I'm not sure what degree of coherence I should be excited about.
shanusmagnus 7 days ago | root | parent |
Makes me think of an interesting related question: how aware are we, normally, of our incoherence? What's the phenomenology of that? Hmm.
Jerrrrrrry 7 days ago | prev |
Once your model and map get larger than the thing they are modeling/mapping, then what?
Let us hope the pigeonhole principle isn't flawed, else we may find ourselves batteries in the Matrix.
anon291 6 days ago | root | parent |
In the paper 'Hopfield Networks is All You Need', they calculate the total number of things able to be 'stored' in the attention layers, and it's exponential in the number of parameters. So essentially, you can store more 'ideas' in an LLM than there are particles in the universe. I think we'll be good.
From a technical perspective, this is due to the softmax activation function that causes high degrees of separation between memory points.
Jerrrrrrry 6 days ago | root | parent |
> So essentially, you can store more 'ideas' in an LLM than there are particles in the universe. I think we'll be good.
If it can compress humanity's knowledge corpus to <80GB unquanti-optimized, I take my ironically typo'd double negative, and your seemingly genuine confirmation, to be absolute confirmation: we are fukt
HarHarVeryFunny 7 days ago | next |
An LLM necessarily has to create some sort of internal "model" / representations pursuant to its "predict next word" training goal, given the depth and sophistication of context recognition needed to do well. This isn't an N-gram model restricted to just looking at surface word sequences.
However, the question should be what sort of internal "model" has it built? It seems fashionable to refer to this as a "world model", but IMO this isn't really appropriate, and certainly it's going to be quite different to the predictive representations that any animal that interacts with the world, and learns from those interactions, will have built.
The thing is that an LLM is an auto-regressive model - it is trying to predict continuations of training set samples solely based on word sequences, and is not privy to the world that is actually being described by those word sequences. It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).
The "world model" of a human, or any other animal, is built pursuant to predicting the environment, but not in a purely passive way (such as a multi-modal LLM predicting the next frame in a video). The animal is primarily concerned with predicting the outcomes of its interactions with the environment, driven by the evolutionary pressure to learn to act in a way that maximizes survival and proliferation of its DNA. This is the nature of a real "world model" - it's modelling the world (as perceived thru sensory inputs) as a dynamical process reacting to the actions of the animal. This is very different to the passive "context patterns" learnt by an LLM that are merely predicting auto-regressive continuations (whether just words, or multi-modal video frames/etc).