How language models extrapolate outside the training data: A case study in Textualized Gridworld

Exploring how language models can learn to solve complex problems beyond the training boundary with a simple CoT variant

Doyoung Kim1, Jongwon Lee2, Jinho Park1, Minjoon Seo1

1KAIST AI, 2Samsung Research

December 2024

Imagine teaching someone to navigate through a small maze. They quickly learn basic rules: avoid walls, find shortest paths, reach goals. But what happens when you present them with a much larger, more complex maze? Humans handle this leap with remarkable ease. For language models, however, this kind of extrapolation has been a significant challenge.

Recent advances in language models have shown remarkable capabilities across domains, from natural language understanding to complex reasoning. However, their ability to extrapolate beyond their training data remains limited, particularly in structured problem-solving scenarios such as path planning and navigation, even after fine-tuning[1].

Comparison of human and AI planning

The Human Edge in Problem-Solving

Humans demonstrate remarkable abilities in generalizing problem-solving strategies across different scales and complexities. This capability stems from our cognitive architecture, which cognitive scientists have long studied and categorized into two primary systems [2]:

  • System 1: Quick, instinctive responses based on familiar patterns, operating automatically and with little effort
  • System 2: Careful, methodical thinking that creates mental maps of problems, engaging in conscious reasoning and strategic planning

This dual-process theory has profound implications for AI development, particularly in how we approach the challenge of making language models truly cognitive.

Recent approaches, such as Chain of Thought (CoT) reasoning[3], show promise for mimicking System 2 reasoning via in-context learning. Theoretical findings suggest that complex problems outside the `TC^0` (polynomial-sized constant-depth circuits) complexity class can be solved given a sufficient budget of CoT tokens[4]. However, the precise form of CoT required remains unclear. Our research aims to identify the CoT variant that elicits emergent cognitive map-like behavior in language models.

Extrapolation in Gridworld as a Probe of Cognitive Maps

How do we know whether a language model exhibits cognitive map-like behavior? We investigate this through extrapolation, a key characteristic of System 2 reasoning. By training a language model on simple demonstrations and testing its extrapolation capabilities in more complex environments, we can determine whether it has formed something like a cognitive map. Specifically, we use a Gridworld environment for several compelling reasons (a minimal sketch of such an environment follows the list):

  1. Cognitive science foundations: Probing mental representations through spatial tasks is foundational in cognitive science[5]
  2. Minimal knowledge requirements: Gridworld provides an ideal environment to probe the extrapolability of language models while minimizing the influence of world knowledge
  3. Scalability: Gridworld allows us to systematically increase the problem space by expanding the grid dimensions
  4. Complexity class and computational limitations: Gridworld path planning lies outside the `TC^0` complexity class, so a fixed-precision transformer cannot solve it directly[6]
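
To make this concrete, here is a minimal sketch of how a Gridworld instance might be textualized for a language model. The function names and the serialization format are our own illustration, not necessarily the exact format used in the paper:

```python
import random

def make_grid(width, height, n_walls, seed=0):
    """Sample a random Gridworld: walls plus distinct start/goal cells."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    walls = set(rng.sample(cells, n_walls))
    free = [c for c in cells if c not in walls]
    start, goal = rng.sample(free, 2)
    return walls, start, goal

def textualize(width, height, walls, start, goal):
    """Serialize the environment as text the model consumes."""
    return "\n".join([
        f"grid: {width}x{height}",
        f"start: {start}",
        f"goal: {goal}",
        "walls: " + ", ".join(map(str, sorted(walls))),
    ])

walls, start, goal = make_grid(5, 5, n_walls=4)
print(textualize(5, 5, walls, start, goal))
```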

To measure true extrapolability of language models, we designed a rigorous experimental setup:

  • Training Environment: Limited to 10x10 grids maximum
  • Testing Environment: Extended to 20x20 grids maximum
  • Key Challenge: Evaluating extrapolation beyond the training distribution

Our test set features significantly longer input tokens and higher environmental complexity compared to the training set, allowing us to evaluate true extrapolation capabilities.

Training vs Testing Statistics

Heatmaps and statistics of the train/test sets: the heatmaps in (a, b) show average input token length and complexity for different Gridworld sizes a×b. The red box in the test-set heatmap indicates the training-set boundaries.

We observe clear differences between training and testing environments in both input length and complexity (measured by the negative log probability of successfully completing an optimal path under a random policy). Our hypothesis is that models equipped with cognitive map-like reasoning should handle test cases outside the training distribution (beyond the red box) by leveraging their systematic planning capabilities.
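
Concretely, the measure can be read as `C = -log P(complete an optimal path | uniformly random policy)`: at each cell along the path, a random walker picks among the currently valid moves, so each step multiplies the success probability by one over the local branching factor. Below is a minimal sketch under that reading; for simplicity we score a single optimal path found by BFS, whereas the measure described above covers completing any optimal path, and all function names are ours:

```python
import math
from collections import deque

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def neighbors(cell, width, height, walls):
    """Valid moves from a cell: in bounds and not a wall."""
    x, y = cell
    for dx, dy in MOVES:
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in walls:
            yield nxt

def shortest_path(start, goal, width, height, walls):
    """BFS from start; returns one optimal path to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = [cur]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for nxt in neighbors(cur, width, height, walls):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None

def complexity(start, goal, width, height, walls):
    """-log prob. that a uniform random policy follows this optimal path."""
    path = shortest_path(start, goal, width, height, walls)
    assert path is not None, "goal unreachable"
    return sum(math.log(sum(1 for _ in neighbors(c, width, height, walls)))
               for c in path[:-1])
```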

Planning Scenarios

Two planning scenarios: The difference between optimal planning (single-turn) and reachable planning (multi-turn) approaches in solving navigation tasks.

We test two types of planning scenarios in Gridworld: optimal planning and reachable planning. Optimal planning requires producing the shortest path to the goal in a single turn, while reachable planning lets the model interact over multiple turns and only requires that it eventually reach the goal. The key difference lies in the complexity of the planning task and in whether the full plan must be produced offline, before any interaction. Optimal planning demands a comprehensive mental model of the environment, akin to a cognitive map, to find the shortest path efficiently.
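
The two scenarios translate into two different success criteria. Here is a hedged sketch of the evaluation logic; the action names and interfaces are our assumptions, and `optimal_len` would come from a shortest-path search like the BFS above:

```python
DELTAS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def execute(plan, start, width, height, walls):
    """Run an action sequence; fail on a wall or out-of-bounds move."""
    pos = start
    for action in plan:
        dx, dy = DELTAS[action]
        nxt = (pos[0] + dx, pos[1] + dy)
        if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
            return None  # invalid move: the episode fails
        pos = nxt
    return pos

def score(plan, start, goal, width, height, walls, optimal_len):
    """Reachable planning only needs to end at the goal; optimal planning
    additionally requires the plan length to match the shortest path."""
    end = execute(plan, start, width, height, walls)
    reachable_ok = end == goal
    optimal_ok = reachable_ok and len(plan) == optimal_len
    return reachable_ok, optimal_ok
```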

Cognitive Maps for Path Planning: Planning with a World Model through CoT

The cognitive map framework operates through three integrated stages: sampling, propagation, and backtracking. This architecture enables language models to simultaneously 1) build a robust world model of a novel environment and 2) perform model-based planning through CoT reasoning. At the sampling stage, the model extracts key environmental features and relationships. During propagation, it extends these initial observations into a broader cognitive map. The backtracking stage then leverages this representation to identify optimal solutions by evaluating potential paths and their outcomes. Our experiments demonstrate that this approach significantly enhances the model's ability to extrapolate beyond its training data, generating effective solutions for previously unseen scenarios without requiring external feedback or interaction.

Cognitive Map Structure

The three-step process of cognitive map formation and utilization in path planning

Key Components:

  1. Sampling: Systematic exploration of the state space to identify potential paths and obstacles
  2. Propagation: Building a comprehensive mental model of the environment through information sharing between states
  3. Backtracking: Efficient path planning by working backwards from the goal state (see the sketch below)
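
Here is a minimal sketch of how the three stages could fit together on a textualized grid. The breadth-first control flow and the parent-link backtracking follow the description above; the concrete data structures and the verbalized trace format are our own assumptions, not the paper's exact CoT:

```python
from collections import deque

MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}

def build_cognitive_map(start, width, height, walls):
    """Sampling + propagation: expand reachable states outward from the
    start, recording how each new state was reached (a BFS world model)."""
    parent = {start: None}
    frontier = deque([start])
    trace = []  # the textual trace a CoT would verbalize
    while frontier:
        cur = frontier.popleft()
        for name, (dx, dy) in MOVES.items():
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in walls and nxt not in parent):
                parent[nxt] = (cur, name)
                frontier.append(nxt)
                trace.append(f"from {cur} move {name} -> {nxt}")
    return parent, trace

def backtrack(parent, goal):
    """Backtracking: walk parent links from the goal back to the start."""
    if goal not in parent:
        return None  # goal unreachable from the start
    actions, node = [], goal
    while parent[node] is not None:
        node, action = parent[node]
        actions.append(action)
    return actions[::-1]  # breadth-first expansion makes this plan optimal
```

Because propagation is breadth-first, the first time any state enters the map is along a shortest route to it, so backtracking from the goal recovers an optimal plan without any further search.
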
Data Generation Example

An example data instance for optimal planning, consisting of an input specification of the environment and an output containing the CoT and the optimal plan.
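
For intuition, here is a hand-written illustration of such an instance on a 3×3 grid; the exact serialization used in the dataset may differ:

```
Input:
  grid: 3x3, start: (0, 0), goal: (2, 2), walls: (1, 0)
Output (cognitive map CoT):
  from (0, 0) move up -> (0, 1)
  from (0, 1) move right -> (1, 1)
  ...
  backtrack from (2, 2): up, right, right, up
Optimal plan:
  up, right, right, up
```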

Experimental Results

We compared the performance of language models equipped with cognitive maps against baseline models with implicit (directly predicting the next action) and explicit (predicting the next action via conventional CoT) planning capabilities. Here are some key findings:

Comparison of Models

Performance comparison showing significant improvements in extrapolation capabilities with cognitive maps

Our experiments demonstrate that only the models equipped with cognitive maps show significant improvements in extrapolation capabilities, achieving higher success rates in unseen environments and more optimal paths in reachable planning scenarios.

Input and Complexity comparison

Performance comparison of cognitive maps against baseline methods across different dimensions of extrapolation

We further support our analysis by examining model performance as a function of problem complexity and input length. While baseline methods degrade sharply beyond the training boundaries, cognitive maps maintain substantial performance at twice the training thresholds.

Additional Analysis and Discussion

Our experiments with cognitive maps in language models for path planning reveal several intriguing insights that extend beyond extrapolation. Here are some key findings and discussion points:

Rapid Adaptation During Training

Success rate and Loss curves

Success rate and loss curves showing rapid adaptation of the cognitive map approach (green plot)

The models using cognitive maps demonstrate remarkably quick learning and adaptation to new scenarios: the success rate converges early (79.13% at step 500), and the loss curve converges faster than those of the baseline methods.

Unpromptable Nature of Cognitive Maps

Prompting Results

Few-shot prompting results across different models and approaches

Advanced language models often struggle to generate meaningful cognitive maps through simple prompting, underscoring the necessity for specialized training methods. Notably, only GPT-o1 demonstrates some success in creating cognitive maps without explicit prompting.

This success may be attributed to GPT-o1's training or fine-tuning, which likely enhances its reasoning process, encourages exploration of alternative strategies, and improves its ability to iteratively recognize and correct mistakes. Consequently, GPT-o1 may be more proficient at forming mental representations of spatial relationships, even without explicit prompts for cognitive maps.

Comparison with Exploration-based Methods

Apart from offline planning with cognitive maps, there are numerous exploration strategies for language models, including Tree of Thoughts (ToT)[7], which combines language model sampling with value estimation for tree search. We compared the cognitive map approach with exploration-based methods such as DFS (depth-first search), in which the language model explores the environment iteratively until it reaches the goal (sketched below).
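
For reference, here is a minimal sketch of such an exploration loop; the interface is our assumption, and in the actual experiments the language model itself proposes each move and observes the result:

```python
def dfs_explore(start, goal, width, height, walls, max_steps=1000):
    """Online DFS: move through the environment one step at a time,
    backtracking along the visited trail when no new cell is available."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    visited = {start}
    trail = [start]          # current position is trail[-1]
    steps = 0
    while trail and steps < max_steps:
        cur = trail[-1]
        if cur == goal:
            return trail     # the route actually taken, start to goal
        for dx, dy in moves:
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in walls and nxt not in visited):
                visited.add(nxt)
                trail.append(nxt)
                break
        else:
            trail.pop()      # dead end: physically backtrack one step
        steps += 1
    return None              # goal not reached within the step budget
```

Note that the returned trail is whatever route the depth-first traversal happened to take, which is why exploration reaches goals reliably but rarely optimally.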

Exploration-based Planning Comparison

Performance comparison between cognitive maps and exploration-based planning

While exploration-based methods achieve higher success rates in reachable planning scenarios (94.5% vs. 88.5%), cognitive maps demonstrate superior efficiency in path optimality. This mirrors two distinct human planning strategies:

Offline Planning

Using cognitive maps to plan the entire route before taking the first step - like studying a map before starting a journey.

Online Exploration

Exploring and adjusting the route as you go - like using a GPS that recalculates when you take a wrong turn.

Each approach has its own tradeoffs. While offline planning using cognitive maps requires building a robust mental model upfront and can achieve more optimal paths, it demands significant computational resources for comprehensive planning. In contrast, online exploration is more flexible and requires less initial computation, making it easier to implement since it doesn't need a complete world model. However, this comes at the cost of requiring more interactions with the environment and potentially leading to less optimal paths. This mirrors human decision-making, where we sometimes prefer quick, adaptive solutions over carefully planned optimal routes.

Looking Ahead

This research opens up exciting possibilities for the future of AI:

  • Creating AI systems that can tackle increasingly complex problems in generalized tasks
  • Developing models that combine careful planning (offline planning) with flexible adaptation (online exploration)

Key Takeaway

By teaching AI to think more like humans - building mental maps and planning ahead - we can create systems that don't just solve the problems they're trained on, but can tackle new, more complex challenges they've never seen before.

References

[1] Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, and Jonathan Larson. Evaluating cognitive maps and planning in large language models with cogeval, 2023.

[2] Daniel Kahneman. Thinking, Fast and Slow, 2011.

[3] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.

[4] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective, 2024.

[5] John O’Keefe and Lynn Nadel. The Hippocampus as a Cognitive Map. Oxford: Clarendon Press, 1978.

[6] William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers, 2023.

[7] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023.