Powering Progress: The Fusion of Scaling Advances in Artificial Cognitive and Robotic Intelligence
Introduction
It’s been a while since I wrote a post updating AI developments. There are a couple of reasons for this.
I’ve been busy with the start of the debate season and some presentations.
Developments, even at a broad scale, have been rapid. I’ve always tried to focus on the bigger picture, but bracketing and conceptualizing the role of every development is difficult, and the pace of change at the application layer compounds the challenge. Writing a post that simply lists developments is easy; I hope to add unique value by contextualizing them, which is hard.
That said, I’ll take a shot at this. Here it goes —
The development of superintelligent AI is moving at a breakneck pace, with physical embodiment becoming increasingly feasible. While we can't pinpoint exactly when we'll reach human-level AI, the trajectory is clear, and more and more scientists say that we should prepare for the possibility that it could be here within 5 years or less.
Their point is simple: we need to begin preparing now, because we can’t wait until it arrives to act if we are going to manage the transition, and we shouldn’t assume it will take longer than 5 years, even if it ends up doing so.
I’ll add that, honestly, it doesn’t matter that much when we get there; the capabilities emerging along the path to AGI will change our world whether or not AGI itself is ever attained. I make this point in every presentation, and Ethan Mollick published a long post about it yesterday.
In today’s post, I’ll look at two broad-brush developments in cognitive intelligence (for lack of a better term) and then at developments in robotics. These will be fused together to create superintelligent creatures. In 20 years or less, today’s chatbots will seem like amoebas.
In the next post, I’ll focus on some new applications.
Advances in Cognitive Intelligence
Former OpenAI AGI Safety Director Miles Brundage posted on Saturday:
For those of you keeping track, the only item on the list that computers can’t currently do is the last one.
They are certainly getting better at tests.
These advances can be attributed to two basic phenomena, though other factors are involved (e.g., fine-tuning, training on specific data, RLHF, RAG, etc.).
Scaling at Inference
Scaling at inference is referenced in the last paragraph of the above screenshot. Inference is the stage at which the model processes the query or prompt a user sends it.
Brundage is referring to the idea that models think a bit (and will soon think more) before responding to a prompt.
It is easiest to understand this through a poker example, which is how OpenAI’s Noam Brown first proved out the idea.
Brown discovered that training methods designed to teach AIs to play poker better than humans were missing something fundamental: how humans think before acting. Instead of just having AI systems learn from playing millions of poker hands, he developed a system that mimicked how professional players think through their decisions before making a play. This "chain of thought" approach meant the AI would consider multiple scenarios and their consequences before moving, just like a human player would pause to think through their options. The critical difference was teaching the AI to "think first, act second."
Rather than immediately responding to situations based on pattern recognition from previous hands, Brown's system would analyze potential outcomes, consider opponent reactions, and evaluate the long-term implications of each decision. This methodical approach, combined with the ability to process vast amounts of data, gave the AI a significant advantage over human players, who could only think through a limited number of scenarios.
It hasn’t lost to a human since.
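To make “think first, act second” concrete, here is a minimal sketch in Python. It is not Brown’s actual system (which used far more sophisticated search and counterfactual reasoning); it simply contrasts a pattern-matched snap decision with a deliberate policy that simulates each candidate action against many possible opponent hands before choosing. The game, payoffs, and thresholds are invented for illustration.

```python
import random

# Toy betting game: the agent holds a hand strength in [0, 1] and can
# fold, call, or raise. The payoffs and opponent model are invented
# purely for illustration; this is not a real poker engine.

ACTIONS = ["fold", "call", "raise"]

def payoff(action, my_strength, opp_strength):
    """Invented payoff table for a single betting decision."""
    if action == "fold":
        return -1                            # forfeit the ante
    win = my_strength > opp_strength
    if action == "call":
        return 2 if win else -2
    return 4 if win else -4                  # raise: bigger pot, bigger risk

def fast_policy(my_strength):
    """'Think fast': one pattern-matched response, no lookahead."""
    if my_strength > 0.8:
        return "raise"
    return "call" if my_strength > 0.4 else "fold"

def deliberate_policy(my_strength, n_simulations=1000):
    """'Think first, act second': simulate each action against sampled
    opponent hands and choose the one with the best expected payoff."""
    best_action, best_value = None, float("-inf")
    for action in ACTIONS:
        value = sum(
            payoff(action, my_strength, random.random())
            for _ in range(n_simulations)
        ) / n_simulations
        if value > best_value:
            best_action, best_value = action, value
    return best_action

print("snap decision:", fast_policy(0.55))
print("after 'thinking':", deliberate_policy(0.55))
```

The deliberate policy spends far more compute per decision, which is exactly the trade that scaling at inference makes: more “thinking” per answer in exchange for a better answer.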
This think-first approach is considered the foundation of o1, OpenAI’s reasoning model in ChatGPT. While it is still in preview, the above charts show that it is consequential.
This is similar to the distinction between “system 1” and “system 2” thinking, popularized by psychologist Daniel Kahneman, whose work with Amos Tversky on judgment and decision-making laid the foundations of behavioral economics.
The critical differences between system 1 and system 2 thinking are:
System 1 thinking:
Automatic, intuitive, and fast.
Requires little effort or conscious control.
Operates based on heuristics, biases, and gut reactions.
Examples include recognizing faces, reading words, and basic arithmetic.
System 2 thinking:
Deliberate, analytical, and slow.
Requires focused attention and conscious effort.
Involves logical reasoning, computations, and problem-solving.
Examples include solving complex math problems, critically evaluating arguments, and making important decisions.
Kahneman and Tversky found that humans often rely on system 1 thinking, which is more efficient but can lead to systematic biases and errors. In contrast, system 2 thinking is more effortful but can overcome the limitations of system 1 and lead to more rational and accurate decisions.
Scaling at Training
Models are also being improved through scaling at training, which simply means training AIs on more data using more compute. Experts like Andrej Karpathy, Fei-Fei Li, and Dario Amodei have said that while theoretical limits to scaling might exist, no one has identified them, and we haven’t hit them yet. This means companies continue expanding the size and complexity of AI models, exploring new frontiers in reasoning, language processing, and decision-making.
To support this growth, companies are leveraging extensive GPU clusters. For example, both xAI and Meta train large models on vast clusters, and Anthropic and OpenAI likely use similar setups. This infrastructure allows for continuous increases in model size and data handling capabilities, paving the way for training models at unprecedented scales.
These models had already been trained on essentially the entire public internet, but now they are also trained on proprietary and synthetic data. Companies can amplify the training material by generating high-quality synthetic data rather than relying solely on real-world data. Some estimates suggest that upcoming models could be trained on around 50 trillion tokens (roughly 40 trillion words, plus image and video data; see below), an unimaginable amount just a few years ago. This combination of increased hardware capacity, synthetic data, and continuous scaling suggests that the potential for even larger, more capable models remains vast, with no hard limits in sight.
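To give a sense of what numbers like 50 trillion tokens imply, here is a back-of-the-envelope estimate using the widely cited approximation that training a dense transformer costs roughly 6 × N × D floating-point operations, where N is the parameter count and D is the number of training tokens. The parameter count, cluster size, and per-GPU throughput below are assumptions for illustration, not the specs of any announced model or cluster.

```python
# Back-of-the-envelope training cost using the common ~6 * N * D FLOPs
# approximation for dense transformers. All concrete numbers below
# (parameter count, token count, GPU throughput) are illustrative
# assumptions, not specs of any real model.

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

def gpu_days(flops: float, gpus: int, flops_per_gpu: float, utilization: float) -> float:
    seconds = flops / (gpus * flops_per_gpu * utilization)
    return seconds / 86_400

params = 1e12          # assume a 1-trillion-parameter model
tokens = 50e12         # ~50 trillion training tokens, per the estimate above
flops = training_flops(params, tokens)   # about 3e26 FLOPs

days = gpu_days(
    flops,
    gpus=100_000,              # a hypothetical 100,000-GPU cluster
    flops_per_gpu=1e15,        # ~1 PFLOP/s of usable low-precision throughput per GPU
    utilization=0.4,           # real-world utilization is well below peak
)
print(f"total training compute: {flops:.1e} FLOPs")
print(f"wall-clock estimate:    {days:.0f} days on the assumed cluster")
```

Even with generous assumptions, a run at this scale occupies an enormous cluster for months, which is why the GPU buildout described above matters so much.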
And, of course, the models no longer just train on text. In multimodal training, AI models learn from various types of data, such as images, audio, video, sensor data, and text. This approach allows the models to develop a more comprehensive understanding of the world, capturing context and associations across different modalities. For example, when trained on text and images, a multimodal model can better understand and generate descriptions of visual scenes or recognize objects within images based on contextual language. Similarly, incorporating audio can help the model understand and respond to spoken language or analyze and interpret sounds like music and environmental noise. By synthesizing information from multiple sources, multimodal models can tackle a wider range of tasks, from creating detailed image captions to answering complex questions requiring visual and textual comprehension, thereby expanding the boundaries of AI capabilities.
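For readers who like to see the mechanics, here is a toy PyTorch-style sketch of the multimodal idea: each modality gets its own encoder that projects raw inputs into a shared embedding space, the resulting tokens are concatenated into one sequence, and a single transformer processes them together. Real systems are vastly larger and more sophisticated; the dimensions, module names, and shapes here are invented for illustration.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Illustrative sketch: per-modality encoders feeding one shared transformer.
    Dimensions and architecture choices are arbitrary, not from any real model."""

    def __init__(self, vocab_size=32_000, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)      # text tokens -> vectors
        self.image_proj = nn.Linear(768, d_model)                 # image patch features -> same space
        self.audio_proj = nn.Linear(128, d_model)                 # audio frame features -> same space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)   # shared trunk over all modalities
        self.head = nn.Linear(d_model, vocab_size)                # predict next text token

    def forward(self, text_ids, image_patches, audio_frames):
        tokens = torch.cat(
            [
                self.text_embed(text_ids),        # (batch, T_text, d_model)
                self.image_proj(image_patches),   # (batch, T_img,  d_model)
                self.audio_proj(audio_frames),    # (batch, T_aud,  d_model)
            ],
            dim=1,                                # one combined sequence of tokens
        )
        return self.head(self.trunk(tokens))

# Random inputs just to show that the shapes line up.
model = ToyMultimodalModel()
out = model(
    torch.randint(0, 32_000, (1, 16)),   # 16 text tokens
    torch.randn(1, 9, 768),              # 9 image patches
    torch.randn(1, 20, 128),             # 20 audio frames
)
print(out.shape)  # torch.Size([1, 45, 32000])
```

The design choice that matters is the shared trunk: once images, audio, and text live in the same token space, the model can learn associations across them rather than treating each modality in isolation.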
The Combined Significance
The combination of massive scaling during training and inference (or “thinking”) will likely advance machine intelligence significantly. Scaling at training creates a foundational understanding that enhances the model’s ability to perform complex tasks, understand nuanced language, and learn across various knowledge areas. The depth and breadth of learning achieved through this scaling give the model a “generalized” knowledge foundation that can be applied to a wide range of tasks.
Scaling at inference builds on this foundation by enabling models to dynamically adapt and respond more intelligently to specific, complex queries or tasks. It lets models allocate more computational resources as needed, deepening their capacity to "reason" through difficult or unfamiliar problems rather than relying solely on fixed responses derived from their training phase. This ability to apply computational power adaptively allows for iterative problem-solving, where the model can internally refine answers by checking assumptions, testing multiple potential solutions, or drawing from a more comprehensive set of its learned knowledge.
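A stripped-down way to see how extra inference-time compute buys better answers, independent of anything learned during training, is a generate-and-refine loop: produce a candidate answer, score it with a verifier, revise, and keep the best. The generate and score functions below are hypothetical stand-ins (a real system would call a language model and a learned verifier); the point is only that a larger "thinking" budget yields lower error.

```python
import random

TARGET = 42  # pretend this is the correct answer we hope to converge on

def generate(prompt, previous=None):
    """Stand-in for a model call: a noisy guess, or a revision of a prior attempt."""
    base = previous if previous is not None else random.uniform(0, 100)
    return base + random.uniform(-10, 10)

def score(answer):
    """Stand-in for a verifier: higher is better (closer to the target)."""
    return -abs(answer - TARGET)

def answer_with_budget(prompt, budget):
    """Spend `budget` generations refining the answer, keeping the best so far."""
    best = generate(prompt)
    for _ in range(budget - 1):
        candidate = generate(prompt, previous=best)   # revise the current best
        if score(candidate) > score(best):
            best = candidate
    return best

for budget in (1, 8, 64):
    answers = [answer_with_budget("toy question", budget) for _ in range(200)]
    avg_error = sum(abs(a - TARGET) for a in answers) / len(answers)
    print(f"thinking budget = {budget:>2}  average error = {avg_error:5.1f}")
```

The trend is the same one the reasoning models exploit, except they do the searching and self-checking internally rather than through an external loop.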
Together, these forms of scaling bring machine intelligence closer to human-like flexibility and insight. They allow models not just to recognize patterns but to actively engage in problem-solving, logical reasoning, and dynamic decision-making. This synergy between training and inference scaling lays the groundwork for more autonomous and contextually aware AI systems capable of handling tasks that demand deeper understanding and complex, multi-step reasoning.
Advances in Robotics and “Bodily Intelligence”
The most recent notable development in robotics was the unveiling of EngineAI Robotics' SE01 humanoid robot, a breakthrough in human-like movement. It is the first robot to demonstrate a natural walking pattern without the typical robotic gait, using end-to-end neural networks and specialized harmonic joint modules.
As robots move about the physical world, they will have to keep learning how to do so, and several new developments from Meta are helping them do exactly that.
On top of Meta’s efforts, MIT recently introduced a new way to train robots that relies on vast amounts of diverse data, similar to how large language models (LLMs) like GPT-4 are trained. Traditional methods, where robots learn by imitating humans, often fail when small changes arise, like different lighting or obstacles, because the robots lack enough data to adapt. To address this, the MIT team developed Heterogeneous Pretrained Transformers (HPT), an architecture that pools data from many different sensors, robots, and environments into a shared representation. The team found that performance improved as the transformer grew larger, and a user only has to specify the robot's design and the task it should perform. This is expected to lead to further breakthroughs.
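As I understand the HPT idea, the trick is to map very different inputs (camera features from one robot, joint angles from another) into a shared token space so that one transformer trunk can be pretrained across all of them, with small per-robot heads on top. The sketch below is a hypothetical toy version of that pattern, not MIT's actual code, and every dimension in it is invented.

```python
import torch
import torch.nn as nn

class SharedTrunkPolicy(nn.Module):
    """Toy version of the heterogeneous-pretraining pattern: per-robot
    encoders ("stems") map different sensor suites into a common token
    space, a shared transformer trunk processes them, and per-robot heads
    decode actions. Not MIT's HPT implementation; dimensions are invented."""

    def __init__(self, d_model=128):
        super().__init__()
        # One stem per robot/sensor configuration (feature sizes are made up).
        self.stems = nn.ModuleDict({
            "arm_with_camera": nn.Linear(512, d_model),    # image features
            "quadruped_proprio": nn.Linear(48, d_model),   # joint angles + IMU
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)  # shared across robots
        # One head per robot, sized to its action space (also made up).
        self.heads = nn.ModuleDict({
            "arm_with_camera": nn.Linear(d_model, 7),      # 7-DoF arm actions
            "quadruped_proprio": nn.Linear(d_model, 12),   # 12 joint targets
        })

    def forward(self, robot: str, observations: torch.Tensor) -> torch.Tensor:
        tokens = self.stems[robot](observations)           # map into the shared space
        features = self.trunk(tokens)                      # shared knowledge lives here
        return self.heads[robot](features.mean(dim=1))     # pool, then decode actions

policy = SharedTrunkPolicy()
print(policy("arm_with_camera", torch.randn(1, 10, 512)).shape)    # torch.Size([1, 7])
print(policy("quadruped_proprio", torch.randn(1, 10, 48)).shape)   # torch.Size([1, 12])
```

The payoff of this design is that whatever the shared trunk learns from one robot's data can, in principle, benefit every other robot that plugs into it.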
Here is another robot, just for fun —
Reinforcing Intelligence
Cognitive and embodied intelligence in robots work together to create more adaptive, effective, and human-like machines, each supporting and enhancing the other.
Cognitive Intelligence involves the robot's ability to process information, make decisions, and learn from data. It enables robots to plan, solve problems, recognize patterns, and understand complex commands. However, a purely cognitive robot, without the embodied intelligence that comes from interacting with the physical world, may struggle in real-world environments because it lacks the physical grounding needed to fully understand and respond to dynamic changes.
Embodied Intelligence relates to the robot's physical presence and ability to interact with the environment through sensors, actuators, and movement. This gives robots spatial awareness and allows them to interpret physical feedback, such as the texture of an object or the force needed to pick something up. Without cognitive intelligence, embodied intelligence is limited to repetitive, programmed tasks, with no adaptation or decision-making.
Robots can achieve more complex behaviors when cognitive and embodied intelligence work together. For example:
Enhanced Learning. Embodied intelligence allows robots to learn by doing, which reinforces cognitive learning. For example, a robot with cognitive intelligence might understand that it needs to grasp an object, but embodied intelligence allows it to adjust its grip based on sensory feedback.
Adaptive Interaction. Embodied intelligence allows the robot to interact with the environment, and cognitive intelligence enables it to interpret those interactions, allowing for more adaptive responses. For example, cognitive intelligence plans the route when navigating a cluttered environment, while embodied intelligence senses and adjusts movement in real time.
Improved Perception and Understanding. Embodied intelligence helps robots gather sensory data, like touch, which cognitive intelligence can interpret and use to refine actions. For instance, a robot vacuum with cognitive intelligence might map the room, while embodied intelligence allows it to feel resistance if it hits an obstacle and adjust its path accordingly (a toy sketch of this loop follows just below).
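Here is a cartoonish sketch of that loop for a vacuum on a grid: the "cognitive" layer plans a route from its internal map, the "embodied" layer detects bumps the map didn't predict, and each bump updates the map so the next plan is better. Everything in it, including the grid and the obstacles, is invented for illustration.

```python
# Toy sense-plan-act loop: a "cognitive" planner works from an internal map,
# while the "embodied" layer detects collisions the map didn't predict and
# updates the map so the next plan improves. All values are illustrative.

GRID = 5
TRUE_OBSTACLES = {(4, 0), (4, 1)}   # the real world, unknown to the robot at first

def plan_path(start, goal, known_obstacles):
    """Cognitive layer: greedy step-by-step plan around *known* obstacles."""
    path, pos = [], start
    while pos != goal:
        candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
        candidates = [
            c for c in candidates
            if 0 <= c[0] < GRID and 0 <= c[1] < GRID and c not in known_obstacles
        ]
        # Pick the neighbor closest to the goal (Manhattan distance).
        pos = min(candidates, key=lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1]))
        path.append(pos)
    return path

def run(start=(0, 0), goal=(4, 4)):
    known, pos = set(), start
    while pos != goal:
        for step in plan_path(pos, goal, known):    # think: plan from the current map
            if step in TRUE_OBSTACLES:              # sense: bump into something real
                known.add(step)                     # learn: update the map
                print(f"bump at {step}, replanning")
                break
            pos = step                              # act: take the step
            if pos == goal:
                break
    print("reached", goal)

run()
```

The planner alone would walk straight into the unmapped obstacles, and the bump sensor alone could only react; together they converge on a route that works.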
The synergy between cognitive and embodied intelligence is a powerful reinforcement that dramatically expands the capabilities of machine intelligence. By combining the data-processing, decision-making, and learning abilities of cognitive intelligence with the physical presence, sensory feedback, and adaptability of embodied intelligence, robots gain a richer, more nuanced understanding of the world around them. This partnership allows robots not only to interpret information but also to experience it in a way that reinforces learning and adaptation over time. The dynamic interaction between these two forms of intelligence will make robots far more adept at handling complex tasks, responding to unpredictable environments, and autonomously improving their own functionality, ushering in an era of robotics that is more attuned, versatile, and integrated into human contexts than ever before.
Conclusion(s)
I’ll leave you to draw your own conclusion(s).
I conclude that we should start thinking more about adapting to this as a society. That’s more important than using AI to generate a quiz or a worksheet about the last industrial era.