From Data Dumps to Daily Journeys: The Rise of Experiential AI
“We stand on the threshold of a new era in artificial intelligence that promises to achieve an unprecedented level of ability. A new generation of agents will acquire superhuman capabilities by *learning predominantly from experience.*”
~ David Silver and Richard Sutton
AI training data sets originally consisted of speech, image, text, and video files. While development did not progress cleanly from one modality to the next, each was generally developed separately.
Speech recognition work began as early as the 1950s with Bell Labs' "Audrey" system, and significant developments continued through the 1970s with the application of Hidden Markov Models. These early systems initially focused on recognizing digits rather than words; IBM's "Shoebox" in the 1960s understood just 16 English words.
Image recognition saw major breakthroughs much later, with AlexNet in 2012 representing a pivotal moment in deep learning history when it dramatically outperformed other approaches in the ImageNet challenge.
The earliest roots of text and natural language processing emerged in the 1940s after World War II, with initial efforts focused on machine translation. The 1954 Georgetown-IBM demonstration, which translated Russian sentences into English on an IBM 701 mainframe, marked one of the first significant achievements in the field.
The Transformer architecture, introduced in 2017, marked another turning point. This innovation led to powerful models like BERT (2018) and the GPT series, with GPT-3 (2020) containing 175 billion parameters. These models demonstrated unprecedented capabilities in understanding and generating natural language.
Then came effective multimodal training (around 2024, though it had been attempted earlier), in which AIs were trained on image, text, audio, and video files simultaneously. The goal of multimodal training is to develop AI models that can process and find relationships between different types of data, typically images, video, audio, and text. Techniques such as contrastive learning, masked-token prediction, and audio-visual correspondence force the network to discover which text snippets pair with which pixels or sound frames.
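To make that pairing objective concrete, here is a minimal sketch of CLIP-style contrastive training in PyTorch. The tiny encoders and the random batch are stand-ins of my own (real systems use large vision and text transformers trained on billions of pairs); the point is the loss, which pulls matching image-caption pairs together in a shared embedding space while pushing mismatched pairs apart.

```python
# Minimal CLIP-style contrastive sketch; encoders are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(3 * 32 * 32, dim)

    def forward(self, images):
        # Flatten a 32x32 RGB image and project it into the shared space.
        return self.net(images.flatten(1))

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)  # averages token embeddings

    def forward(self, token_ids):
        return self.embed(token_ids)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, then compute all pairwise image-text similarities in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    # The i-th image belongs with the i-th caption; every other pairing is a negative.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One training step on a batch of (image, caption) pairs.
images = torch.randn(8, 3, 32, 32)          # 8 random "images"
captions = torch.randint(0, 1000, (8, 12))  # 8 random 12-token "captions"
img_enc, txt_enc = ToyImageEncoder(), ToyTextEncoder()
loss = contrastive_loss(img_enc(images), txt_enc(captions))
loss.backward()  # gradients pull matched pairs together, push mismatches apart
```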
This led to a number of advances:
Enhanced contextual understanding across modalities, allowing models to recognize connections between concepts represented in different forms
More coherent and accurate image generation that better aligned with textual descriptions and real-world knowledge
Improved accessibility features like more natural text-to-speech and more accurate speech recognition
Better video content analysis, including understanding scenes, actions, and temporal relationships
More sophisticated reasoning about real-world objects and relationships as represented across different media types
Reduced biases and inconsistencies that emerged from single-modality training
The ability to perform cross-modal translation tasks (describing images in text, generating images from text, etc.)
Development of more comprehensive world models that could integrate information across sensory domains
This also led to cool products. Once you share a latent space in training, you can caption an image, answer a spoken question about it, or describe a video frame, all with a single model. Products like Meta's Ray-Ban smart glasses and Google's Project Astra demonstrate this in practice.
Tools were also integrated into AI systems to expand their capabilities beyond simple pattern recognition and text generation, letting them overcome inherent limitations by connecting to external knowledge sources and specialized functions. This integration transformed AI assistants into more capable systems that could acknowledge their knowledge gaps and actively seek accurate, up-to-date information rather than relying solely on pre-trained parameters. For example, rather than potentially generating an incorrect or outdated answer about recent scientific discoveries, an AI with a research tool could recognize its uncertainty, search reputable sources, and provide a verified response with proper citations. This tool-augmented approach significantly improved accuracy in domains requiring specialized or current knowledge while keeping the AI's natural language abilities as the user-friendly interface, effectively combining the model's reasoning with the vast, constantly updated information available on the internet.
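Here is a minimal sketch of that recognize-uncertainty-then-search pattern. Everything in it (the draft model, the confidence flag, and the web_search stub) is a hypothetical stand-in of my own, not any particular vendor's API; the point is the control flow, not the components.

```python
# Hypothetical tool-augmented answer loop: answer from parameters when
# confident, otherwise retrieve evidence and ground the answer in it.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confident: bool
    sources: list

def model_draft(question: str) -> Answer:
    # Stand-in for the base model's parametric answer plus a self-assessed
    # confidence signal (e.g., from token probabilities or a separate verifier).
    return Answer(text="(draft answer from pre-trained parameters)",
                  confident=False, sources=[])

def web_search(query: str) -> list:
    # Hypothetical retrieval tool; a real system would call a search API
    # and return snippets with their URLs for citation.
    return [f"snippet about '{query}' from a reputable source"]

def answer_with_tools(question: str) -> Answer:
    draft = model_draft(question)
    if draft.confident:
        return draft  # parametric knowledge is good enough, answer directly
    # Otherwise acknowledge the gap and ground the answer in retrieved evidence.
    evidence = web_search(question)
    return Answer(text=f"Based on current sources: {evidence[0]}",
                  confident=True, sources=evidence)

print(answer_with_tools("What did the latest Mars sample analysis find?").text)
```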
We are now moving beyond an era in which AIs were trained only on sophisticated but pre-defined data sets and then augmented with tools. We are entering an era in which AIs are trained by their own interactions.
A recent paper by Silver and Sutton argues that AI is entering an “Era of Experience,” where models improve primarily by acting in, and learning from, their own long-running streams of interaction rather than passively absorbing static text or image corpora. In this paradigm, an agent autonomously probes digital and physical environments, receives grounded reward signals (heart-rate changes, error rates, sales figures, and so on), and updates itself continuously, much like a scientist refining hypotheses through experiments. Because the data supply grows with every new action the agent takes, experiential learning sidesteps today's looming ceiling on high-quality human data and opens the door to truly novel capabilities, from self-designed materials science experiments to multi-year personalized tutoring plans.
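A stripped-down version of that act-reward-update loop looks something like the toy bandit below. This is my own illustration, not the paper's setup: in the Era of Experience framing the policy would be a large model and the reward a grounded signal such as error rates or user outcomes, but the shape of the loop is the same.

```python
# Toy online-learning loop: the agent generates its own training data by
# acting, receiving a grounded reward, and updating incrementally, forever.
import random

N_ACTIONS = 5
true_payoffs = [0.1, 0.3, 0.5, 0.7, 0.9]   # hidden environment dynamics
value_estimates = [0.0] * N_ACTIONS        # the agent's learned beliefs
counts = [0] * N_ACTIONS
epsilon = 0.1                              # exploration rate

def act() -> int:
    # Mostly exploit the current best estimate, occasionally explore.
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: value_estimates[a])

def environment(action: int) -> float:
    # Grounded reward: 1 with probability true_payoffs[action], else 0.
    return 1.0 if random.random() < true_payoffs[action] else 0.0

for step in range(10_000):
    a = act()                      # act
    r = environment(a)             # receive a grounded reward
    counts[a] += 1
    # Incremental (online) update of beliefs: no retraining on a frozen dataset.
    value_estimates[a] += (r - value_estimates[a]) / counts[a]

print("learned values:", [round(v, 2) for v in value_estimates])
```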
This has significant implications for how what AI models do is understood and taught.
1. An unlimited, ever-improving data pipeline
Every time an experiential agent acts, it creates fresh, high-signal data automatically tailored to its current competence level. This self-generated data stream grows with every action the agent takes and never “runs out,” eliminating the looming ceiling on high-quality human text or image corpora and letting capabilities scale long after static datasets have been exhausted.
2. Freedom from inherited dataset biases
Because the agent’s learning signal comes from consequences in the environment (sensor readouts, user retention, energy usage, profit, etc.) rather than from a frozen snapshot of human writing, its behavior is shaped by ground-truth outcomes, not by the historical quirks, stereotypes, or censorship baked into human-curated corpora. Over many interactions the agent can revise, unlearn, or outweigh early mis-generalizations instead of having them fossilized in its weights.
3. Continuous, lifelong adaptation (no “re-train, redeploy” cycle)
Agents inhabit a single, unbroken stream of experience. They accumulate memories across days, months, or years and update their policy online, the way animals do. That turns learning into a closed feedback loop: sense → act → receive reward → update → repeat. The result is rapid responsiveness to non-stationary worlds (new regulations, software versions, or social trends) without the downtime or cost of periodic offline re-training.
4. Grounded reward signals enable real-world understanding
Tying objectives to concrete, measurable outcomes (faster proof search, lower blood glucose, higher warehouse throughput) forces the model to build causal world models instead of merely predicting the next token. Those world models, in turn, support planning, reasoning, and counter-factual thinking—core AGI ingredients that pure language imitation has struggled to master.
5. Self-curated curricula and open-ended skill growth
Just as AlphaZero generated progressively harder games for itself, experiential agents can design experiments, tasks, or simulations that are always on the edge of their competence. This automatic curriculum keeps the learning signal rich and diverse, encouraging the emergence of new skills (coding tools, robotic manipulation, scientific hypothesis generation) without human hand-labeling. A toy sketch of this self-curriculum idea appears after item 7 below.
6. Cross-domain transfer via unified sensorimotor streams
Because the same policy network updates while writing code, balancing a drone, or tutoring a student, it can discover abstract principles—optimization, decomposition, error correction—that generalize across modalities. Each new domain expands the agent’s latent world model, giving it more conceptual tools to reuse elsewhere, an essential step toward the flexible, domain-general competence we call AGI.
7. A built-in path to safe self-improvement
When rewards are tightly coupled to long-horizon metrics humans actually care about, we can monitor—and, if needed, modify—the agent’s incentives while it learns. Continual, online alignment feedback lets designers adjust goals before small mis-alignments compound into catastrophes, offering a more scalable safety lever than one-shot instruction-tuning on static preference datasets.
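As promised under item 5, here is a toy sketch of an automatic curriculum. The assumptions are mine: skill is a single number, tasks have a scalar difficulty, and success depends on noisy skill exceeding that difficulty. The only real content is the selection rule, which keeps proposing tasks just beyond the agent's current competence.

```python
# Toy automatic curriculum: the agent keeps choosing tasks at the edge of
# its competence, so the learning signal never goes stale.
import random

skill = 1.0
LEARN_RATE = 0.05

def attempt(task_difficulty: float) -> bool:
    # Success is more likely on tasks at or below the current skill level.
    return random.gauss(skill, 0.5) > task_difficulty

def pick_task(current_skill: float) -> float:
    # Self-curated curriculum: propose something slightly harder than what
    # the agent can already do, so every attempt carries a learning signal.
    return current_skill + random.uniform(0.0, 0.5)

for step in range(5000):
    task = pick_task(skill)
    if attempt(task):
        # Succeeding on a frontier task pulls skill up toward that difficulty.
        skill += LEARN_RATE * (task - skill)

print(f"skill after self-curriculum: {skill:.2f}")
```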
For students and educators, the takeaway is that understanding today's LLM internals is only a starting point. Today's large language models may soon be replaced by more advanced technologies; some experts, such as Yann LeCun, predict that we might move beyond LLMs within two years. At best, LLMs will likely be one component of a human-level AI architecture (Goertzel, Hassabis, and others).
This means students and educators must focus on developing broader theoretical understanding of AI and its concepts. This includes understanding fundamental AI principles like data management, pattern recognition, data classification strategies, association rules, regression analysis, and statistical modeling approaches (Dasey).
The most valuable educational approach combines foundational AI theory and concepts with critical thinking abilities, collaborative problem-solving, effective communication, and resilience in the face of rapid technological change (me, and everyone else who has given this significant thought).
Just as AI models will update themselves continuously, our educational curricula and personal skill sets must evolve at a similar pace to ensure humans remain effective partners in an increasingly intelligent technological ecosystem.