AI has a big problem with truth and correctness, and human thinking seems to be a big part of that problem. A new generation of AI is now starting to take a more experimental approach, potentially allowing machine learning to leapfrog far beyond humans.
Remember DeepMind’s AlphaGo? It was a fundamental breakthrough in AI development, because it was one of the first game-playing AIs that took in neither human instructions nor a rulebook to study.
Instead, it used a technique called self-play reinforcement learning to build its own understanding of the game: pure trial and error, across millions and even billions of virtual games. Start by pulling the available levers more or less randomly, and learn from the results.
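To make the idea concrete, here’s a minimal sketch of self-play reinforcement learning, using the far simpler game of Nim instead of Go. Everything in it (the game, the tabular value estimates, the constants) is an illustrative stand-in of mine, not anything from DeepMind’s actual system:

```python
import random

# Toy self-play reinforcement learning on Nim (remove 1-3 stones per
# turn; whoever takes the last stone wins). The agent learns purely by
# playing against itself and learning from wins and losses.

PILE, MOVES = 15, (1, 2, 3)
EPSILON, ALPHA = 0.1, 0.5           # exploration rate, learning rate
Q = {}                              # Q[(stones_left, move)] -> value estimate

def choose(stones, explore=True):
    """Mostly pick the move with the best learned value; sometimes
    pull a lever at random -- the trial-and-error part."""
    legal = [m for m in MOVES if m <= stones]
    if explore and random.random() < EPSILON:
        return random.choice(legal)
    return max(legal, key=lambda m: Q.get((stones, m), 0.0))

for episode in range(50_000):
    stones, player = PILE, 0
    moves_by = ([], [])             # (state, move) history for each player
    while stones > 0:
        move = choose(stones)
        moves_by[player].append((stones, move))
        stones -= move
        player ^= 1
    winner = player ^ 1             # whoever just took the last stone
    for p in (0, 1):
        reward = 1.0 if p == winner else -1.0
        for state, move in moves_by[p]:
            old = Q.get((state, move), 0.0)
            Q[(state, move)] = old + ALPHA * (reward - old)

# With no human strategy ever shown to it, the agent rediscovers Nim's
# known optimal play: always leave the opponent a multiple of 4 stones.
for stones in range(1, PILE + 1):
    print(stones, "->", choose(stones, explore=False))
```

Run it long enough and the value table converges on Nim’s known winning strategy, learned from nothing but wins and losses.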
Within two years of the project’s inception in 2014, AlphaGo had defeated the European Go champion 5-0, and by 2017 it had defeated the world’s No. 1-ranked human Go player.
DeepMind then unleashed a similar model, AlphaZero, on the chess world, where engines like Deep Blue, built on human thinking, knowledge, and rulesets, had been beating human grandmasters since the ’90s. AlphaZero played 100 games against the reigning AI champion, Stockfish, winning 28 and drawing the rest.
Human thinking puts the brakes on AI
DeepMind and other AI labs abandoned the idea that emulating humans was the best way to get good results, and went on to dominate games like shogi, Dota 2, and StarCraft II.
These electronic brains, bound by different limitations and endowed with different talents than our own, were given the freedom to interact with the games on their own terms, exercise their own cognitive strengths, and build up their own foundational understanding of what works and what doesn’t.
AlphaZero doesn’t know chess the way Magnus Carlsen does. It’s never heard of the Queen’s Gambit or studied the great grandmasters. It’s simply played a staggering amount of chess, building its own understanding of the game’s cold, hard win/lose logic along the way, in an inhuman and incomprehensible language of its own making.
You will know that the RL has been done properly when the model no longer speaks English in the chain of thought.
— Andrej Karpathy (@karpathy) September 16, 2024
As a result, it comprehensively outperforms any model trained on human play. Against a sufficiently sophisticated reinforcement learning agent, neither humans nor models trained on human thinking stand a chance in a game of chess.
And according to the people better placed to know than anyone else on the planet, something similar is starting to happen with the latest and greatest version of ChatGPT.
OpenAI’s new o1 model begins to diverge from human thinking
Like those early chess AIs, ChatGPT and other large language model (LLM) AIs are trained on more or less the sum of all available human knowledge: the entire written output of our species.
And they’ve become very, very good. All the sweet talk about whether they’ll ever achieve artificial general intelligence aside, can you imagine a single human trying to compete with GPT-4o across the full breadth of its capabilities?
But LLMs specialize in language, not in determining factual truth. That’s why they “hallucinate,” serving up false information in beautifully worded sentences, delivered with all the confidence of a newscaster.
Language is a strange mass of gray areas, where answers are rarely 100% right or wrong, so LLMs are typically trained using reinforcement learning from human feedback. In other words, humans pick which answers sound closer to the kind of answer they were looking for. But tests and code have clear success/fail conditions: either you got it right, or you didn’t.
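To sketch that difference in toy form (these two functions are purely illustrative stand-ins of mine, not anything from a real training stack):

```python
# Two ways to score a model's answer, in miniature.

def human_preference_reward(answer: str, preferred: str) -> float:
    """RLHF-style signal: a human (or a reward model trained on human
    rankings) judges how close the answer *feels* to what was wanted.
    Faked here with crude word overlap -- gray areas everywhere."""
    a, p = set(answer.lower().split()), set(preferred.lower().split())
    return len(a & p) / max(len(p), 1)

def verifiable_reward(answer: str, expected: str) -> float:
    """Test/code-style signal: a hard pass/fail check.
    Either you got it right or you didn't."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

print(human_preference_reward("Paris is the capital",
                              "The capital of France is Paris"))  # ~0.67, fuzzy partial credit
print(verifiable_reward("42", "42"))  # 1.0
print(verifiable_reward("41", "42"))  # 0.0
```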
And this is where the new o1 model starts to diverge from human thinking, bringing in that spectacularly effective AlphaGo approach of pure trial and error in pursuit of correct results.
o1’s baby steps into reinforcement learning
In many ways, o1 is much the same as its predecessors, except that it takes some “thinking time” before it starts responding to a prompt. During that thinking time, o1 generates a “chain of thought” in which it considers and reasons its way through the problem.
And this is where the RL approach comes into play. Unlike previous models, which were more like the world’s most advanced autocomplete systems, o1 actually “cares” whether it gets things right or wrong. And for part of its training, it was given the freedom to approach problems with a random trial-and-error approach in its chain-of-thought reasoning.
It still only had human-generated reasoning steps to work with, but it was free to apply them at random and draw its own conclusions about which steps, in which order, were most likely to lead it toward a correct answer.
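Here’s one way to picture that process, as a toy. This is my sketch of the general idea only; OpenAI’s actual training method is unpublished, and every name and number below is an illustrative stand-in:

```python
import random
from collections import defaultdict

# Toy trial and error over reasoning steps: the "model" chains primitive
# steps together, and only chains that end in a verifiably correct answer
# get reinforced. Wrong chains get nothing.

STEPS = {"add3": lambda x: x + 3, "double": lambda x: x * 2, "sub1": lambda x: x - 1}
START, TARGET, LENGTH = 2, 9, 4       # solvable, e.g. double, add3, add3, sub1

# policy[state] -> weights over next steps, where "state" is the running value
policy = defaultdict(lambda: {name: 1.0 for name in STEPS})

def sample_chain():
    x, chain = START, []
    for _ in range(LENGTH):
        w = policy[x]
        step = random.choices(list(w), weights=list(w.values()))[0]
        chain.append((x, step))
        x = STEPS[step](x)
    return chain, x

def success_rate(n=2_000):
    return sum(sample_chain()[1] == TARGET for _ in range(n)) / n

print("before training:", success_rate())   # roughly 4% by blind luck
for _ in range(30_000):                      # millions, in the real thing
    chain, result = sample_chain()
    if result == TARGET:                     # the clear right/wrong check
        for state, step in chain:            # reinforce steps that worked
            policy[state][step] += 0.5
print("after training: ", success_rate())   # most chains now land on 9
```

The point of the toy: nobody tells the model which reasoning steps are good. It simply notices, over a huge number of attempts, which steps tend to show up in chains that end in a verified right answer.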
In that sense, o1 is the first LLM to really start building that strange but super-effective AlphaGo-style “understanding” of problem spaces. In the domains where it’s now pushing beyond Ph.D.-level skills and knowledge, it got there essentially by trial and error: chancing upon correct answers over millions of self-generated attempts, and building up its own theories of which reasoning steps are useful and which aren’t.
So, on topics with clear right and wrong answers, we’re now starting to watch this alien intelligence take its first steps past us, on its own two feet. And if the games world is any analogy for real life, then folks, we know what happens from here: it’s a sprinter that’ll keep accelerating for as long as you give it energy.
But o1 is still primarily trained on human language, and language is a very different thing from truth. It’s a low-resolution, crude representation of reality. Put it this way: you can describe a biscuit to me all day long, but I’ll still never have tasted one.
So what happens if we stop trying to describe the truth of the physical world to AIs, and just give them the biscuit? We’re about to start finding out, as we begin embedding AIs in robot bodies and letting them develop their own fundamental understanding of how the physical world works.
AI’s path to ultimate truth
Freed from the crude approximations of humans like Newton, Einstein, and Hawking, embodied AIs will take that same bizarre AlphaGo-style approach to understanding the world: poking and prodding at reality, observing the results, and building up their own theories, in their own languages, about what works, what doesn’t, and why.
They won’t approach reality the way humans and animals do. They won’t use a scientific method like ours, or split things into disciplines like physics and chemistry, or run the same kinds of experiments that helped humanity master the matter, forces, and energy sources around it on the way to dominating the world.
Embodied AIs given the freedom to learn this way will be hilariously weird. They’ll do the strangest things you can imagine, for reasons known only to themselves, and in the process they’ll create and discover new knowledge that humans could never have pieced together.
Unfettered by our language and our thinking, they’ll smash through the limits of our knowledge, discovering cosmic truths and new technologies that humans wouldn’t stumble on in a billion years.
There’s a bit of breathing room here, mind you. Unlike most things in the LLM world, this isn’t something that’ll happen in a matter of days or weeks.
Reality is the highest-resolution system we know of, and the ultimate source of truth. But it’s also vast, and painfully slow to work with. Unlike a simulation, reality forces you to operate at one minute per minute, and with only as many bodies as you’ve actually built.
So embodied AIs attempting to learn from base reality won’t have the fast-and-loose advantages of their language-based forebears, at least initially. But they’ll still learn far faster than evolution ever could, thanks to their ability to pool what they learn across cooperating fleets.
Companies like Tesla, Figure, and Sanctuary AI are working feverishly to build humanoids to a standard that’s commercially useful and cost-competitive with human labor. Once they get there, we’ll be able to build enough robots to start making sense of the physical world from the ground up, by trial and error, at speed and at scale.
They’ll have to pay their own way, though. It’s funny to think these humanoids might be learning how to master the universe during their breaks between work shifts.
Apologies for the rather esoteric and speculative train of thought here, but I keep catching myself thinking: what a wonderful time to be alive.
OpenAI’s o1 model might not look like a quantum leap, sitting there in GPT’s drab text-based clothing like just another invisible terminal typist. But it really is a step change in the development of AI, and a glimpse of exactly how these alien machines will eventually overtake humans in every conceivable way.
To learn more about how o1 represents a revolution in AI development through reinforcement learning, we highly recommend the video below from the excellent AI Explained channel:
o1 – What’s going on? Why o1 is the third paradigm of models and 10 things you may not know
Source: OpenAI / AI Explained