September 19, 2024


“First text, then images, now OpenAI has a model to generate videos,” screamed Mashable the other day. The makers of ChatGPT and Dall-E have just announced Sora, a text-to-video diffusion model. Excited comments all over the web about what will no doubt be known as T2V, covering the usual spectrum – from “Is this the end of [insert threatened activity here]?” to “meh” and everything in between.

Sora (the name is Japanese for “sky”) isn’t the first T2V tool, but it looks more sophisticated than earlier efforts like Meta’s Make-A-Video. It can turn a short text description into a detailed, high-definition film clip up to a minute long. For example, the prompt “A cat wakes up its sleeping owner and demands breakfast. The owner tries to ignore the cat, but the cat tries new tactics, and eventually the owner pulls out his secret stash of treats from under the pillow to hold the cat off a little longer” produces a slick video clip that would go viral on any social network.

Cute, huh? Well, up to a point. OpenAI seems uncharacteristically candid about the tool’s limitations. For example, it may “struggle to accurately simulate the physics of a complex scene”.

That’s putting it mildly. One of the videos in its example set illustrates the model’s problems. The prompt that produced the film was “Photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee”. At first glance, it is impressive. But then one notices that one of the ships is moving rapidly in an inexplicable way, and it becomes clear that while Sora knows a lot about the reflection of light in liquids, it knows little or nothing about the physical laws that govern the movement of galleons.

Other limitations: Sora can be a bit vague about cause and effect; “a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark”. Tut, tut. It can also “confuse spatial details of a prompt, for example, mixing up left and right”. And so on.

Still, it’s a start, and it will no doubt get better with another billion teraflops of computing power. And even though Hollywood studio bosses can sleep easily in their king-sized beds, Sora will soon be good enough to replace some types of stock video, just as AIs like Midjourney and Dall-E are replacing Shutterstock-type photography.

Despite its admissions about the tool’s limitations, OpenAI says Sora “serves as a foundation for models that can understand and simulate the real world”. This, it says, will be a “significant milestone” on the way to achieving artificial general intelligence (AGI).

And this is where things get interesting. Remember, OpenAI’s corporate goal is to achieve the holy grail of AGI, and the company seems to believe that generative AIs represent a tangible step toward that goal. The problem is that getting to AGI means building machines that have an understanding of the real world that is at least on a par with ours. Among other things, this requires an understanding of the physics of objects in motion. So the implicit bet in the OpenAI project seems to be that one day, given enough computing power, machines capable of predicting how pixels on a screen will move will also learn how the physical objects they depict will behave in real life. In other words, it is a bet that extrapolation of the machine-learning paradigm will eventually bring us to superintelligent machines.

But AIs capable of navigating the real world will need to understand more than how the laws of physics work in that world. They will also have to figure out how people work in it. And to anyone who has followed the work of Alison Gopnik, this seems like a bit of a stretch for the kind of machines the world currently considers “AI”.

Gopnik is known for her research on how children learn. Watching her TED talk, What Do Babies Think?, would be a salutary experience for technologists who think that technology is the answer to the intelligence question. Decades of research examining the sophisticated intelligence-gathering and decision-making that babies do when they play has led her to conclude that “babies and young children are like the R&D department of the human species”. After spending a year watching our granddaughter’s first year of development, and especially observing her begin to figure out causation, this columnist is inclined to agree. If Sam Altman and the guys at OpenAI are really interested in AGI, maybe they should spend some time with babies.

What I have read

Algorithmic politics
Henry Farrell has written a seminal essay on the political economy of AI.

Bot habits
There is a reflective piece in the Atlantic by Albert Fox Cahn and Bruce Schneier on how chatbots will change the way we talk.

No call
Science fiction writer Charlie Stross wrote a blog post about why Britain could not implement conscription, even if it wanted to.
