How one omnimodel lets a robot see, reason, and act

Technology · 6 min listen

Transcript

Host: It's pretty wild how much we just take for granted when we pick up a glass of water. Our eyes see the glass, our brain figures out how heavy it looks, and then our hand just knows how hard to squeeze so it doesn't slip or break. For a long time, if you wanted a robot to do that, you had to build it in separate pieces like a Lego set. You would give it one brain for seeing, another for talking, and a completely different one for moving its arms. But lately, people are talking about these omnimodels that do everything at once. How do we get from those separate parts to just one big brain that handles everything?

Guest: It really is a massive shift in how we think about machine intelligence. In the old days, and by that I mean just a few years ago, a robot was basically a group of specialists who didn't talk to each other very well. You had one system that was great at naming objects in a picture. You had another that was good at planning a path across a room. The problem was that things got lost in translation. It was like trying to build a house where the architect, the plumber, and the electrician all spoke different languages and could only send each other short, confusing notes. These new omnimodels change that because they use one single language for everything. They don't just see a cup and then tell a different part of the brain to grab it. They process the pixels of the cup and the movement of the arm in the exact same way, at the exact same time. It's all one big stream of information.

Host: So it's not like the eyes are sending a text message to the hand. But if it's all one stream, how does the brain know which part is a picture and which part is a physical shove or a reach?

Guest: That's the clever bit. Think about how a tool like a chatbot works. It takes words and turns them into little bits of code or math. We call these tokens. It turns out you can do that with almost anything. You can turn a patch of color from a camera into a token. You can turn a voice command into a token. And here is the big breakthrough: you can turn the command for a robot to turn its wrist five degrees to the left into a token, too. When you put them all in the same bucket, the model starts to see patterns between them. It learns that when the camera shows a certain shape and the user says the word "apple," the arm tokens usually follow a path toward that shape. It stops being a set of instructions and starts being a single flow.
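
Editor's sketch: one way to picture "everything in the same bucket" is a single shared range of token IDs, with each kind of input given its own slice of that range. The Python below is a minimal illustration under stated assumptions, not how any particular omnimodel is built. The vocabulary sizes, the ±90 degree wrist range, and every function name are hypothetical, and a real system would use learned tokenizers for text and images rather than ready-made IDs.

```python
# Hypothetical vocabulary layout: each modality owns a slice of one ID space.
TEXT_VOCAB = 1000    # assumed number of word tokens
IMAGE_VOCAB = 512    # assumed number of image-patch codes
ACTION_BINS = 256    # assumed number of bins for one joint command

TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
ACTION_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB


def action_to_token(degrees, low=-90.0, high=90.0):
    """Map a continuous wrist command (degrees) to one discrete token ID."""
    clipped = min(max(degrees, low), high)
    # Round to the nearest of ACTION_BINS evenly spaced values.
    bin_index = int((clipped - low) / (high - low) * (ACTION_BINS - 1) + 0.5)
    return ACTION_OFFSET + bin_index


def token_to_action(token, low=-90.0, high=90.0):
    """Recover the (quantized) wrist command from an action token."""
    bin_index = token - ACTION_OFFSET
    return low + bin_index / (ACTION_BINS - 1) * (high - low)


def modality_of(token):
    """Tell which slice of the shared ID space a token falls in."""
    if token < IMAGE_OFFSET:
        return "text"
    if token < ACTION_OFFSET:
        return "image"
    return "action"


def build_sequence(text_ids, image_ids, wrist_degrees):
    """Flatten words, image patches, and arm commands into one token stream."""
    sequence = [TEXT_OFFSET + t for t in text_ids]
    sequence += [IMAGE_OFFSET + i for i in image_ids]
    sequence += [action_to_token(d) for d in wrist_degrees]
    return sequence


# "Turn the wrist five degrees to the left" becomes a single token
# sitting in the same list as the word and image tokens.
stream = build_sequence(text_ids=[7], image_ids=[3], wrist_degrees=[-5.0])
print(stream, [modality_of(t) for t in stream])
```

In this sketch the answer to the Host's question is simply the ID range: the model sees one list of integers, and which slice an ID falls in marks it as a word, a patch, or a movement. Turning the movement into a bin loses a little precision, at most half a bin width, which is the price of treating it like any other token.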

Host: That sounds almost too simple. If I give a robot a single brain that has read the whole internet, does it actually understand what an apple is, or is it just really good at guessing what the next move should be?

Guest: Well, that's where the friction is right now in the field. Some people say it's just a very fancy way of guessing. But think about what happens when you tell a robot to pick up the thing that shouldn't be on the table. An old-school robot would be lost. It would look for an object named "thing that shouldn't be on the table" and find nothing. But an omnimodel has read millions of pages of human talk. It knows that a dirty sock doesn't belong on a kitchen table. Because it has that big-picture knowledge from the internet, it can see the sock, reason that it's out of place, and then move the arm to grab it. It's using its knowledge of the world to guide its physical body. That's something a simple arm-moving program could never do on its own.

Host: But wait, if it's learning from the internet, the internet is full of stuff that has nothing to do with gravity or how much a gallon of milk weighs. How does it learn the actual physical rules of the world without breaking everything it touches?

Guest: You're hitting on a really tough problem. The internet can tell you that a glass is fragile, but it can't tell you exactly how it feels when that glass starts to tip over in your grip. To fix that, researchers have to give these models a bit of real-world practice. They let the robot try things out thousands of times, or they show it videos of people doing tasks. The model starts to blend its book smarts from the internet with its street smarts from the camera. It's kind of like if you read every book ever written about riding a bike and then finally sat on one. You would have a head start, but you still have to feel the balance for yourself. The omnimodel is doing that blending in real time.

Host: Is there a limit to this? I mean, if we just keep making the brain bigger and giving it more data, will it eventually be able to do anything a human can do?

Guest: We're finding that size helps, but it's not everything. There's a real tension between being smart and being fast. If the brain is too big, it takes too long to think, and the robot becomes slow and clumsy. If you're catching a ball, you can't wait three seconds for a giant model to process the pixels. So the goal now is to make these brains lean and quick. We're also finding that they still struggle with very fine tasks, like tying shoelaces or peeling a grape. They're great at the big-picture stuff, like cleaning a room, but the tiny, precise movements still need a lot of work.
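
Editor's sketch: the tension between smart and fast comes down to simple arithmetic. If the model emits one arm command as a handful of tokens, the time to produce them has to fit inside one tick of the robot's control loop. The Python below is a rough illustration only. The token counts, per-token times, and the 10 Hz control rate are hypothetical numbers chosen for the example, not measurements of any real model.

```python
def time_per_action_s(tokens_per_action, seconds_per_token):
    """Time the model needs to emit one complete arm command."""
    return tokens_per_action * seconds_per_token


def fits_control_loop(tokens_per_action, seconds_per_token, control_hz):
    """True if one command can be produced within one control tick."""
    tick_s = 1.0 / control_hz
    return time_per_action_s(tokens_per_action, seconds_per_token) <= tick_s


# Hypothetical large model: 7 action tokens at 0.4 s each is about 2.8 s,
# far too slow for a loop that expects a new command every 0.1 s.
print(fits_control_loop(7, 0.4, control_hz=10))

# Hypothetical lean model: 7 tokens at 0.01 s each is about 0.07 s,
# which fits inside the same 0.1 s tick.
print(fits_control_loop(7, 0.01, control_hz=10))
```

The same check shows why a leaner model can matter more than a smarter one for something like catching a ball: the budget is set by the control rate, not by how good the answer is.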

Host: So we have moved from a robot that's a bunch of separate parts to a robot that's one big, slightly clumsy student trying to learn everything at once.

Guest: The most surprising thing is that when these robots learn to talk and see at the same time, they actually get better at moving, which shows that those skills are more linked than we ever thought.

Host: The dream of a machine that can help out around the house seems a lot closer now that it's finally starting to use its eyes and its hands as one single tool.
