Why memory is the biggest bottleneck for AI

Technology · 6 min listen

Transcript

Host: We hear so much about these new chips that can do trillions of things a second. It sounds like they have all the muscle they could ever need to run the world. But lately, people who build these systems are saying the real wall isn't the speed of the chip, it's the memory. I was thinking about this while my phone was struggling to load a big file. We focus so much on the brain of the machine, but maybe the problem is actually the desk space it has to work on. Why is the memory part of the machine suddenly the thing holding us back?

Guest: It helps to think of a chef who can chop a hundred onions a minute. He's incredibly fast. His hands are a blur. But if he only has a tiny cutting board, he has to stop every few seconds to clear it off or wait for someone to bring him a new bowl. That's what's happening inside our most powerful computers. We built these chips that are like super-fast chefs, but the table where they keep their data, the memory, is too small and too far away. The chip spends most of its life just sitting there, twiddling its thumbs, waiting for a piece of data to show up so it can actually do its job. In the industry, we call this the memory wall. The chips got much faster over the last few decades, but the speed of moving data in and out of memory didn't keep up. The gap between them is now a giant canyon.

Host: But if the chip is sitting idle, that seems like a massive waste of money. If I buy a fast car, I want to drive it fast. Why don't we just make the road to the memory wider so more data can get through at once?

Guest: We're trying, but we have run into some hard laws of nature. Electricity can only move so fast through wires. In a normal computer, the memory sits a few inches away from the brain. In the world of AI, an inch is like a long day's journey. By the time the data travels that inch, the chip has already finished its last task and is bored. So now we stack the memory chips into a tower, like a skyscraper, and put that tower right up against the processor, just to keep the distance short. This is what people mean when they talk about high bandwidth memory. It's basically a stack of memory chips bonded together, with thousands of tiny holes poked through them for data to flow. It's a nightmare to make. If enough of those tiny holes are even slightly off, the whole stack can end up as junk.

Host: It still feels like a solvable problem. We have been making memory sticks for our home PCs for years. Why can't we just plug in more of those? My laptop has plenty of RAM and it was pretty cheap.

Guest: The kind of memory in your laptop is like a slow, steady garden hose. It works fine for opening a web page or a Word document. But AI is more like trying to put out a forest fire. When you ask a chatbot a question, it doesn't just look up one fact. It has to pull billions of numbers out of its memory to figure out what the next word should be. And it has to do that for every single word it types. If a thousand people are using it at once, that garden hose would just burst. You need a massive pipe. To get that kind of flow, you have to use that skyscraper design I mentioned. The problem is that these stacks of memory are now often more expensive than the chip itself. We have reached a point where the brain is cheap, but the memory is the luxury item.
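A rough back-of-the-envelope sketch of the garden-hose point, in Python. It assumes the simplest possible picture: every number in the model is read from memory once for each word produced, so memory bandwidth alone sets a ceiling on speed. The model size, the bytes per number, and both bandwidth figures are illustrative assumptions, not numbers from the conversation.

```python
def max_words_per_second(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Ceiling on words per second for one user, if every model number
    must be read from memory once per generated word."""
    bytes_per_word = param_count * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_word

# All values below are assumptions chosen for illustration.
PARAMS = 70e9            # a hypothetical model with 70 billion numbers
BYTES_PER_PARAM = 2      # 16-bit numbers
LAPTOP_BANDWIDTH = 50e9  # roughly 50 GB/s, a "garden hose"
HBM_BANDWIDTH = 3e12     # roughly 3 TB/s, a "massive pipe"

laptop_rate = max_words_per_second(PARAMS, BYTES_PER_PARAM, LAPTOP_BANDWIDTH)
hbm_rate = max_words_per_second(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH)

print(f"laptop-style memory: {laptop_rate:.2f} words per second")
print(f"stacked memory:      {hbm_rate:.2f} words per second")
```

Under these assumptions the limit has nothing to do with how fast the chip can calculate; only the width of the pipe changes between the two cases.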

Host: So that explains why these big AI models are so expensive to run. But I also notice they get slower the more I talk to them. If I give an AI a really long document to read, it starts to feel sluggish or it starts repeating itself. Is that a memory issue too?

Guest: Yeah, that's what we call the context window. Think of it like your own short-term memory. If I tell you a phone number, you can hold it in your head easily. If I read you a whole book and then ask what happened on page ten, you have to go back and look. AI has to keep every word of your current chat in its active memory so it can stay on track. The more you talk, the more of that precious, expensive desk space you're using up. Once that space is full, the AI has to start making tough choices. It might forget the beginning of the chat, or it has to spend a lot of time moving data around to make room. That's why long chats feel heavy. The machine is literally running out of room to think.
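One common way this shows up in transformer-style models is the key-value cache, which stores a pair of vectors for every word in the chat at every layer of the model. The Python sketch below estimates how that cache grows. The layer count, head count, and vector size are assumed values for a hypothetical mid-sized model, and the conversation itself does not name this mechanism.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Memory needed to remember `tokens` words of chat history.
    The leading 2 counts one key vector and one value vector per token."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

# Assumed shape of a hypothetical model; real models vary widely.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for tokens in (1_000, 10_000, 100_000):
    size_gb = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
    print(f"{tokens:>7} tokens of chat -> about {size_gb:.1f} GB of fast memory")
```

The growth is linear: a chat ten times longer needs ten times the desk space, and that space comes out of the same scarce, expensive memory that holds the model itself.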

Host: Maybe we're just building these things the wrong way then. If the memory is such a huge bottleneck, why don't we just make the AI models smaller so they don't need to remember as much? It feels like we're just throwing more hardware at a bad design.

Guest: People are working on that, but there's a real trade-off. A smaller model is like a smaller brain. It's faster and fits on the desk, but it isn't as smart. It might get facts wrong or lose the thread of a complex story. So we're in this weird spot. We want the biggest, smartest brains, but we can't build a desk big enough to hold all their thoughts. And there's one more catch. Moving data actually takes more power than crunching the numbers. Most of the heat coming off an AI chip isn't from the chip thinking; it's from the energy needed to drag data back and forth across those tiny wires. We might hit a power wall before we even solve the memory wall.
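To put a rough scale on that last point, the Python sketch below compares the energy to do one multiplication with the energy to fetch one number from off-chip memory. The two per-operation energies are ballpark figures of the kind often quoted for older chip processes; they are assumptions here, vary a lot between chips, and are not numbers from the conversation.

```python
# Assumed ballpark energies in picojoules (pJ); illustrative only.
PJ_PER_MULTIPLY = 3.7      # one 32-bit floating-point multiply
PJ_PER_MEMORY_READ = 640.0  # fetching one 32-bit number from off-chip memory

def energy_joules(operations, pj_per_operation):
    """Total energy in joules for a number of identical operations."""
    return operations * pj_per_operation * 1e-12

NUMBERS_PER_WORD = 10e9  # hypothetical: 10 billion numbers touched per word

compute_j = energy_joules(NUMBERS_PER_WORD, PJ_PER_MULTIPLY)
movement_j = energy_joules(NUMBERS_PER_WORD, PJ_PER_MEMORY_READ)
ratio = movement_j / compute_j

print(f"arithmetic:    {compute_j:.3f} J per word")
print(f"data movement: {movement_j:.3f} J per word")
print(f"moving the data costs about {ratio:.0f}x the arithmetic")
```

Real chips reuse data in small on-chip memories to shrink this gap, so the sketch shows the worst case where every number makes the full trip.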

Host: The energy it takes to move a thought is becoming more expensive than the thought itself.

Guest: The cost of just shifting those billions of numbers around is the real limit on how smart these machines can get.

Host: Our super-fast chefs are still just waiting for someone to build a bigger table.
