Transcript
Host: It always feels a bit like magic when you're in a loud, crowded bar and you hold your phone up to a speaker to find out what song is playing. Even with all that shouting and the sound of glasses clinking, the app gets it right in just a few seconds. I have been wondering how it manages to hear the music at all through all that noise.
Guest: Well, the secret is that the app doesn't really listen to the music the way we do. It actually ignores almost everything it hears and focuses on just a few tiny bits of the track. To make sense of the sound, the app first turns the vibrations in the air into a picture. We usually think of audio as a wavy line, but the app builds something called a spectrogram. Think of it like a map of the sound. On the bottom, you have time moving forward. On the side, you have the pitch, from low bass to high chirps. And the loudness of each sound is shown by how bright the color is.
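For readers who want to see the "map of the sound" in code, here is a minimal sketch of a spectrogram in Python. It is not Shazam's implementation: the function name `spectrogram`, the frame size, the hop, and the Hann window are all assumptions chosen for illustration, and the slow textbook Fourier transform is used only to keep the example dependency-free.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of time frames; each frame lists loudness per pitch bin.

    frame_size and hop are illustrative values, not ones from the transcript.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper the frame edges with a Hann window to reduce spectral smearing.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Naive discrete Fourier transform: one magnitude per frequency bin.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure test tone that completes 8 cycles in every 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)

# The brightest pitch bin in the first time frame.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each inner list is one column of the picture (a moment in time), and each value in it is the brightness at one pitch. A real system would use a fast Fourier transform library instead of the nested loops here.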
Host: That sounds a bit like looking at a heat map for a song. But I don't see how turning a sound into a picture actually makes the search faster.
Guest: It changes the whole math of the problem. Instead of comparing two sound files, which is slow and heavy, the app uses math for spotting patterns in images. It treats the song like a digital picture rather than a sound file, which lets it move much faster.
Host: But a bar isn't a clean picture. It's messy. You have people talking, wind blowing, and chairs moving. I would think all of that would clutter up the image.
Guest: That's where the app gets really picky. It only looks for the very brightest spots on that map. These are the highest-energy peaks, the loudest parts of the song at any given moment. Usually, those are things like a heavy drum hit, a sharp vocal note, or a guitar riff. It ignores about ninety-nine percent of the audio data and just keeps these few bright dots. It's like looking at the night sky. Most of the sky is dark, but you can see the brightest stars. Because the noise of a crowded room is much quieter than the loudest parts of the music, the noise just falls below the cutoff and disappears. You're left with a sparse constellation map of the song.
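The "brightest stars" idea can be sketched as a local-maximum search with a loudness cutoff. This is an illustration under assumptions, not the app's actual code: the function `find_peaks`, the neighborhood size, the cutoff value, and the tiny hand-made grid are all hypothetical.

```python
def find_peaks(spec, neighborhood=1, min_magnitude=1.0):
    """Keep only cells that are loud and brighter than all nearby cells.

    spec[t][f] is the loudness at time step t and pitch bin f.
    neighborhood and min_magnitude are illustrative parameters.
    """
    peaks = []
    for t, frame in enumerate(spec):
        for f, mag in enumerate(frame):
            if mag < min_magnitude:
                continue  # quiet background falls below the cutoff
            is_peak = True
            for dt in range(-neighborhood, neighborhood + 1):
                for df in range(-neighborhood, neighborhood + 1):
                    tt, ff = t + dt, f + df
                    if (dt, df) == (0, 0):
                        continue
                    if 0 <= tt < len(spec) and 0 <= ff < len(frame):
                        if spec[tt][ff] > mag:
                            is_peak = False
            if is_peak:
                peaks.append((t, f))
    return peaks


# Made-up 4x4 grid: faint background chatter plus two loud musical events.
grid = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 9.0, 0.3, 0.1],
    [0.1, 0.4, 0.2, 5.0],
    [0.0, 0.1, 0.3, 0.2],
]
constellation = find_peaks(grid)  # [(1, 1), (2, 3)]
```

Of the sixteen cells, only the two loud ones survive as (time, pitch) dots; everything else is discarded.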
Host: Okay, so we have a map of dots. But there are millions of songs out there. A lot of them must use the same notes at the same volume. I'm not sure how a few dots can tell one song from another.
Guest: You're right, one dot isn't enough to name a song. The big breakthrough came from a researcher named Avery Wang, who helped start Shazam. He realized you shouldn't just look at the dots on their own. Instead, the app pairs them up. It picks one dot as an anchor and links it to another dot that happens a moment later. Then it records three simple details: the pitch of the first dot, the pitch of the second dot, and the exact amount of time between them.
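The pairing step might look something like the following sketch. The function `pair_peaks` and its `fan_out` and `max_dt` limits are assumptions added for illustration; the transcript only says that an anchor is linked to a dot that comes a moment later.

```python
def pair_peaks(peaks, fan_out=3, max_dt=10):
    """Link each anchor dot to a few dots that occur shortly after it.

    peaks: list of (time, pitch) dots.
    Returns tuples of (anchor_pitch, target_pitch, time_gap, anchor_time).
    fan_out and max_dt are illustrative limits.
    """
    ordered = sorted(peaks)
    pairs = []
    for i, (t1, f1) in enumerate(ordered):
        linked = 0
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue  # skip dots at the same instant as the anchor
            if dt > max_dt:
                break  # later dots are even further away in time
            pairs.append((f1, f2, dt, t1))
            linked += 1
            if linked == fan_out:
                break
    return pairs


# Four hypothetical dots as (time, pitch).
dots = [(0, 10), (2, 25), (3, 7), (20, 12)]
pairs = pair_peaks(dots, fan_out=2, max_dt=10)
# [(10, 25, 2, 0), (10, 7, 3, 0), (25, 7, 1, 2)]
```

The first three values in each tuple are the three details the guest lists. The anchor time is carried along as well so the pair can later be placed on the song's timeline.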
Host: So it's not just looking for one note, it's looking for how two notes fit together in time?
Guest: Yeah, and it squishes those three details into one single number called a hash. That number is like a unique fingerprint. Because it captures the connection between the dots rather than the dots themselves, the fingerprint stays the same. It doesn't matter if your phone microphone is cheap or if the sound is a bit fuzzy. That link between the notes is still there.
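One common way to squish three small numbers into one is bit packing, sketched below. The bit widths and the names `make_hash` and `unpack_hash` are assumptions for illustration; the transcript does not say how the real hash is laid out.

```python
# Illustrative bit widths (assumptions, not values from the transcript).
FREQ_BITS = 10  # room for pitch bins 0..1023
DT_BITS = 12    # room for time gaps 0..4095


def make_hash(f1, f2, dt):
    """Pack anchor pitch, target pitch, and time gap into a single integer."""
    if not (0 <= f1 < (1 << FREQ_BITS) and 0 <= f2 < (1 << FREQ_BITS)):
        raise ValueError("pitch bin out of range")
    if not (0 <= dt < (1 << DT_BITS)):
        raise ValueError("time gap out of range")
    return (f1 << (FREQ_BITS + DT_BITS)) | (f2 << DT_BITS) | dt


def unpack_hash(h):
    """Recover (f1, f2, dt) from a packed hash, mainly for debugging."""
    dt = h & ((1 << DT_BITS) - 1)
    f2 = (h >> DT_BITS) & ((1 << FREQ_BITS) - 1)
    f1 = h >> (FREQ_BITS + DT_BITS)
    return f1, f2, dt


fingerprint = make_hash(10, 25, 2)
```

Note that the hash contains only the two pitches and the gap between them, not when the pair occurred. The same pair of notes therefore produces the same number wherever the listener starts recording.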
Host: Even with those fingerprints, there must be billions of them in the database. How does it find the right one so quickly when I only give it a few seconds of a song?
Guest: When your phone sends that list of hashes to the big database, it doesn't just look for matches. It looks for a steady gap in time. For example, the app might find fifty matches for those little fingerprints. It then checks to see if all fifty of them are in the right place. If all those matches are exactly forty-two seconds behind where they sit in the studio version of a track, the app knows it found a match.
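The "steady gap in time" check can be sketched as a vote over time offsets. Everything here is hypothetical: the functions `build_index` and `best_match`, the song names, and the hash values are made up, and a production database would be far larger and organized differently.

```python
from collections import Counter, defaultdict


def build_index(songs):
    """Map each hash to every (song_id, time_in_song) where it occurs."""
    index = defaultdict(list)
    for song_id, entries in songs.items():
        for h, t_song in entries:
            index[h].append((song_id, t_song))
    return index


def best_match(index, query):
    """Vote for (song, offset) pairs and return the most consistent one.

    query: list of (hash, time_in_recording).
    Returns (song_id, offset, votes), or None if nothing matched.
    """
    votes = Counter()
    for h, t_query in query:
        for song_id, t_song in index.get(h, ()):
            # A true match shows the same offset for every matching hash.
            votes[(song_id, t_song - t_query)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count


# Made-up catalog: each entry is (hash, time in the studio track).
catalog = {
    "song_a": [(111, 40), (222, 42), (333, 45), (444, 47)],
    "song_b": [(111, 5), (333, 90), (555, 12)],
}
index = build_index(catalog)

# Hashes heard by the phone, timed from the start of the recording.
heard = [(111, 0), (222, 2), (333, 5), (999, 6)]
result = best_match(index, heard)  # ("song_a", 40, 3)
```

Both songs contain some of the heard hashes, but only in song_a do three of them agree on a single offset of 40 time units. The two hits in song_b land at unrelated offsets, so they never build up a winning vote.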
Host: So that's why it can catch a song even if I start recording right in the middle of a guitar solo or at the very end of the chorus.
Guest: It just needs to find that one steady time offset. It can search through millions of songs and billions of these dot pairings in the blink of an eye because the data it sends is so light and simple.
Host: It's a strange thought that while we're straining to hear a melody over the roar of a bar, the phone is just connecting the brightest stars in a silent picture.