Mix Magazine

This installment of The Bitstream column appeared in the August 2003 issue of Mix Magazine.

The Bitstream

This column discusses the application of heuristics to digital audio…

Just Hum A Few Bars

By now, you’ve probably come across some appliance or service that recognizes human speech…your cell phone perhaps, or the customer service call–in line at your credit card company. What you may not have realized is that a related technology is at work, instigated by “The Man,” and put in place to listen to radio and TV transmissions solely to recognize your songs and performances. Why anyone would set up these little music spies, and what’s going on with this technology, is the topic of this month’s column.

There are several different machines that recognize audio, whether it be speech or music and, by and large, they all share one thing in common. They need to “listen to” and process a sample of any material that they will later recognize or match. This is an application of heuristics, learning from practical experience. As I mentioned before, several specialized audio recognizers of human speech are available from IBM, MacSpeech and ScanSoft, and I can tell you from seemingly endless hours of “practical experience” that machine recognition of continuous or natural speech is one of the toughest problems in computing.

Anyway, recognition of music is a good bit easier as any particular performance, once it’s recorded, is “etched in stone,” so to speak. The spectral makeup, timing and amplitude variations are fixed and only global gain changes, noise and distortion are added when the performance is reproduced. That fact has spawned several vendors selling recognition tools and services. One of these is Comparisonics Corporation, whose search service lets you type in descriptors and returns the URLs of sites hosting possible sounds that match your needs. This can be useful for multimedia producers and musicians who are hunting for that perfect effect or sample. Another heuristic audio search product is SoundFisher, a cross–platform database management system featuring content-based recognition, matching and retrieval.
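To make the “etched in stone” point concrete, here is a minimal Python sketch of a toy fingerprint. It is not how any of the vendors named here do it; the function names (`dominant_bins`, `tone`), the 64-sample frame and the test signal are all assumptions made up for illustration. The idea is simply that the strongest frequency in each short frame of a recording does not change when the copy is turned down or picks up a little hum.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (an arbitrary choice for this sketch)

def dominant_bins(samples, frame_size=FRAME):
    """Toy fingerprint: the index of the strongest DFT bin in each frame."""
    fingerprint = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        best_bin, best_mag = 0, -1.0
        # Skip bin 0 (DC) and scan only the positive-frequency half.
        for k in range(1, frame_size // 2):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            if abs(acc) > best_mag:
                best_bin, best_mag = k, abs(acc)
        fingerprint.append(best_bin)
    return fingerprint

def tone(freq_bin, frame_size=FRAME):
    """One frame of a sine wave that lands exactly on the given DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * n / frame_size)
            for n in range(frame_size)]

# A three-"note" recording, and a reproduction of it that is 12 dB quieter
# and has a small constant hum added (both hypothetical test signals).
recording = tone(5) + tone(9) + tone(3)
noisy_copy = [0.25 * x + 0.02 * math.sin(2 * math.pi * 17 * n / FRAME)
              for n, x in enumerate(recording)]

print(dominant_bins(recording))
print(dominant_bins(recording) == dominant_bins(noisy_copy))
```

Both signals yield the same sequence of bins, so a lookup keyed on that sequence would match the degraded copy to the original. A real system has to cope with far worse, such as time offsets, codecs and heavy distortion, which this sketch ignores.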

A more interesting and difficult application of music recognition technology is for digital rights management and performance metrics…this is where those machine spies come in. Two companies, Audible Magic and Relatable, are using their audio feature identification smarts to monitor network traffic, especially P2P activity, plus recordings on optical and magnetic media along with radio broadcasts. Audible Magic, in particular, has acquired quite a few companies, including SoundFisher’s developer, in an effort to be the one stop shop for controlling content in the chaotic world of modern media.

Both Relatable and Audible Magic have products that sniff IP packets and “listen” to the audio being carried within file transfers. They’ve tried to go beyond mere identification to actually block illegal files but, so far, it hasn’t worked as planned. The computational and network resources needed to recognize, validate and block illegal music–carrying packets in real time are still some way off.

A third company, the solution provider formerly known as Cantametrix, is now part of Gracenote, those CDDB guys. For those of you who don’t get out much, CDDB is the largest commercial database of CD metadata, which many MP3 player applications rely on to provide disc and song titles. According to Gracenote, their “information services are used by leading media players including AOL’s Winamp, Apple’s iTunes and RealNetworks’ RealOne Player.” Leading CE manufacturers, including Pioneer, Philips and Sony, incorporate Gracenote’s CDDB technology into their latest generation of home, mobile and portable music products.

In addition to all the commercial products I’ve already mentioned, there are several Open Source or freely downloadable software whatsitz that do the heuristics dance. One is MusicBrainz’s Tagger, a Win application that “allows you to automatically look up the tracks in your music collection and then write clean metadata tags (ID3 tags or Vorbis comment fields) to your files. As you tag the files in your collection that MusicBrainz didn’t recognize, you submit the acoustic fingerprints (TRM IDs) of your files back to the server. Submitting acoustic fingerprints will allow MusicBrainz to automatically identify these tracks in the future, so that other people using the Tagger can benefit…” TRM IDs are profiles typically generated by Relatable’s TRM audio fingerprinting technology. A version of TRM’s audio feature extraction client was used by the MusicBrainz project.

Another no–cost machine is SWMUMDIS, a “…universal tool to develop and explore audio representations that process the ridges” of a preprocessed spectrogram. SWMUMDIS is a demonstration of research principles and not a product, even by Open Source standards, but it does serve as a departure point for further development by pointy-headed programmers.

Other uses for music recognition include automatic quality assessment and visualization of parameters such as spectral content, which makes rapid identification of sections easier for editing. Another utility application is quality control. The International Telecommunication Union (ITU) created the PEAQ or Perceptual Evaluation of Audio Quality standard for objective machine evaluation of perceptually coded audio, of which the MP3 codec is a widespread example. Basically, PEAQ software “listens” to incoming audio, makes an evaluation of its quality based on a model of human hearing and that subjective factor we refer to as “quality,” and then rates the audio in real time. This is invaluable for broadcasters, replicators and anyone who needs a way to monitor their “product,” while never tiring or growing bored with the program material. PEAQ’s quality assessment is based on a group of trained human listeners whose talents were baked into software. PEAQ–based products are available as software–only and hardware implementations.

These days, the audio data sniffing field is crowded enough that participants are vying for mind share by claiming the fastest recognition time…“I can name that tune in a dozen notes!” “Hah, I laugh at your algorithms! I can name that tune in half as many!” …and so it goes until, at some point, the programs will be able to name that tune with just one note, and then we can all retire and let the computers do our work for us. The world of machine intelligence and audio recognition may someday provide a truly useful product to, say, automatically assemble a soundtrack for your life but, until then, audio recognition remains a useful tool primarily for bean counters and intellectual property cops. Just remember that, even in space, something can hear your Stratocaster scream…


OMas’ computer autoassembled this column while he was preparing a delicately toasted cheese sandwich. All that time, he and his PowerBook were under the influence of Morcheeba’s latest, Charango, and the wide ranging styles of new Brit pop kids Delays. (Here’s a live performance collection from the Beeb…)

Pedant In A Box

The buzz word for this month is…


A spectrogram is a visualization technique for acoustic events or audio material. Spectrograms plot frequency against time, with amplitude shown as intensity or color, and can be generated in real time or out of real time. Nowadays, most spectrograms map amplitude to a predefined color table to visually clarify the plot. Forensic investigators, audio restorers and speech pathologists routinely employ spectrograms in their work.

The following two spectrograms are from SoundHack and Frequency, the poor man’s Retouch. The color plot from SoundHack shows a stereo folk rock AIFF file; notice that the tempo appears as almost a grid of vertical beats. The monochrome Frequency screenshot displays my voice.

Figure 1 — An “Input Sonogram” from SoundHack

Figure 2 — Frequency’s sonogram display

The selected utterance in the right pane is the word “SCSI.” For both, the X axis is time from left to right while the Y axis is frequency.
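For the curious, the same time-versus-frequency layout can be sketched in a few lines of plain Python. This is only a rough illustration and not how SoundHack or Frequency work internally: the function names (`spectrogram`, `render`), the 32-sample frame, the lack of any window function and the character “shades” standing in for a color table are all assumptions of this sketch.

```python
import cmath
import math

def spectrogram(samples, frame_size=32, hop=32):
    """Return one row per time frame; each row lists magnitude per frequency bin."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):  # bins from DC up to Nyquist
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            row.append(abs(acc))
        rows.append(row)
    return rows

def render(rows, shades=" .:*#"):
    """Draw time left to right and frequency bottom to top, darker = louder."""
    peak = max(max(row) for row in rows) or 1.0
    top = len(shades) - 1
    lines = []
    for k in reversed(range(len(rows[0]))):  # highest frequency on the first line
        lines.append("".join(shades[min(top, int(row[k] / peak * top + 0.5))]
                             for row in rows))
    return "\n".join(lines)

def tone(freq_bin, frames, frame_size=32):
    """A sine wave sitting exactly on one DFT bin, lasting the given number of frames."""
    return [math.sin(2 * math.pi * freq_bin * n / frame_size)
            for n in range(frames * frame_size)]

# Hypothetical test signal: a low tone followed by a higher one.
signal = tone(2, 4) + tone(10, 4)
rows = spectrogram(signal)
picture = render(rows)
print(picture)
```

The printout shows a dark bar low on the left that jumps to a higher bar on the right, which is the low tone giving way to the high one. Real tools add overlapping frames, a window function and a proper color table, but the axes are the same as in the two figures.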