Devlin's Angle

February 2002

The Math of Online Music Trading

The courts may have put Internet music trading company Napster out of business, but illegal swapping of music files on the web continues unabated, with companies like Audiogalaxy, Kazaa, Morpheus, and WinMX filling the void that Redwood City, California-based Napster left behind.

Yet how many teenagers who clog the phone lines transferring the latest pop songs realize that the entire enterprise is built on mathematics -- that what they are really downloading are streams of numbers, computed using a calculus-based technique first developed in the early nineteenth century?

The mathematics used to convert music into numbers has its origins in the work of the French mathematician Joseph Fourier. Although his interests were in heat dispersal, the technique he discovered enables mathematicians to represent any wave form as a sequence of numbers.

Fourier showed how any function x(t) of time that is periodic (i.e., repeats itself indefinitely at regular intervals of time) can be represented as the sum of an infinite series of sines and cosines. Precisely, if T is the interval at which x(t) repeats itself (so x(t+T) = x(t) for all t), and if we set f = 1/T (f is called the base frequency of x), then x(t) is equal to a constant a(0) plus all terms of the form

a(k)cos(2πkft) + b(k)sin(2πkft)
as k runs from 1 up to infinity. This infinite sum is called the Fourier expansion of the function x(t). By virtue of the Fourier series, the sequences of numbers a(k) and b(k) are uniquely associated with the function x(t). Given a reasonably nice formula for the function x(t), it is possible to compute each of the constants a(k), b(k). The mathematical process that carries out the computation of these constants is called a Fourier transform. (These days, mathematicians use a computational, as opposed to analytical, analog called the discrete Fourier transform, usually computed with the Fast Fourier Transform algorithm, which produces numerical approximations as accurate as desired.)

There is, of course, no possibility of actually computing all of these numbers, since there are infinitely many of them. However, a finite number of them will suffice to give a finite sum of terms of the Fourier series that represents x(t) within any desired degree of approximation. In the case where x(t) is the wave form of a sound wave, this provides a way of coding sound as a sequence of numbers. At least, in principle it does. An obvious question is: how do you go from a sound wave to a mathematical formula x(t) that represents it in a way that lets you apply Fourier's analysis?
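
To see these two facts in action, here is a minimal sketch in Python (assuming the NumPy library is available): it approximates the coefficients a(k), b(k) by numerical integration and then rebuilds the wave from a finite number of terms. The square-wave test signal and the trapezoidal rule are illustrative choices, not part of Fourier's own analysis.

import numpy as np

T = 1.0                                   # period of the test signal
f = 1.0 / T                               # base frequency
t = np.linspace(0.0, T, 4096, endpoint=False)
x = np.sign(np.sin(2 * np.pi * f * t))    # a square wave with period T

def fourier_coefficients(x, t, T, K):
    # Approximate a(0) and a(k), b(k) for k = 1..K by the trapezoidal rule.
    f = 1.0 / T
    a0 = np.trapz(x, t) / T
    a = np.array([2 / T * np.trapz(x * np.cos(2 * np.pi * k * f * t), t) for k in range(1, K + 1)])
    b = np.array([2 / T * np.trapz(x * np.sin(2 * np.pi * k * f * t), t) for k in range(1, K + 1)])
    return a0, a, b

def partial_sum(a0, a, b, t, T):
    # The finite Fourier sum: a(0) plus the first K cosine and sine terms.
    f = 1.0 / T
    xr = np.full_like(t, a0)
    for k in range(1, len(a) + 1):
        xr += a[k - 1] * np.cos(2 * np.pi * k * f * t) + b[k - 1] * np.sin(2 * np.pi * k * f * t)
    return xr

for K in (5, 25, 125):
    a0, a, b = fourier_coefficients(x, t, T, K)
    rms = np.sqrt(np.mean((x - partial_sum(a0, a, b, t, T)) ** 2))
    print(f"{K:3d} terms: rms error {rms:.3f}")

As more terms are kept, the finite sum hugs the original wave ever more closely, apart from some unavoidable ringing near the square wave's jumps.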

A sound wave consists of a ripple in the air. What makes it sound is that our ears, and more generally our hearing system, interpret that air wave as sound. To give a bit more detail, the motion of the air causes the eardrum to vibrate, and in the inner ear those vibrations are converted into tiny electrical currents that flow into the brain. It is those electrical waves that the brain actually experiences as sound. In other words, when it comes down to it, sound is ultimately an electrical wave.

A microphone works in essentially the same way, converting an incoming sound wave in air into an electrical signal. If we feed that electrical signal into a loudspeaker, then the speaker recreates (a copy of) the original sound wave. But we can also do something else to that electrical wave: we can use a method known as sampling to generate a sequence of numbers. The most common procedure is called Pulse Code Modulation (PCM). This takes as input an electrical wave and measures the voltage of the signal at moments of time a small, fixed interval apart. In the case of an audio compact disk, the sampling is done 44,100 times a second. Thus, for each second of sound input, the PCM analog-to-digital converter generates 44,100 numbers, each one the measurement of the voltage at the instant it is sampled.

In the case of a compact disk, each voltage is measured to 16-bit accuracy; that is, the system can distinguish up to 65536 (= 2^16) different voltages. A sample rate of 44,100 per second coupled with 16-bit voltage measurement is sufficient to encode any sound as a sequence of numbers that, when converted back into sound, the human ear cannot distinguish from the original.
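
As a rough illustration of what such a converter produces, here is a short Python sketch (again assuming NumPy); the one-second 440 Hz test tone simply stands in for the electrical signal coming out of a microphone.

import numpy as np

SAMPLE_RATE = 44_100                      # samples per second, the CD standard
BITS = 16                                 # 2**16 = 65,536 distinguishable levels

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of sampling instants
voltage = np.sin(2 * np.pi * 440.0 * t)   # a pure 440 Hz tone, scaled to [-1, 1]

# Quantize each measurement to one of 65,536 signed 16-bit values.
samples = np.round(voltage * (2 ** (BITS - 1) - 1)).astype(np.int16)

print(len(samples), "numbers for one second of sound")   # 44100
print(samples[:5])                                       # the first few sampled values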

However, it takes a lot of storage capacity to capture even a three-minute pop song in this fashion. A typical musical compact disc carries a stereo sound signal: each sampling measures two voltages, one for each of the two stereo channels, and each measurement takes 16 bits, so each second of audio generates 2 x 16 x 44,100 = 1,411,200 bits. Hence, a minute of CD-quality music requires a massive file of roughly 10 megabytes. Given modern compact disk technology, this is fine for the recording and CD industries, but it would create a major problem if people started shipping CD music files around the Internet.
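
Spelled out as a few lines of Python, the arithmetic looks like this.

channels = 2                 # stereo
bits_per_sample = 16
samples_per_second = 44_100

bits_per_second = channels * bits_per_sample * samples_per_second
print(bits_per_second)                                   # 1,411,200 bits per second

megabytes_per_minute = bits_per_second * 60 / 8 / 1_000_000
print(round(megabytes_per_minute, 1), "MB per minute")   # roughly 10 MB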

Anyone with a modern desktop PC is aware that there are algorithms that can compress binary data files so they require less storage space. (PK-ZIP and StuffIt are two well-known examples.) When applied to a typical text file, these packages can reduce the size of the file by as much as 80%, but with CD-quality PCM files the reduction is only around 10%. Algorithms specially designed to operate on PCM files have managed to achieve a 60% reduction, but that is clearly nothing like enough to support Internet music swapping. The familiar WAV format is essentially the raw sampled audio wave written to a file, so it offers little saving either.
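
A small experiment with Python's standard zlib module (the same family of algorithm that ZIP-style tools use) plus NumPy gives a feel for the difference; the synthetic mixture of tones and hiss below is only a stand-in for a real recording.

import zlib
import numpy as np

text = ("the quick brown fox jumps over the lazy dog " * 2000).encode()

rng = np.random.default_rng(0)
t = np.arange(44_100) / 44_100
wave = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 660.0, 880.0))
wave += 0.05 * rng.standard_normal(t.size)        # a little hiss, as in real audio
pcm = np.round(wave / np.max(np.abs(wave)) * 32767).astype(np.int16).tobytes()

for name, data in (("text file", text), ("PCM audio", pcm)):
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: squeezed to {ratio:.1%} of its original size")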

The key to shipping music files over the Internet is to abandon the idea of compressing the entire digital file so that the original sampled sound wave can be reproduced exactly (so-called lossless compression) and instead deliberately discard some of the information (what is called lossy compression). The folks who develop these algorithms begin by understanding how the human hearing system works. The aim, after all, is to produce a digital waveform that, when played back through a good loudspeaker system, sounds like the original sound. Thus, anything in the original sampled sound wave that the human hearing system cannot detect may be discarded. Regardless of how good we think our musical ear might be, it turns out that there is a lot of stuff that can be thrown away without our noticing.

A significant saving of storage space comes from the phenomenon known as audio masking, which occurs when we hear two sounds of different energy levels at nearby frequencies: the louder sound obscures the quieter one. Hence, in coding the sampled signal, the lower-energy component that is masked can be ignored. In practice, this is done as follows. Our auditory system appears to divide incoming sound into separate frequency bands, arranged logarithmically, with more bands at lower frequencies. Frequencies in one band do not interact with those in another band, and our auditory system can either attend to all the bands at once or select certain bands to pay attention to, such as the bands where the frequencies of speech occur. Within a given band, however, frequencies at significantly greater volumes tend to obscure frequencies at lower volumes, and lower frequencies tend to mask higher ones. Within each band, therefore, masked information may be discarded.
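
The toy Python sketch below (using NumPy) conveys the flavor of the idea, though it is nothing like the actual psychoacoustic model an MP3 encoder uses: a block of samples is split into frequency components, the components are grouped into bands, and anything sitting far below the loudest component in its band is discarded. The band edges and the 30-decibel threshold are illustrative assumptions.

import numpy as np

def mask_block(block, bands=32, threshold_db=30.0):
    # Split the block into frequency components and group them into bands.
    spectrum = np.fft.rfft(block)
    magnitude = np.abs(spectrum)
    edges = np.linspace(0, len(spectrum), bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi <= lo:
            continue
        peak = magnitude[lo:hi].max()
        # Discard components more than threshold_db below the band's loudest one.
        quiet = magnitude[lo:hi] < peak * 10 ** (-threshold_db / 20)
        spectrum[lo:hi][quiet] = 0.0
    return np.fft.irfft(spectrum, n=len(block))

# A loud tone and a nearby tone 60 dB quieter, both chosen to fit the block exactly.
N, rate = 1024, 44_100
t = np.arange(N) / rate
loud, quiet = 23 * rate / N, 25 * rate / N        # roughly 990 Hz and 1077 Hz
block = np.sin(2 * np.pi * loud * t) + 0.001 * np.sin(2 * np.pi * quiet * t)

masked = mask_block(block)
print(f"largest change after masking: {np.max(np.abs(block - masked)):.4f}")

The reconstructed block differs from the original only by the inaudibly quiet tone, which therefore never needs to be stored.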

By far the most familiar form of audio encoding (lossy compression) in use today -- and the basis of both a huge industry and an even bigger illegal music trading network -- is MP3, which is short for MPEG Audio Layer 3 (not, as is often assumed, "MPEG-3"). MPEG is a set of industry standards created and managed by the Moving Picture Experts Group (MPEG), a working group of ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) in charge of the development of standards for the coded representation of digital audio and video. Established in 1988, the MPEG group has produced MPEG-1, the standard on which Video CD and MP3 are based; MPEG-2, the standard for digital television set-top boxes and DVDs; MPEG-4, the standard for multimedia on the fixed and mobile web; and MPEG-7, the standard for description and search of audio and visual content. The most important differences between these standards are data rate and application: MPEG-1 has data rates on the order of 1.5 Mbit/s, MPEG-2 around 10 Mbit/s, and MPEG-4 the lowest, at around 64 Kbit/s.

The original MPEG-1 standard was divided into three parts: Part 1, system aspects; Part 2, video compression; Part 3, audio compression. The audio part in turn defines three layers of increasing sophistication. MPEG Audio Layer 3, or MP3 as it is now widely known, was developed in 1992 by the German Fraunhofer Institute, and is part of both the MPEG-1 and MPEG-2 specifications. It achieves a spectacular compression of a sampled audio wave, by a factor ranging from 8 to 12, depending on the source. This means that the 10 MB of storage capacity needed to encode 1 minute of hi-fi music on a compact disk is reduced to roughly 1 MB on a computer hard drive. Not surprisingly, the MP3 process is patented, with Thomson Multimedia holding the patents in the USA and Germany, although that has not prevented a proliferation of MP3 players being available for free download on the Internet.

MP3 divides the frequency range into 32 bands. The component of the input signal (the sampled wave) in each of those bands is then subjected to a modified discrete cosine transformation that separates it into a further 18 constituents, generating a total of 576 individual frequency bands. It is within those bands that redundant (i.e., masked) components are removed. The resulting signal is then compressed further by Huffman coding, a technique familiar to computer scientists, which represents frequently occurring values by shorter codes than those used for less frequently occurring values. (For instance, it would be highly wasteful to use the default 141,120 bits of the sampled wave to encode a 1/10 second silence in a song.)
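
Huffman coding itself is easy to sketch in a few lines of Python; the code below builds a prefix code from a block of sample values (the tables MP3 actually uses are fixed in the standard and far more elaborate). Frequent values, such as the zeros of digital silence, end up with codewords only a bit or two long.

import heapq
from collections import Counter

def huffman_code(values):
    # Build a prefix code: repeatedly merge the two least frequent groups,
    # prefixing '0' to one side's codewords and '1' to the other's.
    counts = Counter(values)
    heap = [(n, i, {v: ""}) for i, (v, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {v: "0" + w for v, w in left.items()}
        merged.update({v: "1" + w for v, w in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Mostly silence (zeros), with a sprinkling of other sample values.
samples = [0] * 900 + [1] * 60 + [-1] * 30 + [5] * 10
code = huffman_code(samples)
coded_bits = sum(len(code[v]) for v in samples)
print(code)                                   # the zero gets a 1-bit codeword
print(coded_bits, "bits, versus", 16 * len(samples), "bits at 16 bits per sample")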

With consumer electronics stores offering new MP3 players every few months, and with millions of PC owners swapping music files illegally (as well as occasionally downloading them legitimately), to say nothing of the huge numbers of musical greetings that will be zapped across Cyberspace on Valentine's Day, the modern music industry is clearly built on mathematics as much as anything else. One wonders what Joseph Fourier would have made of today's applications of his original mathematical analysis of waves.


Devlin's Angle is updated at the beginning of each month.
Mathematician Keith Devlin (devlin@csli.stanford.edu) is the Executive Director of the Center for the Study of Language and Information at Stanford University and "The Math Guy" on NPR's Weekend Edition. His latest book is The Math Gene: How Mathematical Thinking Evolved and Why Numbers Are Like Gossip, published by Basic Books.