
A Survey of Audio Coders for Electronic-Art Music

by Eldad Tsabary

Coded digital audio is undeniably a necessity in the iPod age. Not long after switching from records to compact discs for our personal music libraries, we have grown accustomed to the lower fidelity of coded audio in exchange for portability, accessibility, savings in storage space, and the ability to webcast and podcast, among other advantages. In the last decade or so, millions of music lovers worldwide have transformed their CD, record, and cassette collections into digital data on their computer hard drives (often not with the best choice of coding); low-bitrate coders such as RealAudio made audio webcasts possible as early as the era of 14.4 kbps modems; and satellite radio services such as XM and Sirius let us tune in to low-bitrate audio streams across the continent.

For the electronic musician, audio coding plays additional important roles. (1) Compressed audio files are often used as sound sources: field recordings made on a minidisc or other device, sounds downloaded from the internet, or recordings shared with other musicians via email, peer-to-peer networks, or BitTorrent. (2) Some musicians use the web for live collaboration via internet telephony and messaging services (such as Skype and Windows Messenger). Musicians also use coded sound to (3) post their music online, (4) podcast it, and (5) send it to competitions or promote it online in other ways.

This survey examines several popular audio codecs and compares their accuracy and resulting sound quality. There are numerous audio compression formats, some lossy and some lossless, each approaching the task of reducing a digital audio file's size differently. Lossless formats such as the Apple Lossless Audio Codec (ALAC) or the Free Lossless Audio Codec (FLAC) stay true to the original PCM data and can be assessed primarily by the efficiency and performance of their encoders and decoders. Lossy formats must additionally be evaluated for fidelity and accuracy, which depend mainly on the format's psychoacoustic model, time-to-frequency mapping, and bit allocation.

Most psychoacoustic models used in lossy compression are optimized for tonal material: sounds with few, if any, abrupt spectral or dynamic changes, whose energy lies mostly between 0 and 5 kHz, with the rest of the spectrum occupied mainly by related partials rather than independent content. These formats also rely on time-to-frequency mapping, which sacrifices temporal resolution for frequency resolution and therefore codes short attacks and transients inefficiently. These characteristics make lossy audio coding far from ideal for electronic art-music and sound art.
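The trade-off described above can be made concrete with a little arithmetic. The sketch below assumes a plain block transform of N samples at 44.1 kHz: each block spans N/fs seconds, and its frequency bins are fs/N Hz apart. Real codecs use overlapping, windowed transforms and switch block sizes, so the figures are only indicative, and the block lengths 256 and 2048 are illustrative choices rather than the settings of any particular codec.

```python
SAMPLE_RATE = 44_100  # Hz, the CD sampling rate used throughout this survey


def block_resolution(n_samples: int, sample_rate: int = SAMPLE_RATE) -> tuple[float, float]:
    """Return (frequency-bin spacing in Hz, block duration in ms) for an n-sample block."""
    bin_spacing_hz = sample_rate / n_samples
    duration_ms = 1000.0 * n_samples / sample_rate
    return bin_spacing_hz, duration_ms


if __name__ == "__main__":
    for n in (256, 2048):  # illustrative short and long block lengths
        spacing, duration = block_resolution(n)
        print(f"N = {n:4d}: bins every {spacing:6.2f} Hz, block lasts {duration:5.2f} ms")
```

A 2048-sample block separates partials about 21.5 Hz apart but smears everything within roughly 46 ms; a 256-sample block narrows that window to about 5.8 ms at the price of 172 Hz bins.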

This survey explores and evaluates some of the losses in several popular lossy codecs: mp3, m4a (AAC), mpc (Musepack), oma (Sony ATRAC3plus), ogg (Vorbis), and wma (Microsoft). All sounds are compared at a low bitrate, 128 kbps (stereo, CBR or ABR), and a high one, 256 kbps. The choice of these two is not accidental. 128 kbps (also known as near-CD quality) is presently the bitrate of files purchased on iTunes in the m4p format (an m4a with DRM protection), and it is also quite common in audio streaming. This low data rate also exposes the deficiencies of the psychoacoustic models quite effectively; to accentuate them, the sound content in this survey is designed specifically to attack the codecs' vulnerabilities. The higher bitrate, 256 kbps, is arguably sufficient for audio coding while still saving a significant amount of storage space. The usefulness of data rates of 320 kbps and higher is questionable, because lossless formats average bitrates that are not much higher (about 450 kbps) without any data loss (and therefore without the need for a psychoacoustic model). Additionally, 256 kbps is the default bitrate of Sony Hi-MD minidisc field recorders in the ATRAC3plus format, one of the formats examined in this survey.
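To put these bitrates in perspective, the sketch below computes the uncompressed CD data rate (44.1 kHz × 16 bits × 2 channels) and the storage each rate needs per minute. The 450 kbps lossless figure is the rough average given above, and the decimal units (1 kbps = 1000 bit/s, 1 MB = 10^6 bytes) are a convention assumed here.

```python
def pcm_kbps(sample_rate: int, bit_depth: int, channels: int) -> float:
    """Uncompressed PCM data rate in kbps (1 kbps = 1000 bits per second)."""
    return sample_rate * bit_depth * channels / 1000


def megabytes_per_minute(kbps: float) -> float:
    """Storage needed for one minute of audio at a given rate (1 MB = 10**6 bytes)."""
    return kbps * 1000 * 60 / 8 / 1_000_000


if __name__ == "__main__":
    cd = pcm_kbps(44_100, 16, 2)  # CD-quality stereo PCM
    print(f"PCM: {cd:.1f} kbps, {megabytes_per_minute(cd):.2f} MB/min")
    # 450 kbps is the approximate lossless average quoted in the text.
    for label, rate in (("lossless (approx.)", 450), ("lossy", 256), ("lossy", 128)):
        print(f"{label} {rate} kbps: {cd / rate:.1f}:1 vs PCM, "
              f"{megabytes_per_minute(rate):.2f} MB/min")
```

The PCM rate works out to 1411.2 kbps, the source of the 1411 kbps figure given for the uncompressed wav files in the tests; 128 kbps is thus roughly an 11:1 reduction and 256 kbps roughly 5.5:1.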

Since lossy compression relies on transforming the signal from the time domain to the frequency domain, trading one resolution for the other, this survey examines each domain individually.
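This trade-off also explains the pre-echo reported in the time-domain test. In transform coding, quantization noise introduced in the frequency domain is spread over the entire block once it is transformed back, including any silence before an attack that falls late in the block. The sketch below demonstrates this under heavy simplifying assumptions: a naive DFT on rectangular, non-overlapping blocks stands in for the windowed, overlapping transforms used by real codecs, and uniform rounding of every coefficient (step 3.0, an arbitrary value) stands in for psychoacoustic bit allocation.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2); adequate for short illustrative blocks."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); keeps the real part because the test signals are real."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def quantize(z, step):
    """Round a complex coefficient to a uniform grid (a crude stand-in for bit allocation)."""
    return complex(step * round(z.real / step), step * round(z.imag / step))


def code_in_blocks(x, block_len, step):
    """Transform, quantize, and inverse-transform x in non-overlapping blocks."""
    out = []
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        out.extend(idft([quantize(c, step) for c in dft(block)]))
    return out


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    n, onset, step = 256, 192, 3.0
    signal = [0.0] * onset + [1.0] * (n - onset)  # silence, then a sudden attack
    for block_len in (256, 64):
        coded = code_in_blocks(signal, block_len, step)
        pre_echo = rms(coded[:onset])  # error in the silence before the attack
        print(f"{block_len:3d}-sample blocks: pre-echo RMS = {pre_echo:.4f}")
```

With one 256-sample block, error appears in the 192 samples of silence before the attack; with 64-sample blocks, the silent blocks are reproduced exactly and the error stays confined to the block containing the attack. This is why codecs switch to shorter blocks around transients.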

Test 1 — Time Domain

The first test examines the transient responses of the different coders. The coders are given the simple task of coding a 0.5 Hz square wave, which switches polarity twice every cycle (therefore sounding a pulse every second). In the original PCM file the switch is immediate (one sample), and the coders are examined for (1) whether they retain the sharpness of the switch and (2) whether they distort the waveform and add noise.
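A test signal of this kind can be generated with Python's standard library, as in the sketch below. The amplitude (16 000 of a possible 32 767, about 6 dB below full scale) and the mono format are assumptions, since the article does not state the level or channel layout used; the in-memory WAV writer simply shows how the samples would be packed as 16-bit PCM.

```python
import io
import struct
import wave

SAMPLE_RATE = 44_100


def square_wave(freq_hz: float, seconds: float, amplitude: int = 16_000,
                sample_rate: int = SAMPLE_RATE) -> list[int]:
    """16-bit square wave whose polarity flips instantly, between two adjacent samples."""
    half_period = int(round(sample_rate / (2 * freq_hz)))  # 44 100 samples at 0.5 Hz
    total = int(round(seconds * sample_rate))
    return [amplitude if (n // half_period) % 2 == 0 else -amplitude for n in range(total)]


def to_wav_bytes(samples: list[int], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Pack mono 16-bit samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


if __name__ == "__main__":
    pulses = square_wave(0.5, 4.0)
    print(f"{len(pulses)} samples, {len(to_wav_bytes(pulses))} bytes as WAV")
```

At 0.5 Hz the half period is exactly 44 100 samples, so the polarity flips once per second, each flip happening between two adjacent samples.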

The images below are at single-sample resolution and are not always representative of fidelity; they show only the polarity switch, not necessarily the pulse in its totality, the overall sound, or the added noise you will find when listening to these examples. Click the images to listen, and read the comments in the right-hand column. Despite the lines connecting individual samples in these images, in reality there is no such gradation between the amplitude values of adjacent samples; nevertheless, some sloping occurs in the conversion to acoustic energy at the loudspeaker.

To avoid player and format compatibility problems, all audio examples were reconverted to wav format and therefore should be playable on any system.

All coders used here are the latest available versions as of May 2007.

It is preferable to listen to these examples on a pro or semi-pro system with a wide, flat frequency response. If you are using headphones, please watch the levels!

In some cases it may take a few listens to isolate the noise; however, once heard, it is almost impossible to “unhear” it.

Codec	Image	Notes
wav file (PCM), 44.1 kHz, 16 bit	(Uncompressed) 1411 kbps	This is the uncompressed original signal; listen to the immediacy and dryness of the pulses and compare them with the other examples.
flac	(Lossless) 9 kbps	This is an exact replica of the original sound. And yes, 9 kbps is correct; a very simple signal such as this one does not require much storage space to be accurately described in a lossless coder.
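The very low flac bitrate can be illustrated with a general-purpose lossless compressor. The sketch below uses Python's zlib (DEFLATE), which is not how FLAC works (FLAC uses linear prediction and Rice coding), so the exact ratios will differ; it only shows the principle that long runs of identical samples compress to almost nothing while random samples barely compress at all. The 4-second mono signals, the amplitude, and the random seed are arbitrary choices for illustration.

```python
import random
import struct
import zlib


def compressed_fraction(samples: list[int]) -> float:
    """Size of the zlib-compressed 16-bit PCM data relative to the raw data."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    fs = 44_100
    # One-sample polarity flips every second, as in the test signal (4 s, mono).
    square = [16_000 if (n // fs) % 2 == 0 else -16_000 for n in range(4 * fs)]
    rng = random.Random(0)  # fixed seed so the result is repeatable
    noise = [rng.randint(-32_768, 32_767) for _ in range(4 * fs)]
    print(f"square wave: {compressed_fraction(square):.4f} of original size")
    print(f"white noise: {compressed_fraction(noise):.4f} of original size")
```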

Codec	128 kbps	256 kbps	Notes
mp3, LAME, CBR			In both cases the pre-echo is very noticeable, although less so at 256 kbps (instead of a “tk” onset you hear “ft”); it is also very inconsistent, sometimes longer than at other times. Also listen to the spurious tones, most prominently at 1378 Hz (around F6), which actually come out stronger at 256 kbps.
mp3, LAME, ABR			Average-bitrate coding shows no advantage over constant-bitrate coding here; the pre-echo and added tones are just as noticeable.
mp3, Blade, CBR			The transient response of this encoder is even worse than that of LAME. However, it introduces no spurious tones at 128 kbps; the 256 kbps version introduces a very weak (nearly inaudible), extremely high-frequency tone.
m4a, ABR			This format does quite a good job. The pre- and post-echoes are quite insignificant in both the 128 and 256 kbps versions. The added noise is spread more widely and granulated, which makes it less noticeable than the noise in mp3 coding. Listen carefully and you will hear, in both cases, some very weak tones and a general weak rattle in the background.
ogg VORBIS, ABR			Despite looking good in the amplitude timeline, this format does not deal well with the low-frequency square wave. At both bitrates the pre-echo is long and loud, and the post-echo has pitch. There is also a very strong spurious tone at 43 Hz (around F1), among a few others.
oma (SONY), ATRAC3plus			Unfortunately, here too the promising image is misleading. The pre-echo at both bitrates is long and loud (hear the “whoosh” before each pulse). However, this format is by far the cleanest in terms of sustained noise.
wma (Microsoft), WMA 10 Pro, CBR			This format does an excellent job! The pre-echo is tolerable and there is virtually no added sustained noise. The main weakness here, although serious only in the low-bitrate version, is a slight post-echo, heard as a slight reverberation that changes the spatial quality of the pulse (more depth than the original signal).
mpc (Musepack), VBR	112-152 kbps	232-268 kbps	This format is also quite good in terms of pre- and post-echoes, but it adds some spurious tones, most prominently at 1335 Hz (around E6), although not as strongly as mp3 or ogg.
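The pitch labels given for the spurious tones (1378 Hz around F6, 1335 Hz around E6, 43 Hz around F1) can be checked with the standard equal-temperament conversion. The sketch assumes A4 = 440 Hz and MIDI note numbering, in which A4 is note 69 and middle C (C4) is note 60.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(freq_hz: float, a4_hz: float = 440.0) -> tuple[str, float]:
    """Nearest equal-tempered note name and the offset from it in cents."""
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)  # fractional MIDI note number
    nearest = round(midi)
    name = f"{NOTE_NAMES[nearest % 12]}{nearest // 12 - 1}"
    return name, 100 * (midi - nearest)


if __name__ == "__main__":
    for f in (1378, 1335, 43):  # spurious tones reported in the table
        name, cents = nearest_note(f)
        print(f"{f} Hz is about {name} ({cents:+.0f} cents)")
```

All three labels hold, each within about a quarter of a semitone (26 cents or less).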

Additional points to consider:

The wav format itself (PCM) is not error-free; digitization introduces distortion, primarily through amplitude quantization (round-off noise), clipping, LSB oscillation, and aliasing. However, since this format is based exclusively on the time domain and does not use a psychoacoustic masking model, its errors are much more predictable and easier to handle than those of the other coding methods. When possible, use 24-bit resolution or higher to reduce round-off and LSB noise (or to push them outside the audible dynamic and frequency ranges). Presently, among the lossy formats surveyed, only Microsoft WMA 10 Pro supports 24-bit coding; lossless formats support it as well.
When the audio signal is static in nature, simple, or contains long silences, lossless coding may be advantageous even in terms of storage space. The lossless flac file in this example is almost 30 times smaller than the (very) lossy 256 kbps mp3 file.
The best performer in this section of the survey appears to be Microsoft's WMA 10 Pro at 256 kbps, with its 128 kbps counterpart in second place. The 128 kbps m4a format (the iTunes format) is also quite adequate for the low bitrate; its 256 kbps version, however, offers no significant improvement in this example. The performance of all the other codecs is less than adequate, with ogg being the worst in this instance. These results may not represent other signal types.
There are many other issues to consider when choosing a codec, among them player compatibility and availability. Mp3, for instance, has been by far the most widely used format, while clearly not the best choice for fidelity. This survey focuses on reproductive accuracy alone.
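The benefit of 24-bit resolution can be quantified with the textbook rule for the quantization noise of a full-scale sine, SNR ≈ 6.02·N + 1.76 dB for N bits. The sketch below computes that figure and checks it by rounding a 997 Hz sine to N bits; the test frequency, the 0.999 amplitude, and the use of plain rounding without dither are illustrative assumptions.

```python
import math


def ideal_snr_db(bits: int) -> float:
    """Theoretical SNR of a full-scale sine quantized to `bits` bits: 6.02*bits + 1.76 dB."""
    return 20 * math.log10(2) * bits + 10 * math.log10(1.5)


def measured_snr_db(bits: int, freq_hz: float = 997.0, sample_rate: int = 44_100,
                    amplitude: float = 0.999) -> float:
    """Quantize one second of a sine by rounding and measure signal-to-error power."""
    scale = 2 ** (bits - 1)  # samples span [-1, 1) in steps of 1/scale
    signal_power = error_power = 0.0
    for n in range(sample_rate):
        x = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        q = round(x * scale) / scale  # nearest quantization level
        signal_power += x * x
        error_power += (q - x) ** 2
    return 10 * math.log10(signal_power / error_power)


if __name__ == "__main__":
    for bits in (16, 24):
        print(f"{bits}-bit: ideal {ideal_snr_db(bits):.1f} dB, "
              f"measured {measured_snr_db(bits):.1f} dB")
```

Going from 16 to 24 bits lowers the round-off noise floor by about 48 dB, from roughly 98 dB to roughly 146 dB below a full-scale sine.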

Test 2 — Frequency Domain

In the second test the coders are given the task of coding a white-noise signal passed through a digital high-pass filter (HPF) with a 10 kHz cutoff frequency and a moderate slope; this shows how the coders cope with a spectrally irregular signal in their weaker areas.
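A test signal of this kind can be approximated as follows. The article does not state the filter type or order (only “a moderate slope”), so this sketch assumes a first-order (6 dB per octave) high-pass designed with the bilinear transform, fed with Gaussian white noise at an arbitrary level and seed; a steeper filter would attenuate the low end more.

```python
import math
import random


def highpass_first_order(x, cutoff_hz, sample_rate=44_100):
    """First-order high-pass (6 dB/octave) designed with the bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / sample_rate)  # pre-warped cutoff
    b0 = 1.0 / (1.0 + k)
    b1 = -b0
    a1 = (k - 1.0) / (k + 1.0)
    y, x_prev, y_prev = [], 0.0, 0.0
    for sample in x:
        out = b0 * sample + b1 * x_prev - a1 * y_prev
        y.append(out)
        x_prev, y_prev = sample, out
    return y


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def sine_gain(freq_hz, cutoff_hz=10_000.0, sample_rate=44_100):
    """RMS gain for a 0.2 s sine, measured on the second half (after start-up)."""
    n = sample_rate // 5
    x = [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    y = highpass_first_order(x, cutoff_hz, sample_rate)
    return rms(y[n // 2:]) / rms(x[n // 2:])


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed: repeatable Gaussian white noise
    noise = [rng.gauss(0.0, 0.25) for _ in range(44_100)]
    filtered = highpass_first_order(noise, 10_000.0)
    print(f"noise RMS {rms(noise):.3f} -> {rms(filtered):.3f} after the HPF")
    for f in (100, 1_000, 10_000, 18_000):
        print(f"{f:6d} Hz sine: gain {sine_gain(f):.3f}")
```

The gain is about 0.707 (-3 dB) at the 10 kHz cutoff, close to 1 near the top of the spectrum, and below 0.01 at 100 Hz.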

This time the images are linear-scale spectrographs (generated in Sonic Visualiser) showing 0–22.05 kHz, the full range up to the Nyquist frequency at 44.1 kHz. The transparent window on the right-hand side shows the highest coded frequency bin.
Click the images to listen to the audio examples in wav format.
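The highest-bin readout can be imitated with a simple spectral measurement: compute a frame's magnitude spectrum and report the highest bin whose level is within some threshold of the strongest bin. The -60 dB threshold and the single unwindowed frame are assumptions of this sketch, not Sonic Visualiser's method; a real measurement would average windowed frames over the whole file. The two-tone frame in the usage example is hypothetical.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Magnitudes of DFT bins 0..n/2 (naive O(n^2) DFT, adequate for short frames)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def highest_bin_hz(x, sample_rate=44_100, threshold_db=-60.0):
    """Frequency of the highest bin whose level is within `threshold_db` of the strongest bin."""
    mags = magnitude_spectrum(x)
    floor = max(mags) * 10 ** (threshold_db / 20)
    top = max(k for k, m in enumerate(mags) if m > floor)
    return top * sample_rate / len(x)


if __name__ == "__main__":
    fs, n = 44_100, 441  # 441-sample frame: bins fall exactly every 100 Hz
    # Hypothetical "coded" frame: content at 1 kHz and 15 kHz, nothing above.
    frame = [math.sin(2 * math.pi * 1_000 * t / fs) + math.sin(2 * math.pi * 15_000 * t / fs)
             for t in range(n)]
    print(f"highest coded bin: {highest_bin_hz(frame, fs):.0f} Hz")
```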

Codec	Image	Notes
wav file (PCM), 44.1 kHz, 16 bit	(Uncompressed) 1411 kbps	This is the original uncompressed file.
flac	(Lossless) 1165 kbps	This is an exact replica of the original sound, so the image and sound are identical. The highly random content requires a high bitrate and makes compression almost ineffective (the flac file is 83% of the original size).

Codec	128 kbps, fast encoding	128 kbps, high-quality encoding	Notes
mp3, Fraunhofer (encoder speed / quality)			The Fraunhofer coder offers both fast encoding and high-quality encoding; while the fast option may be useful for streaming, use the high-quality option when generating a permanent file. However, both versions strongly distort the original signal. In addition to cutting off all frequencies above 15.5 kHz, they leave very visible holes in the spectral image. Listen to the “metallic” aliasing noise and to the timbral irregularities in the highest frequencies.

Note: as higher frequencies tend to sound more spatially centered, you may perceptually scan the spectrum by narrowing your spatial focus (use headphones).

Codec	128 kbps	256 kbps	Notes
mp3, LAME, CBR			The LAME coder does a better job than the Fraunhofer coder. The distortion is much less significant, although still audible in both the 128 and 256 kbps versions; listen carefully to timbral variations in the higher areas of the spectrum. The overall timbre of the lower-bitrate version also sounds significantly less bright than the original signal.
mp3, LAME, ABR			The average-bitrate method does not give better results here, as its usefulness relies on varying the bitrate according to the complexity of the content. In this case the sound is uniformly complex throughout the example, so the bitrate would not vary at all.
mp3, Blade, CBR			At 128 kbps the signal sounds surprisingly adequate. There are some audible instabilities in the high frequencies and an overall drop in “brightness” because the coder cuts off frequencies above 14 kHz (very low!), but overall the loss of high frequencies is worth the reduction in audible artifacts compared with LAME. The 256 kbps version is very similar to that of LAME but with slightly more audible artifacts.
m4a, ABR			This coder does an excellent job at both the low and high bitrates. In both cases the distortion is practically inaudible, and the overall timbre sounds identical to the source; a good ear may pick up very subtle timbral variations in the highest area of the spectrum. The main advantage of the higher bitrate is its full-spectrum coverage; however, cutting off frequencies above 17.3 kHz (at 128 kbps) is not a problem for most users.
ogg VORBIS, ABR			The 128 kbps ogg file covers a wider area of the spectrum than its m4a counterpart (up to almost 19 kHz) but with less accuracy. The spectrograph shows a less gradual decrease in intensity towards the lower frequencies, a distortion of the HPF's original slope, and the division into frequency bands is quite coarse. A low-level aliasing distortion makes the overall sound a bit more “metallic” and less pleasant. The 256 kbps version is extremely close to the original, but a careful ear will notice a subtle timbral difference, the original signal sounding a little “fuller”.
oma (SONY), ATRAC3plus			The 128 kbps Sony format cuts off at 15.3 kHz, which, as with the Blade mp3 encoder, is perhaps too low for professional use; the 256 kbps version (the minidisc recording format) covers the entire audible spectrum. Overall this format deals well with this complex signal; at both bitrates, aliasing and other forms of distortion are not a problem. The lower bitrate is less “bright” due to the loss of frequencies above 15.3 kHz; the 256 kbps version is nearly perfect.
wma (Microsoft), WMA 10 Pro, CBR			Like the Sony format, this Microsoft coder cuts off at 15.3 kHz at 128 kbps and covers the full spectrum at 256 kbps. Unlike ATRAC3plus, however, it suffers from significant aliasing at both bitrates, as well as a spatial distortion caused by reverberation. While this format performed superbly in the time domain, it adds very audible artifacts here.
mpc (Musepack), VBR	112-152 kbps	232-268 kbps	Both settings, the higher bitrate to a lesser extent, introduce some timbral instabilities in the high-frequency range and a very high-pitched, unpleasant metallic ring.
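The reasoning in the LAME ABR note can be illustrated with a deliberately simplified, hypothetical allocation rule: give each frame a share of the bit budget proportional to a per-frame “complexity” score while holding the average fixed. The complexity scores and the proportional rule are assumptions for illustration only; they are not how LAME or any other encoder actually allocates bits.

```python
def abr_allocation(complexities: list[float], average_kbps: float) -> list[float]:
    """Hypothetical ABR rule: per-frame bitrate proportional to complexity, fixed mean.

    Real encoders use far more elaborate rules (including minimum and maximum rates);
    this toy only shows why uniform content gets a uniform bitrate.
    """
    mean_complexity = sum(complexities) / len(complexities)
    return [average_kbps * c / mean_complexity for c in complexities]


if __name__ == "__main__":
    uniform_noise = [1.0] * 6  # filtered noise: equally complex in every frame
    varied_music = [0.2, 0.5, 1.8, 2.5, 0.6, 0.4]  # hypothetical complexity scores
    print([round(r) for r in abr_allocation(uniform_noise, 128)])
    print([round(r) for r in abr_allocation(varied_music, 128)])
```

When every frame is equally complex, as with the filtered noise in this test, every frame receives exactly the average rate, so ABR reduces to CBR.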

Additional points to consider:

There are several mp3 encoders, primarily LAME, Fraunhofer, Xing, and BladeEnc; the first two are the most widely used, Fraunhofer because it was first and LAME because it is free. This survey does not include any examples encoded with Xing, as it is the encoder with the most problems (hear a Xing example at 128 kbps).
The best performer in this test is the m4a codec (at both bitrates). Among the surveyed coders, it introduces the fewest audible artifacts and stays truest to the original timbre; a fairly close runner-up is Sony ATRAC3plus. Ogg Vorbis does a fine job at 256 kbps, but all the other coders in this test distort the signal and introduce significant audible artifacts. The mp3 format is the worst of all.
In all low-bitrate coded signals, the coders cut a chunk off the high frequencies. While cutting off 6 kHz of our hearing range in the Blade-encoded 128 kbps mp3 (taking 20 kHz as the upper limit of hearing) is a significant loss for the overall sound, it is not as big a loss as a linear spectrograph may suggest. Because pitch perception is logarithmic, a band of similar width at the bottom of the spectrum, 20–6000 Hz, spans more than 8 octaves, whereas the 6 kHz from 14 kHz to 20 kHz amounts to only about a tritone, in a region where our perceptual sensitivity to loudness and pitch is, at best, extremely limited.
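The comparison rests on the logarithmic relation between frequency and pitch: a band from f1 to f2 spans log2(f2/f1) octaves, or 12 times that in semitones. The sketch below applies it to the two bands mentioned above.

```python
import math


def octaves(f_low: float, f_high: float) -> float:
    """Width of a frequency band in octaves (pitch perception is roughly logarithmic)."""
    return math.log2(f_high / f_low)


if __name__ == "__main__":
    bands = {"20-6000 Hz": (20, 6_000), "14-20 kHz": (14_000, 20_000)}
    for label, (lo, hi) in bands.items():
        width = octaves(lo, hi)
        print(f"{label}: {width:.2f} octaves = {12 * width:.1f} semitones")
```

The first band spans about 8.2 octaves; the second spans about 6.2 semitones, roughly a tritone (6 semitones).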

Final Conclusions and Recommendations

Make lossless coding your first choice! If your sound is spectrally simple, static in nature, or has long periods of silence, lossless coding may even save you storage space when compared with lossy formats. On the other hand, if your sound is spectrally complex, includes many transients, or both, the loss in fidelity may be too great a price to pay for the space saving offered by lossy coding.
In instances where you must compress your sound, m4a (AAC) is your best bet among the coders surveyed here, at both low and high bitrates; its performance was extremely effective in both tests. Additionally, as the iTunes format, m4a is widely used, strongly supported, and frequently updated.
Mp3 is not among the strong performers of this survey, but its history, global player compatibility, and accessibility give it dominance. In many instances you will be asked to provide your works in mp3 format, so choose your encoder and settings very carefully. Among the surveyed encoders, LAME produced the best (least bad) results, with one exception: the low-bitrate encoding of a spectrally complex signal, which was perhaps a bit better with Blade. If you are forced to use 128 kbps, and especially if your sound is not rich in transients, you may be better off with Blade than with LAME. The only way to be certain, naturally, is to try them both.
This survey did not examine the effectiveness of VBR coding. As a general rule, allowing an encoder to vary the bitrate may prove useful for sounds with large dynamic and spectral ranges (such as classical music, experimental music, electronic art-music, sound art, etc.); however, this additional decision-making process may also introduce new problems. Nonetheless, most contemporary codecs support both CBR and VBR effectively, and some use VBR or ABR (variable-bitrate encoding with a preset average rate) by default, among them ogg Vorbis, m4a, and mpc.
When recording with a Hi-MD minidisc, use PCM recording when possible, especially if you are recording a signal with many transients. However, if you must record continuously for longer than 1.5 hours (the PCM time limit on a 1 GB disc), you may use ATRAC3plus at 256 kbps (Hi-SP), which is quite effective throughout the entire audible spectrum.
A quick look at the test results shows no apparent advantage to using the Sony, Musepack, or Vorbis formats over m4a. The Microsoft format showed a slight advantage over m4a in the time-domain test but paid a bigger penalty in the frequency-domain test. Nonetheless, if your signal consists mostly of transients with few sustained or reverberant sounds, you may find WMA 10 Pro advantageous.
These two tests attack the codecs' vulnerabilities in order to demonstrate, as generally and objectively as possible, how the codecs deal with the highly irregular content of electroacoustic music. However, each sound work may introduce its own “issues”, and your best option is to try out several coders and evaluate their performance in both the time and frequency domains.
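The 1.5-hour figure for PCM recording on a 1 GB Hi-MD disc follows from the data rate. The sketch below is a rough estimate that assumes the full nominal 10^9 bytes are usable and ignores file-system and format overhead, so real recording times will be somewhat shorter.

```python
def recording_hours(capacity_bytes: float, kbps: float) -> float:
    """Continuous recording time for a given capacity and data rate (no overhead)."""
    return capacity_bytes * 8 / (kbps * 1000) / 3600


if __name__ == "__main__":
    one_gb = 1e9  # nominal 1 GB Hi-MD disc; usable space is smaller in practice
    for label, rate in (("PCM", 1411.2), ("ATRAC3plus Hi-SP", 256)):
        print(f"{label} at {rate} kbps: {recording_hours(one_gb, rate):.1f} h")
```

PCM comes out at about 1.6 hours (about 94 minutes) on paper, consistent with the 1.5-hour limit mentioned above, while 256 kbps stretches the same disc several times longer.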


With thanks to Kevin Austin and Tim Ramsay for their useful suggestions.

20 May 2007

eContact!
