25th February 2009, 06:30   #7
Koorogi (Registered User, Join Date: Feb 2009, Posts: 8)
Lossy audio (and video) codecs typically work in the frequency domain, because it's easier there to take advantage of several features and limits of human perception.

Audio codecs typically use a modified discrete cosine transform (MDCT). This tells them how much of each frequency is present in a series of very short, overlapping time slices. Within each time slice, they can decide which frequency detail is important to keep. Human hearing has masking effects: we are less sensitive to quiet sounds (especially ones of similar frequency) that occur just before, just after, or at the same time as a louder sound. Audio codecs take advantage of this to throw out information that they don't think will be noticed.
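If it helps to see it, here's a minimal NumPy sketch of one MDCT analysis block. It's a direct evaluation of the MDCT sum with a sine window; the function name and the tiny block size are just for illustration, and real codecs use fast FFT-based algorithms instead of this direct form:

[code]
import numpy as np

def mdct(block):
    # One analysis block: 2N input samples -> N frequency coefficients.
    two_n = len(block)
    half = two_n // 2              # N
    n = np.arange(two_n)
    k = np.arange(half)
    # Sine window; with 50% overlap it satisfies the Princen-Bradley
    # condition, so overlap-added inverse transforms reconstruct the
    # input exactly (before any quantization).
    window = np.sin(np.pi / two_n * (n + 0.5))
    basis = np.cos(np.pi / half
                   * (n[None, :] + 0.5 + half / 2)
                   * (k[:, None] + 0.5))
    return basis @ (window * block)

# 2N = 8 samples in, N = 4 coefficients out; the next block would
# start N = 4 samples later, overlapping this one by half.
coeffs = mdct(np.random.randn(8))
print(coeffs.shape)   # (4,)
[/code]

Each call consumes 2N samples but produces only N coefficients, and consecutive blocks overlap by N samples - that's where the overlapping time slices come from.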

This process can be done at any bit depth or sampling rate. Increasing the bit depth increases the precision available (finer changes in amplitude can be recorded). Of course, the encoder may still throw that extra precision out under bitrate pressure in favor of more important data. I imagine bit depths higher than 16 bits are mostly useful for studio work, where you shouldn't be using lossy compression anyway. A higher sampling rate, however, matters more. Increasing the sampling rate does two things. First, it extends the frequency range that can be represented: the Nyquist limit states that the sampling rate must be at least twice the bandwidth of the signal, so if your signal contains frequencies from 20 Hz to 20 kHz, you need roughly a 40 kHz sampling rate to reproduce it.
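You can check the Nyquist limit numerically in a few lines (a toy demonstration, not codec code; the 40 kHz rate and the two tone frequencies are just chosen to match the numbers above):

[code]
import numpy as np

fs = 40_000                        # sampling rate (Hz)
t = np.arange(64) / fs             # 64 sampling instants

# 30 kHz is above the 20 kHz Nyquist frequency for fs = 40 kHz, so it
# "folds" back to 40 kHz - 30 kHz = 10 kHz: both tones produce exactly
# the same samples.
tone_30k = np.cos(2 * np.pi * 30_000 * t)
tone_10k = np.cos(2 * np.pi * 10_000 * t)
print(np.allclose(tone_30k, tone_10k))   # True
[/code]

Once sampled, the two tones are indistinguishable, which is why anything above half the sampling rate has to be filtered out before sampling rather than recorded.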

The other thing it does is give the encoder more accurate frequency information. The discrete cosine transform and other Fourier-based transforms have the problem that the frequency information they produce is uniformly spaced across the spectrum, but that doesn't match how we see and hear. Every octave sounds like the same "distance" to us, yet each octave is a doubling in frequency, so we are more sensitive to changes between low frequencies, and frequency resolution matters most there. But to get better frequency resolution, the transform would have to operate on a larger block of data at a time, which leads to other problems: latency, higher memory and processing requirements on the decoder, and a corresponding loss of temporal resolution that makes some properties of hearing harder to exploit. These are exactly the problems that motivated the development and use of wavelets in codecs like JPEG2000 and Snow.
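The mismatch between uniform bins and octave-based hearing is easy to put numbers on (the 44.1 kHz rate and 1024-sample block here are illustrative choices, not anything a particular codec mandates):

[code]
fs = 44_100          # sampling rate (Hz)
n_fft = 1024         # analysis block length in samples
bin_hz = fs / n_fft  # uniform bin spacing: about 43 Hz per bin

# How many uniformly spaced bins cover one octave at the low, middle,
# and high ends of the audible range:
for lo, hi in [(55, 110), (440, 880), (10_000, 20_000)]:
    print(f"{lo}-{hi} Hz: {(hi - lo) / bin_hz:.1f} bins")
# Prints roughly 1.3, 10.2, and 232.2 bins respectively.
[/code]

The lowest octave gets barely one bin while the top one gets over two hundred - the opposite of where our hearing wants the resolution - and doubling the block size to fix the low end halves the time resolution everywhere.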