Hi and welcome to our special module on the MP3 audio encoder. MP3 is shorthand for MPEG Layer 3, and MPEG is shorthand for the Moving Picture Experts Group. What this all means is that at one point in the 90s a lot of experts got together and agreed on a set of standards for video and audio compression and encoding. MP3 turned out to be the most used format for digital audio storage, streaming and playback. Today a portable music device, thanks to MP3 encoding, can store up to 30,000 songs, which really means you can carry your entire music collection with you everywhere. So in this video we will look at the technology behind the success of MP3, and we will describe in detail how the MP3 encoder works. You will see how all the tools that you have learned in our DSP class, from the Fourier transform to filtering, from sampling to quantization, come together in this application. So how does the encoding and decoding process take place? Suppose you start with a discrete-time sound signal x[n]. This is processed by the encoder and converted into a binary string. The decoder will take that binary string and convert it back into a sound signal y[n]. The goal of the encoding and decoding chain is to reduce the memory required to store the sound waveform, and the real achievement of MP3 is its ability to greatly reduce the amount of data needed to encode a file, at a very reasonable tradeoff with respect to sound quality degradation. The data reduction is determined by looking at the amount of memory that is necessary to store the output of the encoder, and by comparing this quantity to the amount of memory needed to store the original signal in an uncompressed format. And remember that an uncoded, raw audio file requires quite a bit of storage.
For instance, if we sample at 48 kilohertz, which is the DVD standard, and we use 16 bits per sample, we will need about 12 megabytes to store a single minute of audio in stereo. On the other hand, a high-quality MP3 will require just 1.5 megabytes, which represents almost an order of magnitude of data reduction. To achieve this performance, the coding has to be done in a very clever way, and one of the key ingredients in MP3 is a model of the human auditory system. MP3 does not attempt to preserve the original waveform; rather, it focuses on coding the elements of the waveform that are most important to the way we listen to music and hear sounds. In particular, the distortion introduced by the encoder, the loss of information introduced by the encoding mechanism, is placed in parts of the spectrum of the original signal that we cannot hear. We will see that in more detail in just a minute. As we said, the origins of MP3 date back to the 90s, when the Moving Picture Experts Group, in short MPEG, was set up by the International Standards Organization to develop algorithms and standards for audio and video compression. The audio compression part of the standard, the MP3 protocol, had its origins in a set of compression algorithms that had been developed in the 80s by the Hannover Institute in Germany. We see a photo of the team here in this picture. The MP3 standard was quickly embraced by the industry, and this widespread acceptance is ultimately what decreed its success. Now let's try to understand how MP3 works using this simple block diagram. The input signal x[n] enters a bank of subband filters: thirty-two parallel filters that subdivide the input signal into 32 independent channels spanning the full spectral range of the input. Each channel is then quantized independently using a very clever method, and the quantized samples are then formatted and encoded into a continuous bit stream.
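As a quick sanity check on those figures, here is the arithmetic in a few lines of Python. The 192 kbit/s MP3 bitrate below is an assumption, chosen only because it is consistent with the roughly 1.5 megabyte figure quoted above:

```python
# Back-of-the-envelope storage comparison for one minute of stereo audio.
fs = 48_000          # DVD-standard sampling rate, in Hz
bits_per_sample = 16
channels = 2         # stereo
seconds = 60

raw_bits = fs * bits_per_sample * channels * seconds
raw_megabytes = raw_bits / 8 / 1e6
print(f"raw PCM: {raw_megabytes:.2f} MB per minute")   # about 11.5 MB

# A high-quality MP3 at an assumed 192 kbit/s:
mp3_megabytes = 192_000 * seconds / 8 / 1e6
print(f"MP3: {mp3_megabytes:.2f} MB per minute")       # about 1.4 MB
print(f"compression ratio: {raw_megabytes / mp3_megabytes:.1f}x")
```

So the 12 megabyte figure is really 11.52 megabytes rounded up, and the reduction is a factor of about eight.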
The quantization scheme is clever because the number of bits allocated to each subband depends on the perceptual importance of that subband with respect to the overall quality of the audio waveform. In other words, subbands that are deemed by the psychoacoustic model to be unimportant, or difficult to perceive, are allocated very few bits or no bits at all, whereas the most perceptually relevant subbands are allocated the bulk of the bit budget. The reason why we can safely allocate different amounts of bits to the different subbands is to be found in the so-called masking effect of the human auditory system. Suppose you have a sound with a strong sinusoidal component, as in this picture here. The blue line represents the spectrum of the sound, and the red dot indicates the strong sinusoidal component. When your ear listens to a sound like this, a masking effect takes place whereby frequency components in the vicinity of the dominant peak are not heard unless they are louder than a given masking threshold. In this figure, for example, the masking threshold is indicated by the red dotted line, and what it indicates is that anything in the spectrum that falls below the red line will not be heard, and therefore can be removed without any loss of perceptual quality. The masking effect is something that we experience every day. Imagine being in a perfectly quiet room, like in your home at night: you can even hear your wristwatch ticking. But of course you wouldn't be able to hear that noise in normal conditions during the day, when a lot of other auditory stimuli are reaching your ears. And yet, if you were to record the audio environment and analyze its spectrum, you would see that it still contains the information about your wristwatch ticking. The shape of the masking threshold is a function of the loudness and the frequency of the dominant tone, and it has been determined experimentally by running a lot of listening tests with human subjects.
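To make the idea concrete, here is a toy Python sketch of masking-based pruning. The threshold shape used here, 20 dB below the masker at its own frequency and falling off 30 dB per octave, is an invented stand-in for the experimentally measured curves, and the spectrum is a made-up list of tones:

```python
import math

# Toy spectrum: a few tones with their frequencies (Hz) and levels (dB).
freqs  = [400, 950, 1000, 1050, 1200, 4000]
levels = [20, 35, 80, 40, 25, 30]

# The dominant tone acts as the masker.
peak_l = max(levels)
peak_f = freqs[levels.index(peak_l)]

def masked(f, l):
    # Assumed threshold: 20 dB below the masker at the masker's frequency,
    # falling off 30 dB per octave on either side (illustrative numbers only).
    threshold = peak_l - 20 - 30 * abs(math.log2(f / peak_f))
    return l < threshold

# Components below the threshold can be dropped with no perceptual loss.
audible = [not masked(f, l) for f, l in zip(freqs, levels)]
print(audible)
```

With these made-up numbers, only the masker itself and the distant 4 kHz tone survive; everything close to the peak falls under the threshold and can be discarded.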
Masking in the human ear takes place within critical bands; critical bands are portions of the spectrum that are treated by the ear as a single unit. Everything that happens within a critical band cannot be further resolved by the ear, so two different frequencies falling in the same critical band are perceived as a single tone. There are approximately 24 critical bands in the human ear, and here is a picture of their distribution in frequency. As you can see, they get wider as we go up in frequency; they follow a logarithmic scale, which means that the resolving power of the ear is stronger at low frequencies, whereas at high frequencies we are less discriminating. Therefore, when we quantize across critical bands, we can probably fit more noise in the high frequencies than in the low frequencies. In the end, the purpose of the psychoacoustic model is to compute the minimum number of bits that we need to use to quantize each of the 32 subband filter outputs, so that the perceptual distortion is as little as possible. We end up with a non-uniform bit allocation which assigns fewer bits to the bands where the masking is strongest. Interestingly enough, the specifications of the psychoacoustic model are not part of the MP3 standard, which means that manufacturers of MP3 encoders can compete with better and better versions of their psychoacoustic model. In the end, the number of bits used for each subband is sent along with the quantized data to the decoder, so it doesn't really matter how this bit distribution has been generated. From a technical point of view, as you can imagine, there are a lot of fine details in the inner workings of the psychoacoustic model and the bit allocation procedure, and we will not have time to examine all of them in this presentation. But we can roughly sum up what happens inside a psychoacoustic model as follows.
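As an illustration of perceptual bit allocation, here is a minimal greedy sketch in Python. This is an assumed simplification, not the actual MP3 procedure: it hands one bit at a time to the subband whose quantization noise sits furthest above its mask, using the rule of thumb that each extra bit of a uniform quantizer lowers the noise floor by about 6 dB:

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy perceptual bit allocation (illustrative, not the MP3 algorithm).

    smr_db: signal-to-mask ratio for each subband, in dB; a large value
    means the quantization noise would be clearly audible there.
    """
    bits = [0] * len(smr_db)
    need = list(smr_db)  # remaining noise-to-mask ratio per band
    for _ in range(total_bits):
        # Give the next bit to the band where noise is most audible.
        worst = max(range(len(need)), key=lambda i: need[i])
        if need[worst] <= 0 or bits[worst] >= max_bits:
            break  # every band's noise is already below its mask
        bits[worst] += 1
        need[worst] -= 6  # one extra bit buys about 6 dB
    return bits

# Band 2 is fully masked (negative SMR) and correctly receives no bits.
print(allocate_bits([24, 6, -3, 12], total_bits=8))
```

Notice that the fully masked band gets zero bits, exactly the behavior described above: the bulk of the bit budget goes to the perceptually relevant bands.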
First of all, remember that all processing is performed on successive windows of a given length: the input signal comes in, and the stream of input samples is cut into chunks of a given length, say 1024 samples. An FFT is then used to estimate the energy of the signal in each of the subbands computed by the filter bank. For each subband, we try to distinguish between tonal and non-tonal components, that is, components that have a strong sinusoidal shape and noise-like components. We have looked at masking for tonal components, but a similar type of masking takes place for non-tonal components, and we will have to take that into account as well. The individual masking effects for tonal and non-tonal components are computed for each critical band, and then these results are summed together to obtain a global masking curve for the audio frame that we're analyzing. This masking curve is mapped onto the 32 subbands, and the number of bits that we will use for each subband is computed as a function of the signal-to-mask ratio: the power of the signal versus the masking power for each critical band. Let's now talk about the implementation of the subband filtering in MP3. As we said, the input is split across a filter bank that contains 32 filters isolating different parts of the spectrum. These filters are implemented as 512-tap FIRs, and they're followed by a downsampler by 32 to produce the subband samples. The filter prototype is a simple lowpass with a cutoff frequency of pi over 64, for a total bandwidth of pi over 32. The different subbands are obtained by modulating the base filter with a cosine at odd multiples of pi over 64, and the resulting filter bank looks like this. We're showing the positive half of the frequency axis: this would be the first lowpass filter, with index zero; this is the second one, the third, the fourth, and so on, covering the entire spectrum. Now let's go back to the implementation of the filter bank.
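The construction of the modulated filter bank can be sketched in a few lines of Python. The windowed-sinc prototype below is an assumed stand-in: the actual MP3 standard tabulates its own 512 window coefficients, but the modulation step, a cosine at the odd multiples of pi over 64 with the phase offset used later in the derivation, is as described:

```python
import math

N, BANDS = 512, 32

def prototype(n):
    # Assumed stand-in for the standard's tabulated prototype:
    # a Hann-windowed sinc with cutoff pi/64, centred at N/2.
    m = n - N / 2 + 0.5                      # half-integer: never zero
    sinc = math.sin(math.pi * m / 64) / (math.pi * m / 64)
    hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1))
    return sinc * hann / 64

h = [prototype(n) for n in range(N)]

# Band i is the prototype modulated to be centred at (2i+1) * pi / 64.
bank = [[h[n] * math.cos(math.pi / 64 * (2 * i + 1) * (n - 16))
         for n in range(N)]
        for i in range(BANDS)]

print(len(bank), len(bank[0]))  # 32 filters of 512 taps each
```

Each of the 32 impulse responses is just the same lowpass shape shifted to a different slice of the spectrum, which is what the picture of the filter bank shows.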
As you can see from this block diagram, each branch in the filter bank comprises an FIR filter of length 512 and a downsampler by 32. What this means, of course, is that 31 out of 32 output samples of each filter are discarded, so this is a very wasteful implementation. Let's try to make it a little more efficient; this is actually explained in the MP3 standard. We start with the equation that expresses the output of subband number i as the convolution of the impulse response of the filter for that branch with the input, and here you see that the downsampling translates to a factor of 32 in front of the input index. We can now replace the expression for the impulse response of the filter with the prototype impulse response times the modulating factor that brings the filter to the proper position in the frequency band. And then we're going to apply a little trick: we're going to express the index k as the sum of two indices, namely k = 64p + q, where q ranges from 0 to 63 and p ranges from 0 to 7. With this split of the summation, we can write the previous line as a double summation, for p that goes from 0 to 7 and for q that goes from 0 to 63, of the same modulation term as before, the prototype impulse response, and the input, where again we have made the substitution k = 64p + q. With this trick, we can actually simplify the first term of this double summation. Consider the cosine term: we can write it as the cosine of pi over 64 times (2i + 1) times 64p, plus some other term, let's call it f(i, q), that we don't really care about for now. Here the 64 cancels out and we're left with the cosine of 2ip pi + p pi + this term. Now, 2ip pi is a multiple of 2 pi, so it doesn't influence the angle.
And p pi is a multiple of pi, and we know that the cosine of pi + alpha is equal to minus the cosine of alpha. So in the end what we can do is simplify this cosine as the cosine of pi over 64 times (2i + 1) times (q - 16), and add a term (-1) to the power of p that we can move over to the second summation. We now have a simplified, quote unquote, expression that looks like so: an outer sum here that only involves the cosine modulation, and an inner sum here which is a pre-subsampled implementation of the filtering operation. If we work out the indices and convert this to an algorithmic procedure, this is what we need to do. We will use a 512-tap input circular buffer, and at each step we will shift in thirty-two new input audio samples, starting from the newest, so at any time the circular buffer holds 512 input samples in time-reversed order. Then we take a new 512-point buffer and fill it sample by sample with the product of the prototype impulse response and the content of the circular buffer. Next, we compute this intermediate quantity here, which is the sum of the contents of this new buffer taken 64 samples apart; we do that for 64 different values of q, and if you do the math, there are 8 points that we sum together for each q index. Finally, each subband output is given by this sum here: we take the intermediate quantity c(q) that we computed before, and we modulate it with the cosines at the frequencies that we defined in the beginning. And finally, quantization; this is where the great bit rate savings are achieved. MP3 uses uniform quantization of the subband samples, and the number of bits per sample in each subband is determined by the psychoacoustic model, as we explained before. We also said before that MP3 works on successive audio frames, a frame being a window of input samples that is processed independently. There are 36 samples per band and per frame in the MP3 standard.
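The buffer-based procedure just described, windowing, partial sums taken 64 samples apart, then cosine modulation, can be sketched directly in Python. The prototype coefficients h are left generic here, and the (-1)^p sign from the derivation is folded in explicitly (the standard's tabulated window has this sign already baked into the coefficients):

```python
import math

def analysis_step(buffer, new_samples, h):
    """One filter-bank step: consume 32 new samples, emit 32 subband samples.

    buffer: list of 512 past input samples, newest first (time-reversed).
    h: the 512-tap prototype impulse response.
    """
    # Shift in the 32 new samples, newest first.
    buffer[:] = list(reversed(new_samples)) + buffer[:-32]

    # Windowing: product of prototype and buffer, with the (-1)^p sign
    # from the derivation (p = j // 64) folded in.
    z = [((-1) ** (j // 64)) * h[j] * buffer[j] for j in range(512)]

    # Intermediate quantity: 64 partial sums of 8 terms, 64 samples apart.
    c = [sum(z[q + 64 * p] for p in range(8)) for q in range(64)]

    # Cosine modulation onto the 32 subbands.
    return [sum(math.cos(math.pi / 64 * (2 * i + 1) * (q - 16)) * c[q]
                for q in range(64))
            for i in range(32)]
```

Compared to the naive implementation, no output is ever computed and then thrown away: each call does one 512-point windowing pass, 64 short sums, and a 32-by-64 modulation, instead of 32 full 512-tap convolutions.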
Since all of the 36 samples are going to be quantized by the same quantizer, a rescaling is needed so that we use the full range of the quantizer. Remember how uniform quantization works: a quantizer maps an input interval to a set of quantization levels, and of course you have to make sure that the range of your input signal matches the range of the quantizer. For instance, this quantizer expects the input to range from -1 to 1, but if your actual input only lives in this small sub-interval, you will not be able to make use of the full quantization range. So rescaling would normally imply a perfect renormalization of the 36 samples, by dividing the samples by the largest sample in magnitude. Of course, in order for the decoder to then reconstruct the actual levels of the input, we would have to send this normalization factor alongside the quantized data. But this would require a lot of side information: we would use 16 or 32 bits to encode the normalization factor. Instead, the MPEG standard defines 16 predefined scale factors; we choose the one that best matches the actual range of the input, and only use four bits to communicate this range to the decoder, thanks to the fact that these predefined levels are set in stone. Finally, the actual quantization is performed according to this formula, where b is the number of bits as provided by the psychoacoustic model, and Qa and Qb, functions of the number of bits, are parameters that are defined in the MP3 standard. Finally, let's listen to some examples. We all know that MP3 works very well, so what we want to concentrate on here is the importance of the variable bit allocation across the subbands as performed by the psychoacoustic model. For a fixed bit budget we could choose to allocate the same number of bits to all subbands, which would be uniform bit allocation, or we could use a psychoacoustic model and allocate the bits smartly across subbands.
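The rescaling and quantization steps can be sketched as follows. Both the 16-entry scale-factor table and the plain uniform quantizer below are assumptions for illustration: the standard tabulates its own scale factors and its own Qa and Qb quantization constants, which we replace here with a generic mid-rise uniform quantizer on [-1, 1):

```python
# Assumed scale-factor table: 16 levels in half-power-of-two steps.
SCALE_FACTORS = [2.0 ** (-k / 2) for k in range(16)]

def quantize_frame(samples, bits):
    """Rescale a frame of subband samples and quantize with `bits` bits."""
    peak = max(abs(s) for s in samples)
    # Smallest predefined factor that still covers the frame's peak,
    # so the frame fills as much of the quantizer range as possible.
    candidates = [f for f in SCALE_FACTORS if f >= peak]
    scale = min(candidates) if candidates else SCALE_FACTORS[0]
    levels = 2 ** bits
    step = 2.0 / levels  # uniform quantizer over [-1, 1)
    indices = [min(int((s / scale + 1.0) / step), levels - 1)
               for s in samples]
    # The decoder needs only the 4-bit index of `scale` plus the indices.
    return SCALE_FACTORS.index(scale), indices
```

Only four bits of side information per frame are spent on the scale factor, instead of the 16 or 32 bits an exact normalization constant would cost.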
And so here are the examples, starting with the original signal [MUSIC]. Now let's listen to the same signal encoded with uniform bit allocation [MUSIC]. And finally, this is the result of a full-fledged MP3 implementation with psychoacoustically based bit allocation [MUSIC]. Of course, in both the uniform and non-uniform bit allocation encoding schemes, the target bit rate was very, very low in order to exacerbate the defects of quantization. But the principle holds for all bit rates.