Audio Codec Comparison with Correlation


AMTuring
18th April 2005, 18:49
This study should be of interest to those who compress audio for their movies using variable bitrate programs. It is also of interest for those who compress their music.

Be warned that this is a mathematical analysis and is not recognized by many in the audio compression community as a valid method. Caveat emptor.

I compared 7 audio codecs: Apple (iTunes), LAME (mp3), MPC (Musepack), Nero Audio, Ogg Vorbis, Real Audio, and Windows Media Audio.

The correlation based measures of fidelity showed that Ogg, Apple, LAME and MPC were the best codecs. Which one is the best depends on the criterion used for comparison.

MPC gave the best "minimum correlation." Ogg gave the best P99.9 (defined as the correlation that 99.9% of the clip falls above). However, LAME was not far behind, and Apple was better than MPC on the P99.9 comparison.
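To make the two criteria concrete, here is a minimal Python sketch of how a "minimum correlation" and a P-level could be computed from per-window correlations. The non-overlapping 100 ms windows, the function names, and the nearest-rank percentile rule are all assumptions for illustration; the study itself may window and rank the data differently.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def window_correlations(original, decoded, window=4410):
    """Correlation of each non-overlapping window (4410 samples = 100 ms at 44.1 kHz)."""
    n = min(len(original), len(decoded))
    return [pearson(original[i:i + window], decoded[i:i + window])
            for i in range(0, n - window + 1, window)]

def p_level(correlations, percent):
    """Nearest-rank value that roughly `percent` % of the window correlations reach."""
    ordered = sorted(correlations)
    k = int(len(ordered) * (100 - percent) / 100)
    return ordered[min(k, len(ordered) - 1)]

def minimum_correlation(correlations):
    """Worst single window in the clip."""
    return min(correlations)
```

With these helpers, `p_level(corrs, 99.9)` and `minimum_correlation(corrs)` would give the two numbers compared above, where `corrs` is the output of `window_correlations` for an original clip and its decoded version.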

WMA was the worst of the 7 codecs.

For details of the analysis download the study from Codec Comparison (http://home.broadpark.no/~clarkt/Codec_Compare.pdf).

Comments or suggestions are welcome.

johnman
19th April 2005, 14:42
I haven't read the whole thing (yet) but it looks interesting.
If you already answered this remark in your paper you can disregard my question.

You say the ultimate goal is to produce a procedure that could be used in place of human testing.

I'm not really into this stuff, but isn't the psychoacoustic model an important part of a codec? I wonder if it is possible to test such a model mathematically. Can you give some comments on this?

AMTuring
19th April 2005, 17:59
Thanks for your comments. That is an important question that I have not evaluated fully. I need to really dig into the inner workings of mp3 etc. to answer it correctly. But here are my thoughts on the subject:

First, more research is needed on this subject. I do not claim to have all the answers. I just claim to be searching for answers. To anyone reading the paper, remember this method has not stood the test of time and listening tests are still the gold standard.

However, I believe that if a computer program can compress audio that is transparent to the human ear, it must eventually be possible to design a computer program that can measure how well this has been done.

As I understand it the psychoacoustic model says that you can drop certain parts of the signal when it is a small part of the overall signal. I have seen it described as a low amplitude frequency superimposed on a larger amplitude different frequency. You can drop the lower amplitude signal and not hear the difference.

Correlation works in the same way. So long as the majority of the energy in the signal is represented in the compressed version the correlation will be high. Correlation will only drop a lot when a significant fraction of the energy is dropped. Described in this simplistic way the two ideas sound compatible.
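The energy argument can be illustrated with two pure tones. This is a minimal sketch under simplifying assumptions: the "codec" is modeled as simply deleting the weaker tone, the tone frequencies and amplitudes are made up, and both tones fit a whole number of cycles into the window so they are orthogonal. In that case the correlation works out to the square root of the fraction of energy that is kept.

```python
import math

RATE = 44100   # samples per second (CD audio)
N = 4410       # 100 ms window; 1 kHz and 3 kHz both fit whole cycles into it

def tone(freq_hz, amplitude):
    """One window of a sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / RATE) for n in range(N)]

def correlation(x, y):
    """Normalized inner product; equals Pearson correlation for zero-mean signals."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def correlation_after_dropping(weak_amplitude):
    """Correlate (strong + weak) against the strong tone alone."""
    strong = tone(1000, 1.0)
    weak = tone(3000, weak_amplitude)
    original = [s + w for s, w in zip(strong, weak)]
    return correlation(original, strong)  # the toy "codec" keeps only the strong tone

# Weak tone carries 1% of the strong tone's energy: correlation is about 0.995.
print(correlation_after_dropping(0.1))
# Weak tone carries as much energy as the strong one: correlation is about 0.707.
print(correlation_after_dropping(1.0))
```

Dropping a component with 1% of the energy barely moves the correlation, while dropping half of the energy pulls it down to roughly 0.71.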

The twist that is in audio compression that I have not taken into account is the threshold of hearing vs. frequency. This is a highly non-flat curve. Audio compression uses this to redistribute bits between the frequencies. More bits are used in the 1-3 kHz range where the ear is most sensitive, and fewer bits elsewhere.

This could actually be included in the analysis by filtering the correlations to simulate the human ear's sensitivity. However, I don't think this is necessarily a good idea. The artifacts that occur in compressed audio are across the entire spectrum. If I weight down the frequencies outside 1-3 kHz then I could mask something that a listener could detect when other frequencies are not present.

Programming this model would take a lot of effort, but I know how to do it. After I study more about how compression works I will try it out.

However, whose hearing should I simulate? No two people are identical. If I use the one that is built into LAME (or any other program) won't I prejudice the results in favor of that program? Anyway it is a lot to think about.

Finally, it is important to note that the lack of frequency bias in my program means it could produce false negatives (identify a clip as bad that a listener would think is OK). But it does not lead to false positives. A high correlation value will also pass a listening test.

Thanks,

Turing

johnman
19th April 2005, 21:52
that is a long and eloquent answer :)

I think it would be great if it were possible to mathematically prove which codec is best. Even if it doesn't succeed, it may help to get a better understanding of the perceived quality etc.

But to be honest I don't think it is ever possible, for a couple of reasons.

- The human hearing organ is VERY complex. I don't think it is possible to get all that complexity into a formula.

- As you mentioned, different people hear different things, and when you are talking about general hearing capabilities, well.... this is not exact science anymore :)

- You say that "it could produce false negatives (identify a clip as bad that a listener would think is OK). But it does not lead to false positives." Isn't it possible for 2 different files to have the same correlation AND still not sound identical?

Although there are some pretty big obstacles I think this research is very interesting :)

Something I wondered about a long time ago is how to search for almost identical files on a computer. Wouldn't it be possible to use your correlation algorithm to find clones of the same song? I wanted to put this in my program a long time ago, but at the time I didn't know any good algorithm for this purpose.

Another thing that comes to my mind right now is to somehow use a neural network to test the quality and to combine the results with your correlation.... This is not something you can do on a rainy afternoon, but when the results become more and more serious it might be a good way to fine-tune the algorithm for even more reliable results.

pieroxy
20th April 2005, 15:25
I think this research is flawed from the beginning. :( You are looking for a formula which would say how close two files sound to one another. This already exists: It is called a psychoacoustic model (although it is not expressed in this form).

To simplify:

Psychoacoustic model says: "This frequency can be dropped because it is the least audible in the spectrum"

Your formula would say: "This file sounds very much like this other one because none of the 'too audible' frequencies are missing"

So all in all, there is not much difference between both programs. They just express the same truth in a slightly different way: classifying frequencies from more to less audible. So if you tune your program well with - say - LAMEEnc, it will be very close to the lame psychoacoustic model and everything else will fail miserably against an obviously flawed equation.

Of course, I simplified a great deal.

Or it could be that I just didn't get it. :confused:

AMTuring
20th April 2005, 18:22
You may be right that the approach is flawed, but I find the research interesting. If no one ever tries something like this we will be waiting on HA for a long time.

I think I said the same thing in a different way. But my point was that I don't want to tune my program to favor a particular psychoacoustic model. That is why I have not gone down that road.

Instead I am currently pursuing multivariate analysis of my results vs. listening tests. My goal is to reproduce the human ear, so that is the criterion I am working on. I am gambling that I don't have to filter the results to further mimic the human ear.

So far, this analysis confirms that the P99.9 is the most significant measure of quality in order to match listening tests. However, is that the best variable? I have only looked at a few.

The P99 and average do not turn out to be good measures, but this is very preliminary. I am going to try and repeat a recent listening test and then do some more analysis.

AMTuring
20th April 2005, 18:31
@johnman

You could modify the align program in the appendix to search for a particular subclip in another song. Right now it is hard wired to a 100 ms (4410 samples) clip and it searches in both directions (advance or delay) for the best correlation (up to 2000 samples).

If you are proficient with C it could be changed to do what you want. You need to force it to search the entire second file for the clip found in the first file (remove the 1000 ms limit). Also you only need to search forward which is the sub function called by the main align function.

The program would tell you the delay into the second file that gives the best correlation. If the correlation is good (say more than 70%) you will still need to listen to see if it is the right one. However, you could probably skip low correlations.
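The search described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the C align program from the paper's appendix: the function names are made up, the whole second file is scanned in the forward direction only, and the brute-force loop is only practical for short signals held in memory.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sample lists (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def find_clip(clip, song):
    """Return (offset, correlation) of the best match of `clip` inside `song`.

    Returns (-1, -2.0) if the song is shorter than the clip.
    """
    best_offset, best_r = -1, -2.0
    for offset in range(len(song) - len(clip) + 1):
        r = correlation(clip, song[offset:offset + len(clip)])
        if r > best_r:
            best_offset, best_r = offset, r
    return best_offset, best_r

# Toy data: random "audio" with the clip cut from a known position.
rng = random.Random(0)
song = [rng.uniform(-1, 1) for _ in range(1000)]
clip = song[300:400]
offset, r = find_clip(clip, song)
print(offset, r)
```

A real song at 44.1 kHz has millions of samples, so a practical version would need a faster correlation (for example FFT-based) or a coarser search step.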

Because of the phase ambiguity some clips may not correlate well if they have been modified with effects filters. As discussed in my LAME paper, a 90 degree phase shift will sound the same, but will give a zero correlation. For codec analysis I have verified that this is not an issue.
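The 90 degree case can be checked numerically. A minimal sketch, assuming a single 1 kHz tone that fits a whole number of cycles into a 100 ms window (both values chosen only for illustration): for such a tone the correlation with a phase-shifted copy equals the cosine of the shift.

```python
import math

RATE = 44100   # samples per second
N = 4410       # 100 ms window, exactly 100 cycles of a 1 kHz tone

def shifted_tone_correlation(phase_degrees, freq_hz=1000):
    """Correlation between a sine tone and the same tone shifted in phase."""
    phase = math.radians(phase_degrees)
    x = [math.sin(2 * math.pi * freq_hz * n / RATE) for n in range(N)]
    y = [math.sin(2 * math.pi * freq_hz * n / RATE + phase) for n in range(N)]
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

print(shifted_tone_correlation(10))   # close to cos(10 degrees), about 0.985
print(shifted_tone_correlation(90))   # essentially zero
```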

I would recommend you confine the approach to clips that are small enough to fit in a memory array.

Thanks for your interest,

Turing

AMTuring
22nd April 2005, 06:17
Since I wrote the study, I have put together all 18 of the clips from Roberto's multi-codec 128 kbps listening test. My procedure reproduces those results except for the iTunes result. It comes out better with my correlation analysis. However, the iTunes version I am using is newer and it may have actually improved.

I also found that LAME 3.96.1 produces better results than 3.96. If I use 3.96 I get the same ordering of codecs found in the listening test. Ogg Vorbis and MPC are first, followed closely by LAME and iTunes. WMA is the lowest. If I use 3.96.1, LAME is tied with Ogg and MPC for first.

Statistical analysis using a t-test showed that, of the parameters I have examined, P99.9 and minimum correlation show a significant linear relationship with the listening test ratings. The P99.9 value shows the best individual correlation. According to the t-test the P99.9 value reproduces the listening tests for all 18 songs and all of the codecs (except ATRAC3, I don't have that one) with a 97.4% chance that the result is not by chance.

I will put together the details in another study next week.

johnman
24th April 2005, 03:15
I forgot to mention it earlier, but the results I saw in the pdf and the latest results you seem to have are in line with my expectations. To me this seems very promising. If your analysis were wrong, it's very unlikely you would get the "correct" results. I'm looking forward to the details :)

AMTuring
30th April 2005, 12:52
I have been doing a lot of reading regarding statistical analysis. My previous post was a little exaggerated. t-tests are not appropriate for the kind of comparison I am doing between listening tests and correlation studies.

My analysis is now published and can be downloaded from Calibration of the method to listening tests. (http://home.broadpark.no/~clarkt/Calibrating.pdf).

Once again I am wrong. Many months ago I used average correlation to judge codec quality. This is clearly not correct as discussed in my LAME Codec comparison post here on doom9.

I then switched to P99.9 (99.9% of the correlations are larger than this value). However, a proper statistical analysis shows that P95 is the best (95% of the correlations are above this value).

Using P95 I was able to reproduce the overall Amorim Multi-format Listening tests within 90%. I think this is still pretty impressive.

I have finished a new multi-codec comparison using P95. I also merged the two sets of test clips into a set of 35 songs to test. I will try to get this published by next week.

sterlina
4th May 2005, 21:38
very interesting study!
a question: is the fact that P95 matches human ears better than P99 because of... "human imperfection"?

AMTuring
5th May 2005, 14:39
@sterlina

There could be some elements of human imperfection in the result, but I think it is more to do with the length of the audio clips used in these kinds of tests. For copyright and sanity reasons they are always less than 30 seconds. Many are 20 seconds.

1% of a 20 second clip is 0.2 seconds (200 msec). I think P99 would show a better correlation if the clips were longer. The same kind of reasoning says that the P99.9 is redundant with the minimum value (they both measure practically the same thing).

I am working on a new 8 codec comparison paper using P95 as the criterion. It is taking me a long time because I have been trying to do it right. I discovered that Tukey HSD analysis used for comparison during listening tests is not the right way to handle this kind of comparison. The Tukey HSD test assumes equal variance between the codecs. I found the variance to vary by as much as 10 times between codecs. The appropriate comparison tool is the Games-Howell HSD. It corrects for different variances.

A preview of the results: Nero AAC beats all the other codecs at just about every bitrate setting. The only one it loses is the LAME --preset standard comparison, where LAME and Real Audio produce files around 200 kbps and Nero's files are around 180 kbps. When the --preset extreme comparison is made Nero wins again. Here is a summary:


VBR around 96 kbps

Nero is better than Real HA WMA iTunes LAME Ogg MPC
Real is better than LAME Ogg MPC
HA is better than LAME Ogg MPC
WMA is better than Ogg

VBR around 128 kbps

Nero is better than MPC Ogg LAME HA WMA Real
MPC is better than HA WMA Real
iTunes is better than Real
Ogg is better than Real
LAME is better than Real

VBR around 160 kbps

Nero is better than Ogg HA Real
MPC is better than Ogg HA Real
WMA is better than Ogg HA Real
LAME is better than Ogg Real

VBR around 192 kbps

Real is better than Nero Ogg iTunes MPC WMA
HA is better than Nero Ogg iTunes MPC WMA
LAME is better than Nero Ogg iTunes MPC WMA
Nero is better than MPC WMA

VBR around 256 kbps

Nero is better than WMA LAME Ogg MPC iTunes HA
Real is better than Ogg MPC iTunes HA
WMA is better than MPC iTunes HA
LAME is better than MPC iTunes HA
Ogg is better than MPC iTunes HA


These are all 95% confidence differences using the Games-Howell method. HA is short for the LAME 3.90.3 program from HydrogenAudio. LAME is the LAME 3.96.1 version from the LAME project at sourceforge.

This comparison used a combination of clips from Guruboolez's LAME listening tests and Roberto Amorim's 128 kbps multi-codec comparison. There were 35 clips in all.

The biggest surprise to me was the poor performance of the HA version of LAME on the extreme setting. I have seen signs in earlier tests that this was true, but so many people swear by this version and setting that I have looked at it very carefully. In a sample-by-sample t-test comparison between 3.96.1 and 3.90.3, 3.96.1 is better than 3.90.3 with a 0.02% probability that this arose by chance! That's right, 2 hundredths of a percent.

Fortunately the HA guys don't believe in mathematical comparisons, so they can ignore this result.

Gabriel_Bouvigne
9th May 2005, 10:18
Originally posted by AMTuring
Fortunately the HA guys don't believe in mathematical comparisons, so they can ignore this result.

Neither psychoacoustical encoder developers...

...however, your scheme would work well for codecs not using any psychoacoustic model: ADPCM, Wavpack hybrid,...

AMTuring
12th May 2005, 06:49
Originally posted by Gabriel_Bouvigne
Neither psychoacoustical encoder developers...

...however, your scheme would work well for codecs not using any psychoacoustic model: ADPCM, Wavpack hybrid,...

I think the key word is "model." Remember that the psychoacoustic models are not an implementation of the human hearing system.

In any case, my studies indicate that LAME 3.97 alpha 10 will be the best version yet. Thanks for your tireless effort on this great program.

AMTuring
18th May 2005, 18:29
To those who think that the Psychoacoustic model is very different than the correlation technique used here: I think you are incorrect.

I have been studying the PA model some more and found the following facts (refer to Simon Fraser University Audio Compression Theory (http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/Chap4.4/Chap4.4.html) during the following discussion. It has some simple illustrations that may make these topics easier to understand) :

1. A person with good hearing has a 20 Hz to 20 kHz range, but the sensitivity to low volumes outside the 2-4 kHz range falls off rapidly. SFU sensitivity curve (http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/Chap4.4/Topic6.fig_66.gif)

2. The closer two pure tones are to each other the higher the volume of the weaker tone needs to be to be detectable. This is called Frequency Masking in the lossy codec literature. SFU Frequency Masking Curves (http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/Chap4.4/Topic6.fig_72.gif)

3. After a loud tone ceases it takes a while to hear a weaker tone that was masked by the louder tone. This phenomenon is used to continue to mask channels even after the dominant channel has dropped in amplitude. This is called Temporal Masking. SFU Temporal Masking (http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/Chap4.4/Topic6.fig_73.gif)

At first glance these appear quite different from the correlation technique used here. However, they are actually related (at least in the case of 2 and 3).

Fact one is used to redistribute bits amongst frequency channels. Since my correlations do not do this there will be a random noisy difference between the results of a correlation technique that took this into account and my correlation values. However, the average values will converge to the same value using the law of large numbers.

Fact 2 is actually just a statement in digital language of the analog phenomenon that a finite time function must have a finite frequency resolution (the Uncertainty Principle). This is introduced into my analysis by using a windowed autocorrelation value that is adjusted to match known human hearing specifications.

The difference between the measured human response and my analysis is that the resolution is constant on a log-frequency scale in the psychoacoustic models. Windowed autocorrelations produce a constant resolution on a linear-frequency scale. In other words, the psychoacoustic models imply a frequency dependent window function instead of the frequency independent window function used in my analysis. I plan on pursuing this avenue to see if I can improve the correspondence between this approach and listening tests.

In spite of the imperfect match between these resolution responses, fact 1 mitigates this problem. Since the human response is dominated by what happens in the 2-4 kHz range the difference between a constant resolution and a log-constant resolution is not huge over a single octave. So it is not clear how important this difference really is.

Fact 3 results directly from using a causal window function for the autocorrelation functions. The “memory” of the previous tone persists for some time as the windowed autocorrelation function slides forward in time. Hence temporal masking is a direct result of any causal windowed autocorrelation analysis.

To sum it up: The psychoacoustic model is not as incompatible with this correlation analysis as it first appears. With some adjustment it should be possible to remove the remaining differences and improve the correspondence between the two theories.

P.S. Sorry about the delays in publishing my analysis. I am on the final draft. I have been diverted by testing different window functions to see if I can improve the results (it doesn't), re-compressing my entire CD collection with Nero, and I took some holidays in Stockholm. Stay tuned.

AMTuring
21st May 2005, 10:30
I have finally finished my paper on codec comparisons. I still need to dig into the psychoacoustic model a bit more to see if the procedure can be improved.

Here is a table to summarize the results:

http://home.broadpark.no/~clarkt/Summary.gif

You will note that Nero dropped on the 96 kbps test and came up on the 192 kbps test. This is because I had originally used the 124 kbps setting for the 96 kbps comparison (trying to avoid the HE profile in AAC) and the 180 kbps setting for the 192 comparison. I switched this to the next higher setting (extreme) because I was not using this setting in any other comparisons.

The full paper (with working hyperlinks) is available here:
Correlation Comparison of Codecs (http://home.broadpark.no/~clarkt/Codec_Compare_Final.pdf)

Again any feedback is welcome.

I am currently reading through the PEAQ standard (which produced the eaqual program). I can see why mathematical measurements have a bad reputation.

johnman
22nd May 2005, 21:09
You made it to the frontpage of cdfreaks.

good job Turing :p

Gambit
22nd May 2005, 22:33
Nice.

But completely useless.

AMTuring
23rd May 2005, 18:04
@johnman - Hey, no snickering.

@Gambit - Useless to you but invaluable to me. I refuse to take my advice from a site dominated by paranoia and self-serving (ha ha). For some crazy reason I find mathematics more trustworthy than the ears of one or two people. How I waste my time is my business, right? Anyway thanks for the nice comment.

@everybody

Most people won't actually read the paper, so I am reproducing here some key conclusions that apply to listening tests as well as mathematical tests:

Several things emerged as a result of this study that should also be taken as lessons for those conducting listening tests:

1. Use the Games-Howell test: The Tukey HSD test should not be used for either listening tests or mathematically based codec comparisons. This is due to the wide range of mean-squared error in different codec scores. The Games-Howell significance test offers a similarly high degree of confidence and accounts for the difference in mean-squared error.

2. Compare Residual Scores: Statistical comparisons of scores should not be based on the raw scores from different audio clips. Even the Games-Howell test assumes that each codec has a constant mean. This assumption is not true because different clips produce different quality for both listening tests and mathematical comparisons. Much more discrimination can be inferred if the residual scores are compared (i.e. subtract the average score for each clip from the individual scores). Unless you think that all listeners are the same, the same procedure should be applied in multi-listener tests on individual clips as well.

Kurtnoise
25th May 2005, 20:07
Originally posted by AMTuring
Most people won't actually read the paper...
maybe because most of them don't understand such statistics tests. ;)


If you want more interest, try to explain a little more of the statistical theory. btw, it's very interesting... at least for me.

gURuBoOleZ
26th May 2005, 09:44
According to your methodology (objective comparison), mpc --extreme (200 kbps) is one of the worst competitors at mid/high bitrate (even the worst, if we keep in mind that the lowest-ranked one is wma at... 150 kbps). Have you ever seen listening tests that conclude that MPC is weak at high bitrate? I'm interested to see a link :)

I'm sorry, but your table is not a "supplement" to ABX tests; it reveals a lot of inconsistencies with subjective test results.

Some comparisons are completely senseless too. How could you compare, as being in the same class, wma at 97 kbps with various encodings whose bitrates lie between 120 and 137 kbps?! The MPC bitrate is 41% higher than the WMA one. No need to build an objective methodology to discover that mpc is superior to wma :rolleyes: Why not use CBR@128 instead (2-pass encoding is possible IIRC), as you did with iTunes?
You've also pitted WMA@150 kbps against MPC@217 kbps and Nero@218 kbps: +44% for these two contenders. Wouldn't WMA CBR at 192 kbps be more pertinent?

And why use 3.90.3 with -V6 presets, when the use of VBR at mid/low bitrate was always discouraged by the developers and tuners themselves?


As gambit said: useless (and maybe biased).

AMTuring
26th May 2005, 20:43
It’s not easy to explain, but here goes:

1. Why Tukey is not appropriate for audio comparisons:

Significance tests are used to determine if a difference just arose by chance or it really means something. The test used in most audio tests I have seen is the Tukey HSD (Honestly Significant Difference) test. I contend that this is not the right test for most codec comparisons.

In general, there are random errors in every experiment, so one of the things being compared will almost always have a higher mean (average value) than the others. However, this could just be "noise" due to random variations.

Tukey assumes that all of the things being compared (medical treatments, experimental protocols, audio codecs, etc.) have the same mean squared error (MSE, i.e. variability). However, all of the evidence is against this assumption for audio codecs. My tests showed the MSE varies by an order of magnitude (factor of 10) in many of the tests.

You will find the same variability between codecs in listening tests on ff123, rjamorim, or Hydrogenaudio. This variability is not due to the method, but is inherent in the difference in codecs. If you think about it, why should audio compression software written by different groups with different methods produce similar variability? They don't.

Why is using Tukey this bad? Since the difference that is judged to be significant is the same for all codecs in a Tukey test, you may make two kinds of errors. Codecs with small MSE, when compared, should have a better ability to say whether the difference is significant. Therefore, Tukey can miss significant differences for the low MSE codecs. At the other end of the spectrum, when codecs with a high MSE are compared the Tukey test may say the differences are significant. However, a high MSE means there is less certainty as to the mean (or goodness of the score).

This subject is discussed in many places. See the paper for a handful of the many references available. The recommended solution is to use the Games-Howell significance test. This test calculates a different threshold for each comparison to take the variability into account.
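For readers who have not met the Games-Howell test, here is a minimal sketch of the per-pair quantities it uses, written from the standard textbook formulas rather than taken from the paper. Each pair of codecs gets its own standard error and its own Welch-Satterthwaite degrees of freedom, which is where the "different threshold for each comparison" comes from. The function name and the sample scores are made up for illustration.

```python
import math
from statistics import mean, variance

def games_howell_pair(a, b):
    """Return (t, df) for one pairwise comparison without assuming equal variances."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom: specific to this pair.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up scores for two codecs with very different spread.
steady = [1, 2, 3, 4, 5]
erratic = [2, 4, 6, 8, 10]
t, df = games_howell_pair(steady, erratic)
print(t, df)

# The pair is judged different when |t| exceeds q / sqrt(2), where q is the
# critical value of the studentized range distribution for k groups and df
# degrees of freedom (for example scipy.stats.studentized_range.ppf(0.95, k, df),
# if SciPy is available). Tukey HSD would instead use one pooled MSE and one
# shared threshold for every pair.
```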

In fact, the recommended procedure in the statistics literature is to run both Tukey and Games-Howell. When Tukey gives significantly more degrees of freedom than Games-Howell it is the better choice. In my tests, Tukey was always below the largest number of degrees of freedom offered by Games-Howell.

2. Why raw scores should not be compared - compare residual scores instead:

Even Games-Howell assumes that the MSE value is constant within a Codec. However, this is definitely not the case. Otherwise how would there be such a thing as an "mp3 killer" audio clip? In fact, all the listening tests show a wide range of scores for any given codec when different audio clips are compared.

The programmers may try to produce constant quality, but they have not succeeded in any of the codecs others or I have tested.

I came up with a simple way to attack this problem. Instead of comparing absolute scores that included the variability between audio clips, I subtracted each clip's mean score over all 8 codecs from the individual codec scores for that clip. This is what I called the residual score. It preserves which codec did better but removes some of the effects of audio clip sensitivity.
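The residual score is easy to show in code. A minimal sketch, assuming the scores are stored as a nested dictionary indexed by clip and then codec; the layout, the names, and the numbers are hypothetical and are not results from the study.

```python
def residual_scores(scores):
    """scores[clip][codec] -> same layout with each clip's mean score subtracted."""
    residuals = {}
    for clip, by_codec in scores.items():
        clip_mean = sum(by_codec.values()) / len(by_codec)
        residuals[clip] = {codec: s - clip_mean for codec, s in by_codec.items()}
    return residuals

# Made-up numbers: one easy clip and one hard clip, two codecs.
scores = {
    "easy_clip": {"codec_a": 0.99, "codec_b": 0.97},
    "hard_clip": {"codec_a": 0.80, "codec_b": 0.78},
}
for clip, by_codec in residual_scores(scores).items():
    print(clip, by_codec)
```

In the raw scores the clip-to-clip spread (about 0.19) swamps the codec difference (0.02). After subtracting each clip's mean, codec_a sits about 0.01 above and codec_b about 0.01 below on both clips, so the codec difference is no longer buried in the clip variability.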

I think the same procedure should be applied to listening tests.

What is bad about not using the residual scores for comparison?

Significance tests require larger differences between scores if there is a high MSE on each codec. I found that removing the audio clip variability reduced the MSE significantly. The result is that smaller differences could be judged to be significant.

AMTuring
26th May 2005, 21:28
@Guruboolez

I'm sorry you find this useless. As I said I did it for myself, but I thought others would be interested.

I find your listening tests useful, but not definitive. If hundreds of Gurus were doing the tests it would be a lot more conclusive.

Yes I'm biased (I will say what my bias was at the end of this post).

I explained the pairing in the paper. There is no unique way to do the groups. WMA was particularly difficult, because it has huge jumps between bitrates as I adjust the quality. Note: I am only using the free version that comes with WMPlayer 10. Maybe there are better tools? Anyway it was 97 or 150 vs. the others.

I agonized over the Nero choice. Roughly speaking it spanned 180 to 220. LAME was around 200. Which comparison is unfair?

My reasoning was that these are the settings most people would use. I am personally only interested in VBR encoding, so I did not test CBR or ABR.

I include iTunes because it is in older tests. It also beat Nero in the past (see rjamorim's AAC tests at 128). When Quicktime 7 gets the VBR mode working I will retest iTunes (or Quicktime if necessary).

Why use 3.90.3 -V6? I did this deliberately as a control. It is not recommended by HA, and I assume that is for a good reason.

Anyway I don't want to argue with you. I really respect your work and I always look forward to your next post on Hydrogenaudio.

My Bias

Finally, what was my bias? I was a LAME 3.96.1 fan. I thought 3.90.3 was an effete piece of junk. I encoded most of my CD collection into 3.96.1 preset standard. I assumed that WMA was the worst. I also assumed that MPC and Ogg Vorbis were really good because of earlier tests and the large user base each one has.

I was really shocked to find Nero at the top. It makes sense from a scientific perspective. AAC has more bells and whistles than any other codec AFAIK. It certainly is a superset of mp3. So it only makes sense that it has the potential to beat mp3.

You will also note that I concluded that 3.90.3 is better than 3.96.1 for preset standard. This is not what my bias would have forced. You should also note that my techniques agree with your opinion of LAME 3.97 alpha 10; see LAME version comparison with correlation (http://forum.doom9.org/showthread.php?s=&threadid=93136).

AMTuring
27th May 2005, 17:54
Guruboolez's post brings up an important point. I am short of good listening tests to compare to. Roberto Amorim's test was over a year old and it was already difficult to reproduce the software versions.

I could also really use some higher bitrate comparisons (192 or 256 kbps). These seem to be very limited.

I would appreciate any links to good tests for comparison.

I am studying up on the physiology of hearing and psychoacoustic models right now. When I have digested some of this I plan on embarking on a serious campaign to develop a method that will match the best human testers like Guruboolez.

The current results are only indicative of what can be done. There are still some flaws in the method, but they are not incurable.

Thanks,

ff123
6th June 2005, 15:54
If you want a statistical method which takes into account the correlations of the data, I suggest the bootstrap resampling method which I coded up here:

http://ff123.net/bootstrap/

The one nice thing about Tukey's HSD is the very thing you complain about -- the fact that the error bars are the same for every comparison. But that allows us to draw an easily understood graph. Say, for example, that there are 6 codecs to be compared. With Tukey's HSD, you can draw a graph with 6 means and 6 error bars. With the bootstrap resampling method, you must look at a table of p-values (there will be 15 p-values for 6 codecs).
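The count of 15 is just the number of codec pairs, 6 x 5 / 2. The sketch below shows that count and a generic paired bootstrap for a single pair. It is not the code from the ff123 bootstrap page: the function name, the centering step, and the resample count are assumptions, and there is no adjustment for multiple comparisons.

```python
import random
from itertools import combinations

def bootstrap_p_value(diffs, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for 'the mean paired difference is zero'.

    `diffs` holds one score difference (codec A minus codec B) per clip.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = sum(diffs) / n
    centered = [d - observed for d in diffs]   # shift the data so the null holds
    hits = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if abs(sample_mean) >= abs(observed):
            hits += 1
    return hits / n_resamples

codecs = ["A", "B", "C", "D", "E", "F"]        # placeholder names
pairs = list(combinations(codecs, 2))
print(len(pairs))                              # 15 pairwise comparisons for 6 codecs

# Made-up per-clip differences for one pair.
print(bootstrap_p_value([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1]))
```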

If you take various data sets for listening tests which have been performed, you'll find that Tukey's HSD actually does quite well in comparison with bootstrap resampling, and even in comparison with a plain ANOVA without any type of Bonferroni correction which adjusts for multiple comparisons.

In any case, I think the biggest statistical problem with the listening tests we've performed is in trying to rate codecs based on combining the results for multiple music samples. Technically, the tests only say something about each music sample individually. I'd much prefer a gigantic set of music samples (say, like 30 or more), and a huge group of listeners (say 30 people or more), where each listener rates all codecs for each music sample. But I'm dreaming of course.

ff123

AMTuring
10th June 2005, 22:07
Thanks for your helpful comments. I will read up on your post.

However, I think you should consider the Games-Howell test. From my internet research, the proper procedure is to do Tukey and Games-Howell. Use whichever one gives the most degrees of freedom. You don't really know unless you try both.

I also dream of large numbers of testers on large numbers of song clips. This appears to be a dream.

I will continue to work on computer based methods to make this dream a reality. I have a new algorithm to simulate the psychoacoustics of the human ear. It looks promising. However, I am busy with work right now, and it will be some time before I can get back on this research in a serious way.

Isn't it odd how people have no problem with the concept of a computer program that can create compressed audio that passes a listening test, but the even easier sounding idea of measuring quality is impossible?

Thanks again ff123. You and Roberto Amorim have inspired me to continue publishing my results.