Doom9's Forum - View Single Post - MSU Video Codecs Comparison 2019

benwaggoner · 19th February 2019, 21:39

It's nice to see this year's updates! I look forward to the results. I do have a few questions and comments, based on this post and the linked rules.

It is very odd to read "ripping" in 2019. Might be better to call it "offline" or "high quality."
Ultra-ripping is only half the speed of ripping; not a very big jump. I'd suggest making encoding time unlimited (but reported), or <=0.25fps. AV1 can get SLOW.
Having UHD only at 20fps seems really limited; that's fewer MIPS/pixel than FullHD Fast! Having a 1fps test and perhaps a 0.2fps would allow use of more advanced HEVC features in UHD.
It seems there should be SOME limit on GOP duration, probably somewhere between 4 and 10 seconds. That will ensure IDR placement and the quality of inter-GOP transitions are included.
I also suggest having some VBV limits and level limits. It would be reasonable to limit to at least the maximum allowed in the lowest profile @ level that supports the frame size and fps targeted. Accurate VBV compliance and maintaining quality at VBV peaks are important encoder features.
VMAF now supports mobile, HD, and UHD scores. It would be good to have all of those available. And the UHD tests should focus on the UHD VMAF. I also recommend using the harmonic mean (HVMAF) instead of just mean, as that is more sensitive to individual low-quality frames.
It isn't specified if this will be 8-bit or 10-bit encoding. Probably 8-bit?
It isn't specified if this is all SDR or if there is some HDR. Optimal HDR encoding would require different parameters.
1 Mbps might be too high a bottom floor for 1080p24 HEVC. I've gotten interesting results <<1 Mbps in my codec shootout http://forum.doom9.org/showthread.php?t=175776
Personally, I'd love to see some dual-Xeon examples. Performance optimization for multi-socket systems is challenging, but that is probably what most UHD HEVC encoding is done on.
I see that it is possible to nominate a "cloud solution" but there is no data on what that covers/entails.
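
To illustrate the harmonic mean suggestion in the VMAF item above, here is a minimal sketch, not the actual VMAF tooling. The function name `pooled_scores`, the +1 offset (a guard so a zero-scoring frame doesn't collapse the result), and the per-frame numbers are all assumptions made up for illustration.

```python
from statistics import fmean, harmonic_mean

def pooled_scores(frame_scores):
    """Return (arithmetic mean, harmonic mean) of per-frame quality scores.

    The +1 / -1 offset is an assumption of this sketch: it keeps a frame
    that scores exactly 0 from forcing the harmonic mean to 0.
    """
    shifted = [score + 1.0 for score in frame_scores]
    return fmean(frame_scores), harmonic_mean(shifted) - 1.0

# Hypothetical clips with the same arithmetic mean (80):
steady = [80.0] * 10                 # every frame looks the same
dippy = [90.0] * 8 + [40.0] * 2      # mostly better, but two bad frames

for name, scores in (("steady", steady), ("dippy", dippy)):
    mean, hmean = pooled_scores(scores)
    print(f"{name}: mean={mean:.2f} harmonic={hmean:.2f}")
```

Both clips pool to the same plain mean, but the harmonic mean drops for the clip with two weak frames, which is the sensitivity I'm after.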

More broadly, I encourage some rumination and clarity around objective versus subjective metrics. The interaction of tuning and metrics is complex. Many modern encoders have modes that will optimize for PSNR, SSIM, subjective quality, and in some cases VMAF. Which metric should proponents tune for? I would argue that subjective quality is the only meaningful metric, and the others are informational, a way to ballpark subjective quality without having to do subjective evaluation. VMAF itself is an ML system built to predict subjective quality ratings. So, what's the goal here? If it is to see how well each codec can do for each metric, it seems reasonable to let proponents provide settings tuned for each metric, and then to score each tuned version only on the metric it was tuned for. If the goal is to make a "general" encode intended to score well in PSNR, SSIM, VMAF, and subjective ratings, that seems muddled, and will result in parameters that somewhat degrade subjective quality in order to bump scores on the other metrics. It is silly to tune an encoder to produce better PSNR at the expense of worse looking video! I would suggest focusing on subjective ratings, and specifying that is what proponents should tune for. PSNR, SSIM, and VMAF can still be measured and reported; it would be interesting to see how they do and don't correlate with subjective ratings. But someone picking an encoder for real-world use should ONLY care about subjective quality.
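
One way to check how an objective metric does or doesn't track subjective ratings is a rank correlation between the metric and MOS across a set of encodes. This is a rough sketch using Spearman's rank correlation as one reasonable choice, not a prescribed methodology. The helper names and all the numbers are hypothetical.

```python
from statistics import fmean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(metric_scores, mos):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(metric_scores), ranks(mos))

# Made-up numbers for five encodes of one clip, for illustration only.
mos = [1.8, 2.5, 3.1, 3.9, 4.4]
metric_a = [40.0, 55.0, 68.0, 80.0, 91.0]   # orders the encodes the same way as MOS
metric_b = [31.0, 35.5, 34.0, 33.0, 38.0]   # disagrees with MOS on the middle encodes

print("metric_a vs MOS:", round(spearman(metric_a, mos), 3))
print("metric_b vs MOS:", round(spearman(metric_b, mos), 3))
```

A metric that ranks the encodes the same way viewers do scores close to 1; one that viewers disagree with scores lower, even if its raw numbers look fine.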

I argue that the focus on PSNR in codec development and encoder tuning has become more of a distraction than a help. PSNR made sense decades ago as something with some subjective correlation that was cheap to calculate. But it's always been inaccurate, and its subjective quality correlation goes down the more encoders tune for PSNR specifically. This happens with every metric; once it becomes a key differentiator, developers start to develop for the metric in ways that improve the metric more than they improve subjective quality, meaning the metric has less subjective correlation. Encoders with --tune vmaf will have less meaningful VMAF ratings (although tuning for VMAF will certainly be subjectively superior to tuning for PSNR!). We've already seen studies where libaom produces better PSNR but lower subjective ratings than the HM.

Given the huge number of clips that'll be produced, I don't know if it is feasible to get MOS scores for all of them. But doing as many as you can (for perhaps a limited set of sources and bitrates that the objective metrics show to be "interesting") will make your report much more meaningful.

After all, these reports trigger lots of discussion about "my codec is best!", so fronting the best "best" is great for the industry.

__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

Last edited by benwaggoner; 19th February 2019 at 22:02. Reason: Added metrics rumination and plea