View Full Version : GMSD and SSIM Quality Metrics
Iron_Mike
24th March 2019, 10:23
alright, updated my other two posts that show the metrics test results for x265 preset medium vs. fast
https://forum.doom9.org/showpost.php?p=1869215&postcount=138
https://forum.doom9.org/showpost.php?p=1869644&postcount=238
ChaosKing
24th March 2019, 11:44
Nice results.
But lsmash 2017 has AV1 support? That can't be correct. Could you check manually?
ChaosKing
24th March 2019, 12:21
Replace this file in VapourSynth64: https://www.dropbox.com/s/ivtbc83g3yyibj0/seek-test.zip?dl=1
I only checked for vs.Error.
Edit: argh I should have tested it more, still doesn't work.
Edit2: It's working now. lsmash doesn't throw an exception with invalid files...
Wolfberry
24th March 2019, 12:45
New builds for testing (http://forum.doom9.org/showthread.php?t=176198)
ChaosKing
24th March 2019, 13:12
Thx for the new dlls
Update https://www.dropbox.com/s/illkfly7wq72ay7/VapourSynth64PortableTEST.rar?dl=1
- Fixed: spaces in filenames are now allowed
- Added Wolfberry's dlls
Wolfberry
24th March 2019, 13:54
The new ffms2 with dav1d is basically the same as the one labeled 2.31.0.0, the difference is 1 commit in dav1d and openssl.
You can keep only one for testing and results should be the same.
L-SMASH with libvpx has no seeking issues for the vp9 files I tested.
WorBry
24th March 2019, 16:27
Attachments approval takes forever. Better upload it somewhere else, like imgur.com :)
Maybe "Torture File Chamber"? :devil:
Frame Seeker?
'You've Been Framed' ? :p
poisondeathray
24th March 2019, 16:40
Effect of muxer on seeking ?
When you guys get a chance, can you test this Crowd Run encode? It's encoded with the modded yuuki build (default settings, with --keyint 100 --min-keyint 100). It is compiled with lsmash support, so it can export MP4 directly instead of an ES.
https://www.mediafire.com/file/mwij210cqk4sybn/yuuki.mp4/file
https://down.7086.in/x265-Yuuki-Asuna/
1) using default seekmode, seek-test ok here, and metric results consistent using the --min-keyint 100 --keyint 100 settings. Verified that open GOP's present as well, same as other x265 CLI and ffmpeg libx265 builds
2) But I cannot mux a "normal" x265 ES with an independent lsmash muxer and have it work (I tried different x265 binaries). You can't even output an ES from the same yuuki mod, mux it with the lsmash muxer separately, and have it work for seek-test and give consistent metrics with default seekmode. I tried several versions of the lsmash muxer. It must be something to do with muxing as it encodes, or some muxer setting (I played with some settings).
I tried other muxers, mp4box (different settings in the full help) , other x265 builds (in case ffmpeg muxer, which has been known to have problems in the past, was causing issues; or if ffmpeg libx265 was causing issues)
3) Newer MP4Box / GPAC muxers will change the brand, because it auto detects open GOP's . And the "codec id" and compatible brands for the container will be "iso4 (iso4/iso6)" instead of "mp42 (mp42/mp41/isom)" . But it does not affect the seek test results
Stream uses forward prediction - stream CTS offset: 2 frames
OpenGOP detected - adjusting file brand
4) I tried looking at the box dumps both with different mp4box and lsmash muxes, but I cannot figure out what causes the difference for seeking. There are notable timescale differences between muxers for the movie timescale vs. track timescale. But even changing those settings to match the "seeking" yuuki file, it does not work for default seekmode with the other encodes
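For anyone who wants to check the brand change from point 3 without a full box dump, here is a minimal sketch that reads the leading ftyp box of an MP4. It assumes the file starts with a plain 32-bit-size ftyp box (no 64-bit "largesize" handling), and the header bytes below are synthetic, built only to mirror the "iso4 (iso4/iso6)" brands mentioned above. The function name is hypothetical.

```python
import struct

def read_ftyp(data: bytes):
    """Return (major_brand, minor_version, compatible_brands) from a leading ftyp box."""
    # Box header: 32-bit big-endian size followed by a 4-character type.
    size, box_type = struct.unpack(">I4s", data[:8])
    if box_type != b"ftyp":
        raise ValueError("data does not start with an ftyp box")
    # Payload: major brand (4 chars), minor version (uint32), then 4-char compatible brands.
    major, minor = struct.unpack(">4sI", data[8:16])
    compat = [data[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return major.decode("ascii"), minor, compat

# Synthetic 24-byte header for illustration only (not taken from a real mux).
header = struct.pack(">I4s4sI", 24, b"ftyp", b"iso4", 1) + b"iso4" + b"iso6"
major, minor, compat = read_ftyp(header)
print(major, compat)
```

On a real file you would pass the first few dozen bytes, e.g. `open(path, "rb").read(64)`. As noted above, the brand by itself did not change the seek-test results.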
ChaosKing
24th March 2019, 20:19
'You've Been Framed' ? :p
- Frame & Seek
- Frame Me If You Can
Edit
The new lsmash is better:
https://i.imgur.com/PL8dYXo.png
WorBry
26th March 2019, 17:13
more stats - x265 - preset: medium vs. fast
Butteraugli / MDSI per CRF
https://i.imgur.com/6QVs6Ru.png
VMAF / SSIM / GMSD per bitrate (Mbps)
https://i.imgur.com/PM5RWBo.png
Butteraugli / MDSI per bitrate (Mbps)
https://i.imgur.com/EzIjj93.png
So, little, if any difference in the bitrate-matched metric scores, for this particular source at least.
Couple of points to note:
I see the muvsfunc SSIM and GMSD tests were run with no downsampling. VMAF does apply downsampling in deriving the component SSIM and MS-SSIM scores. That said in my initial tests with Crowd Run the VMAF derived SSIM scores were still higher than muvsfunc SSIM with downsampling applied - I'm still not sure why.
The MDSI tests on the other hand were run with downsampling applied (down_scale=2). Any particular reason for that ?
For graphic presentation of the results obtained over a wide bitrate range I found it more informative to plot the bitrate (x axis) as log base 2 - gives better separation of the scores at the lower bitrates. I think it's helpful to include the actual data points also.
Just something to bear in mind if you are going to be posting more of these tests.
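A minimal sketch of the log base 2 idea, using made-up bitrates (not the values from the tests above): each doubling of bitrate becomes one equal step on the x axis, which is what spreads out the low-bitrate points.

```python
import math

def log2_axis(bitrates_mbps):
    """Map bitrates to log base 2 positions, so each doubling is one unit apart."""
    return [math.log2(b) for b in bitrates_mbps]

# Hypothetical bitrates in Mbps, chosen only to illustrate the spacing.
bitrates = [1, 2, 4, 8, 16, 32]
positions = log2_axis(bitrates)
print(positions)
```

On a linear axis the first three points here would be crowded into the first tenth of the plot; on the log2 axis they are evenly spaced. With Matplotlib, something like `ax.set_xscale("log", base=2)` should give the same spacing while keeping the tick labels in Mbps.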
@Wolfberry. Any chance you could add an implementation of the MS-SSIM metric to muvsfunc ?
https://ece.uwaterloo.ca/~z70wang/publications/msssim.pdf
ChaosKing
26th March 2019, 18:35
The vmaf plugin can also calc "psnr, ssim and ms_ssim" https://github.com/HomeOfVapourSynthEvolution/VapourSynth-VMAF#usage
WorBry
26th March 2019, 19:14
I know, but for some reason VMAF derives markedly higher aggregate (elementary) SSIM scores than muvsfunc SSIM (downsample=True).
I speculated earlier in the thread whether that might stem from differences in the size of the gauss kernel ?
https://forum.doom9.org/showthread.php?p=1865872#post1865872
Or is VMAF adding some other weighting in the SSIM and MS-SSIM calculations ?
Given these unknowns, I just think it would be useful to have a muvsfunc version of MS-SSIM that is implemented in like manner to (SS-) SSIM.
Iron_Mike
27th March 2019, 04:23
The MDSI tests on the other hand were run with downsampling applied (Downsample=2). Any particular reason for that ?
I thought Downsample=2 means Downsample=False... the MDSI implementation is more than confusing, should be 0/1 for False/True...
can somebody clarify ?
WorBry
27th March 2019, 05:21
I did clarify. Look at the MDSI function in the muvsfunc script and the Zoptilib default for MDSI.
.... should be 0/1 for False/True...
No it shouldn't. With the MDSI metric, the 'down_scale' parameter specifies the factor of downsampling (downscaling). Quote:
down_scale: (int) Factor of downsampling before quality assessment.
Default is 1.
So down_scale=1 (default) applies no downsampling and down_scale=2 downsamples by a factor of two.
You only need to run a test to see that down_scale=1 gives a lower score (i.e. higher figure) than down_scale=2.
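To illustrate what "factor of downsampling" means (as opposed to a True/False switch), here is a rough sketch using a simple box average on a tiny plane. This is only an assumption for illustration; the resampler muvsfunc actually uses for MDSI may well be different, and the function name is hypothetical.

```python
def box_downsample(plane, factor):
    """Downsample a 2D list of samples by an integer factor using a box average."""
    out_h = len(plane) // factor
    out_w = len(plane[0]) // factor
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Average the factor x factor block that maps to this output sample.
            block = [plane[y * factor + dy][x * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

plane = [[0, 2, 4, 6],
         [2, 4, 6, 8],
         [10, 10, 20, 20],
         [10, 10, 20, 20]]

same = box_downsample(plane, 1)   # factor 1: dimensions unchanged, no downsampling
half = box_downsample(plane, 2)   # factor 2: each dimension halved
print(half)
```

With factor 1 every block is a single sample, so nothing is averaged away; with factor 2 fine detail is smoothed before the metric sees it, which is consistent with down_scale=2 reporting less distortion.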
WorBry
27th March 2019, 05:34
Thank-you HolyWu. That would explain it.
Iron_Mike
29th March 2019, 04:21
alright, here are some more tests I ran on these IQ metrics...
this is from a 16-bit float EXR source (15 secs, 360 frames), made nine (9) x265 encodes, all CRF 10, in these formats (using Wolfberry ffmpeg build): yuv420p, yuv422p, yuv444p, yuv420p10le, yuv422p10le, yuv444p10le, yuv420p12le, yuv422p12le, yuv444p12le
A few interesting results, but maybe some folks here can chime in on why things are the way they are...
Pixel Format vs. Bitrate
https://i.imgur.com/g6pObHL.png
--> why do the 422 variants result in the highest bit rates across 8/10/12 bit ? and why does yuv420p have a higher bit rate than yuv444p12le ?
Libvmaf via FFMPEG (converts to yuv444p10le internally)
https://i.imgur.com/lK1bCMg.png
Libvmaf via VS (requires conversion to yuv444p10, if higher format)
https://i.imgur.com/o5b4UNC.png
--> while the general VMAF score trend/indication is the same in ffmpeg/VS, the SSIM/MS-SSIM results differ... VS ones are flat across all formats, ffmpeg 8-bit has higher score than 10/12-bit... (?)
GMSD / SSIM (muvsfunc)
https://i.imgur.com/XvTbjEQ.png
--> flat across 8-bit, and then finds no difference across the 10/12 bit formats...
MDSI / Butteraugli (requires 8-bit files)
https://i.imgur.com/DZMJFUe.png
--> Butteraugli does not seem to find a difference in any of the encodes... MDSI, from 10-bit on no more differences...
WorBry
29th March 2019, 17:14
Given the source....
this is from a 16-bit float EXR source
...the VS metric results don't surprise me.
Libvmaf via FFMPEG (converts to yuv444p10le internally)
https://i.imgur.com/lK1bCMg.png
Libvmaf via VS (requires conversion to yuv444p10, if higher format)
https://i.imgur.com/o5b4UNC.png
--> while the general VMAF score trend/indication is the same in ffmpeg/VS, the SSIM/MS-SSIM results differ... VS ones are flat across all formats, ffmpeg 8-bit has higher score than 10/12-bit... (?)
Looks like there is a slight increase in the VS libvmaf SSIM/MS-SSIM scores going from 8 to 10-bit - you'd likely see that more clearly with the SSIM/MS-SSIM scores plotted on a secondary Y axis that expands out the 99.75 - 100 range.
The drop in FFMPEG libvmaf SSIM/MS-SSIM scores is interesting though. Perhaps HolyWu has a perspective on that ?
Libvmaf via FFMPEG (converts to yuv444p10le internally)....
Libvmaf via VS (requires conversion to yuv444p10, if higher format)
I thought libvmaf internally converts to 8bit for all of component (elementary) metrics which are applied to the luma plane only. Correct or no?
Iron_Mike
29th March 2019, 22:31
I thought libvmaf internally converts to 8bit for all of component (elementary) metrics which are applied to the luma plane only. Correct or no?
not sure.
ffmpeg prints yuv444p10 as the format when calculating via libvmaf - u can feed it anything and it will convert internally.
VS VMAF allows input of up to yuv444p10 (u need to convert to that if source is higher), not sure what it does internally then.
I did run one test in which I converted everything to 8-bit before feeding it to VS VMAF and the SSIM and MS-SSIM lines on the plot were absolutely straight across the 8/10/12 bit formats.
WorBry
29th March 2019, 22:38
I did run one test in which I converted everything to 8-bit before feeding it to VS VMAF and the SSIM and MS-SSIM lines on the plot were absolutely straight across the 8/10/12 bit formats.
Presumably in that case you converted the EXR source/reference to 8bit also ?
WorBry
29th March 2019, 22:52
I thought libvmaf internally converts to 8bit for all of component (elementary) metrics, which are applied to the luma plane only. Correct or no?
But not in the ffmpeg libvmaf implementation, according to HolyWu here:
...because FFmpeg filter doesn't normalize 10 bit to 8 bit like what Netflix does for calculation, hence the inconsistency.
https://forum.doom9.org/showthread.php?p=1868436#post1868436
...same thread :
do you have a link to where they state that they downsample the master to 8bit ?
Netflix doesn't explicitly mention that in the documentation. It's simply done this way in their source code.
https://forum.doom9.org/showthread.php?p=1868966#post1868966
Iron_Mike
29th March 2019, 23:08
Presumably in that case you converted the EXR source/reference to 8bit also ?
yes, in that one test that I mentioned (which is not on the pics posted) everything higher than yuv444p was converted to yuv444p (including the 16-bit EXR reference) before feeding it to VS VMAF - if the encode that is evaluated was lower format than yuv444p (e.g. yuv420p, yuv422p), it was converted to yuv444p before feeding it to VS VMAF - as VMAF requires matching pixel formats
--> result was a flat line for libvmaf SSIM / MS-SSIM
the results on the posted pic followed the same workflow, but the format fed to VS VMAF was yuv444p10 (highest supported)
--> result on the posted pic is almost the same, tiny bit lower score in 8bit formats.
So, any idea why the bit rate for 422 formats is the highest across 8/10/12 bit ? and why do 12-bit formats generally have lower bit rates than 8/10 bit ?
it can also be seen that from 10-bit on, these metrics don't pick up differences any longer... And Butteraugli sees no difference at all...
Iron_Mike
29th March 2019, 23:14
But not in the ffmpeg libvmaf implementation, according to HolyWu here:
https://forum.doom9.org/showthread.php?p=1868436#post1868436
...same thread :
https://forum.doom9.org/showthread.php?p=1868966#post1868966
right, but that does not explain why FFMPEG libvmaf SSIM/MS-SSIM for 8bit has higher scores than for 10/12 bit
WorBry
29th March 2019, 23:53
That's what perplexes me also.
Perhaps it would be better to address this point in the VMAF thread, as this is really a continuation of your posts there.
I'm not set-up for ffmpeg libvmaf and so can't verify myself. But just to be clear, when you say....
ffmpeg prints yuv444p10 as the format when calculating via libvmaf - u can feed it anything and it will convert internally.
....does that include the ffmpeg vmaf tests with the 8-bit x265 formats also (i.e. they are up-sampled to yuv444p10 for the vmaf calcs) or does that only apply to the 10/12 bit formats ?
So, any idea why the bit rate for 422 formats is the highest across 8/10/12 bit ? and why do 12-bit formats generally have lower bit rates than 8/10 bit ?
Can't give a precise answer off the top of my head. 422 has more chroma information than 420 and so needs more bits, but why 444 has the lowest bitrate (across 8/10/12 bit depth) at the same CRF level, I'm not sure. Perhaps others could weigh in on that.
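For reference, the raw (uncompressed) sample counts per frame for the three chroma layouts can be worked out as below. This is just arithmetic on the plane sizes, assuming a 1920x1080 frame for illustration; it says nothing about how x265 rate control behaves at a given CRF.

```python
# Horizontal and vertical chroma subsampling divisors for each layout.
SUBSAMPLING = {"420": (2, 2), "422": (2, 1), "444": (1, 1)}

def samples_per_frame(width, height, chroma):
    """Total samples per frame: one luma plane plus two chroma planes."""
    sub_x, sub_y = SUBSAMPLING[chroma]
    luma = width * height
    chroma_plane = (width // sub_x) * (height // sub_y)
    return luma + 2 * chroma_plane

def raw_bits_per_frame(width, height, chroma, bit_depth):
    """Uncompressed size of one frame in bits."""
    return samples_per_frame(width, height, chroma) * bit_depth

counts = {c: samples_per_frame(1920, 1080, c) for c in ("420", "422", "444")}
print(counts)
```

So on raw data alone 444 carries twice the samples of 420, and 422 sits in between. The encoded bitrates reported above do not follow that ordering, so the explanation has to lie in the encoder rather than in the amount of source data.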
Iron_Mike
30th March 2019, 02:56
That's what perplexes me also.
Perhaps it would be better to address this point in the VMAF thread, as this is really a continuation of your posts there.
posted here (https://forum.doom9.org/showpost.php?p=1870364&postcount=96)
I'm not set-up for ffmpeg libvmaf and so can't verify myself. But just to be clear, when you say....
....does that include the ffmpeg vmaf tests with the 8-bit x265 formats also (i.e. they are up-sampled to yuv444p10 for the vmaf calcs) or does that only apply to the 10/12 bit formats ?
(maybe I'm misinterpreting the output) it seems ffmpeg internally evaluates in the same format as the distorted input file, see here...
ref RGB48le, dist yuv420p
https://i.imgur.com/wyNdEy6.png
ref RGB48le, dist yuv420p10le
https://i.imgur.com/Ak3Arqj.png
ref RGB48le, dist yuv444p12le - as you can see it down-converts to yuv444p10 as that is the max supported format
https://i.imgur.com/GB7bY82.png
as stated before, the problem w/ this approach is (IF the ref and dist formats differ), that you lose precision converting the ref down to whatever the format of the dist is, distorting results... it would be a better approach to up-convert the dist to the format of the ref, as you should never ever change the ref, b/c that compromises the whole comparison...
in case of crowdrun 1080p that was never an issue as the ref was already bare bottom yuv420p...
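The asymmetry between the two directions can be shown with plain integer code values. The sketch below assumes the simplest possible conversion (bit shifts, no rounding or dithering), which is not what ffmpeg or fmtc do by default, but the information-loss argument is the same.

```python
def down_convert(value, src_bits, dst_bits):
    """Reduce bit depth by dropping the low bits (truncation, no dithering)."""
    return value >> (src_bits - dst_bits)

def up_convert(value, src_bits, dst_bits):
    """Increase bit depth by shifting left (pads the low bits with zeros)."""
    return value << (dst_bits - src_bits)

# Case 1: the 10-bit ref is converted down to the 8-bit dist format (and back, to compare).
ref_10bit = list(range(1024))
ref_roundtrip = [up_convert(down_convert(v, 10, 8), 8, 10) for v in ref_10bit]
ref_changed = sum(1 for a, b in zip(ref_10bit, ref_roundtrip) if a != b)

# Case 2: the 8-bit dist is converted up to the 10-bit ref format (and back, to compare).
dist_8bit = list(range(256))
dist_roundtrip = [down_convert(up_convert(v, 8, 10), 10, 8) for v in dist_8bit]
dist_changed = sum(1 for a, b in zip(dist_8bit, dist_roundtrip) if a != b)

print(ref_changed, dist_changed)
```

Down-converting the ref alters three out of every four 10-bit code values, while up-converting the dist is fully reversible.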
WorBry
30th March 2019, 03:19
Well, I don't know. Probably best to see what HolyWu has to say about it.
poisondeathray
30th March 2019, 03:26
as stated before, the problem w/ this approach is (IF the ref and dist formats differ), that you lose precision converting the ref down to whatever the format of the dist is, distorting results... it would be a better approach to up-convert the dist to the format of the ref, as you should never ever change the ref, b/c that compromises the whole comparison...
in case of crowdrun 1080p that was never an issue as the ref was already bare bottom yuv420p...
Another problem is vmaf does not support 16bit RGB , and ffmpeg does not support EXR float (only 16bit int) .
You are going to have compromises if you want to use vmaf either way - you have to change the ref, no way around it
You can make a request to the vmaf developers to extend support for other pixel types
Iron_Mike
30th March 2019, 03:38
Another problem is vmaf does not support 16bit RGB , and ffmpeg does not support EXR float (only 16bit int) .
You are going to have compromises if you want to use vmaf either way - you have to change the ref, no way around it
yeah, unfortunately.
I convert 16bit float RGB to 32-bit YUV 444 (to keep precision in VS for some plugins that only support 16bit int) - and that conversion is a small compromise at best IMO compared to converting 16-bit float RGB to yuv420p to then be used as a ref in VMAF comparison...
these metrics are just indications anyways, but the approach of distorting the ref that much is flawed...
You can make a request to the vmaf developers to extend support for other pixel types
the problem is that they are converting the ref down to the dist, instead of converting the dist up to the ref - independent of supported formats. if your ref is yuv444p and dist is yuv420p (both supported), it makes no sense to down-convert the ref to yuv420p to then compare...
poisondeathray
30th March 2019, 03:48
I don't like it either.
But vmaf is for a specific scenario, viewing distance etc... Streaming web video delivery typically won't be 12bit or 16bit ...
And typically displays won't be 10bit either. Sure, they are becoming more common, but >99% of displays will still be 8bit or lower. vmaf is all about the perception of the video from the end user point of view
As suggested earlier, you can control the conversions and algorithms, instead of letting "auto" conversions do it for you. For example, if 10bit444 was the max supported pixel type, you could have controlled the RGB48le conversion to 10bit444, and the upconversion of yuv420p to 10bit444, in that 1st test. It would make me feel better that it's not as low a downconversion for the source.
WorBry
30th March 2019, 03:50
Bear in mind also that the muvsfunc MDSI and Butteraugli tests require converting both the reference and test clip to RGB, and RGB24 specifically in the case of Butteraugli.
poisondeathray
30th March 2019, 04:06
I thought libvmaf internally converts to 8bit for all of component (elementary) metrics which are applied to the luma plane only. Correct or no?
HolyWu in the vmaf thread said the Netflix implementation converts to 8bit in their source code
And this suggests chroma isn't looked at
Currently, VMAF does not use chroma features, and does not fully express the perceptual advantage of HDR / WCG videos.
https://medium.com/netflix-techblog/vmaf-the-journey-continues-44b51ee9ed12
For the other Iron_Mike tests, they were Y only , right ? For example, not SSIM-U , or SSIM-V . Because you would expect 444 to score better than 420 in terms of color in general . Or what was the point of 444,422,420 testing ?
I don't think the netflix implementation can , but the muvsfunc can specify the plane processed
WorBry
30th March 2019, 04:37
I thought libvmaf internally converts to 8bit for all of component (elementary) metrics which are applied to the luma plane only. Correct or no?
HolyWu in the vmaf thread said the Netflix implementation converts to 8bit in their source code
Yes, I made reference to that subsequently:
https://forum.doom9.org/showthread.php?p=1870341#post1870341
But he also said that the 'ffmpeg filter' (which in context I assume referred to the ffmpeg libvmaf filter) doesn't normalize 10bit to 8bit for the calculations.
And this suggests chroma isn't looked at
https://medium.com/netflix-techblog/vmaf-the-journey-continues-44b51ee9ed12
Was sure I'd seen that stated somewhere.
For the other Iron_Mike tests, they were Y only , right ? For example, not SSIM-U , or SSIM-V . Because you would expect 444 to score better than 420 in terms of color in general . Or what was the point of 444,422,420 testing ?
He did test MDSI and Butteraugli also.
Seems ironic that Butteraugli requires RGB24 yet appears to be quite unresponsive to subtle chroma/chromaticity distortions - as seen in the Crowd Run metric tests where I was comparing different 10bit 422 'intermediate formats':
https://forum.doom9.org/showthread.php?p=1868159#post1868159
..and the 3 posts that followed.
I get the impression that Butteraugli is primarily tuned for JPEG compression artifacts. Personally, I don't think it brings any 'added value' in this context, so I've stopped testing it. MDSI on the other hand is very sensitive to chroma distortions.
Iron_Mike
30th March 2019, 05:30
I don't like it either.
But vmaf is for a specific scenario, viewing distance etc... Streaming web video delivery typically won't be 12bit or 16bit ...
And typically displays won't be 10bit either. Sure, they are becoming more common, but >99% of displays will still be 8bit or lower. vmaf is all about the perception of the video from the end user point of view
display tech does not matter (and will change anyways), it's about the approach in general. as I wrote before: when you deliver to NF, Amazon, Hulu, Vimeo or Youtube - you want to make sure you deliver the highest quality file that they allow you to deliver (some limit you in upload size, NF doesn't). if you deliver in 8-bit then you must have shot on a camera from the 90's, or a consumer camera - which is already amateur, meaning NOT a professional production. But even then, your grade will be at least viewed on 10-bit equipment, if it's not then you have no clue what you're doing. Bare bones minimum export out of the grading app is 10-bit, standard is 16-bit float master export b/c the grade and VFX etc are at 32/64 bit internally...
All of these streaming services then take your ref master (that you deliver to them), and make X encodes from it for different resolutions, bit rates etc.... THAT is the process. so, ultimately for metrics you want to compare to the ref master, which is almost never in a professional production 8-bit - but more importantly the bit-depth and the chroma subsampling should not matter in the comparison - u just up-convert the dist to the ref, it won't gain or lose precision.
so with a proper approach, u can compare RGB48 to yuv420p, or yuv422p10 to yuv444p10 - it don't matter.
I understand some of these metrics are just starting out, but the fact that VMAF distorts the ref to the dist format in order to compare is very, very flawed...
As suggested earlier, you can control the conversions and algorithms, instead of letting "auto" conversions do it for you. For example, if 10bit444 was the max supported pixel type, you could have controlled the RGB48le conversion to 10bit444, and the upconversion of yuv420p to 10bit444, in that 1st test. It would make me feel better that it's not as low a downconversion for the source.
I did do that - I mentioned that. Since the ref master was RGB48le, which needed to be converted to yuv444p10le (max supported format by VMAF), every encode was conformed to yuv444p10le as well.
The problem is if VS VMAF internally then converts everything to 8-bit anyways (and the way I interpreted it, Holy Wu confirmed that), then the question is shouldn't I just bring everything down to yuv444p... ? (which is why I made that one test mentioned on the last page, but then the results were completely useless)
there are two problems with this approach of 8-bit in VS VMAF (if res src is higher than 8-bit):
(1) if I have to convert RGB48le to yuv444p in order to run a metric test, the scores will be very close together for a variety of formats which makes all of this meaningless... look at the Butteraugli results, it requires 8-bit, which I did provide by converting everything down to it... look at the results... there is ZERO difference in the score between yuv420p and yuv444p12le... absolutely idiotic... I can easily see differences by eye (as you can imagine, 12bit 444 vs. 8bit 420)
same for VS VMAF SSIM/MS-SSIM... the more you compromise the ref master (by converting), the more you contaminate the test
(2) it is just very confusing for VS VMAF to state "max supported format is yuv444p10" if you ultimately internally then convert anyways to 8-bit... then just state the exact format you're using internally, so that users can control the conversion to it
WorBry
30th March 2019, 05:50
I see HolyWu responded to your query in the VMAF thread:
https://forum.doom9.org/showthread.php?p=1870370#post1870370
I have to confess, I'm now thoroughly confused re: the behaviors of VS VMAF vs ffmpeg VMAF, and so will stick with the former.
poisondeathray
30th March 2019, 06:29
display tech does not matter (and will change anyways), it's about the approach in general. as I wrote before: when you deliver to NF, Amazon, Hulu, Vimeo or Youtube - you want to make sure you deliver the highest quality file that they allow you to deliver (some limit you in upload size, NF doesn't). if you deliver in 8-bit then you must have shot on a camera from the 90's, or a consumer camera - which is already amateur, meaning NOT a professional production. But even then, your grade will be at least viewed on 10-bit equipment, if it's not then you have no clue what you're doing. Bare bones minimum export out of the grading app is 10-bit, standard is 16-bit float master export b/c the grade and VFX etc are at 32/64 bit internally...
Yes, I agree in general. But you don't need a metric to tell you what is the "highest" quality for the submission according to the guidelines (at least I hope you don't...) .
But you were referring to VMAF - You're talking about a specific implementation of a specific metric that deals with specific viewing conditions on current displays, not ones 20 years from now. That's what VMAF is about now. Maybe VMAF in 2039 will be different. It's not about acquisition formats or any post production or any of that. They didn't train their models to account for whether or not it was a full 32bpc float pipeline. Irrelevant to their models and testing and viewing conditions. Remember, VMAF wasn't trained on master formats, so it's not valid for your master submission format anyway. Maybe they can expand it one day, add more models and extensions.
All of these streaming services then take your ref master (that you deliver to them), and make X encodes from it for different resolutions, bit rates etc.... THAT is the process. so, ultimately for metrics you want to compare to the ref master, which is almost never in a professional production 8-bit - but more importantly the bit-depth and the chroma subsampling should not matter in the comparison - u just up-convert the dist to the ref, it won't gain or lose precision.
so with a proper approach, u can compare RGB48 to yuv420p, or yuv422p10 to yuv444p10 - it don't matter.
Yes, the proper approach is to convert to the uniform format, as stated way back. So you can precisely control all the variables about how the conversions are done , what algorithms are used. Many of the earlier netflix ffmpeg issues in the github tracker were because of using the wrong resizing flags.
Ideally, if your "master" was 16bit float, then you'd use a metric that could measure 16bit float, and you upconvert your other distribution versions to measure at the reference format. But we don't really live in an ideal world. I don't know of any that can properly measure float formats
I understand some of these metrics are just starting out, but the fact that VMAF distorts the ref to the dist format in order to compare is very, very flawed...
Yes . I think that's this newer patch for ffmpeg2vmaf's "bandaid" solution, because you needed a common format . The older implementation would throw an error. You had to do the conversion
The problem is if VS VMAF internally then converts everything to 8-bit anyways (and the way I interpreted it, Holy Wu confirmed that), then the question is shouldn't I just bring everything down to yuv444p... ? (which is why I made that one test mentioned on the last page, but then the results were completely useless)
But apparently normalized values are stored in FP...
https://forum.doom9.org/showthread.php?p=1870370#post1870370
vmafossexec (the CLI of libvmaf) also does this normalization for 10-bit input. The normalized values are stored in floating-point, hence you needn't worry about precision lost.
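A small sketch of why normalizing to the 8-bit range in floating point need not lose precision. It assumes the normalization is a plain division by 2^(bit_depth - 8), i.e. by 4 for 10-bit input; the quote above only says the normalized values are stored in floating point, so the exact operation is an assumption here.

```python
def normalize_to_8bit_scale(value, bit_depth):
    """Scale an integer sample to the 8-bit range, keeping the fraction as a float."""
    return value / float(1 << (bit_depth - 8))

# Two adjacent 10-bit code values.
a = normalize_to_8bit_scale(512, 10)
b = normalize_to_8bit_scale(513, 10)

# The same two values after a true integer conversion to 8-bit (truncation).
a_int = 512 >> 2
b_int = 513 >> 2

print(a, b, a_int, b_int)
```

In float the two samples stay distinct (128.0 vs 128.25), whereas an integer 8-bit conversion collapses both to 128.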
same for VS VMAF SSIM/MS-SSIM... the more you compromise the ref master (by converting), the more you contaminate the test
I agree ; ideally the reference pixel format should be what you're measuring against.
(2) it is just very confusing for VS VMAF to state "max supported format is yuv444p10" if you ultimately internally then convert anyways to 8-bit... then just state the exact format your using internally, so that users can control the conversion to it
Still not clear on that because of the FP comments
I did some earlier tests and there were differences between a yuv420p10 and its yuv420p8 derivative. But now that I think about it, it might have been the ffmpeg vmaf implementation which does not downconvert to 8bit. I'll have to check later
Iron_Mike
30th March 2019, 08:22
Yes, I agree in general. But you don't need a metric to tell you what is the "highest" quality for the submission according to the guidelines (at least I hope you don't...) .
highest quality will be lossless 16-bit float EXR, but you won't provide that to these folks... because you would provide terabytes of data...
so, you want a master codec that brings down file size as much as possible and still looks visually lossless... along the lines of WorBry's recent codec test w/ Prores etc...
then, sometimes they restrict you in upload size etc etc, hence the need to use a codec/bit depth/chroma combo that performs as good as possible given the allowed bit rate...
and then - and I can only stress this - they will decode your delivery and re-encode... funny things can happen there...
in the CRF tests w/ crowdrun we saw that CRF 16 (or better) was the point that got you there... but that was 8-bit 420 on all stages...
in my last tests, one would assume from 10 bit on there is no difference (as the curves flat out), but I think this is only b/c most of these metrics require down-conversion to 8-bit...
which brings me to another point, maybe I could improve the way I down-convert:
# RGB48le conversion to YUV 32-bit 444
vid_ref = core.fmtc.matrix(clip=vid_ref, mat=matrix, col_fam=vs.YUV, bits=32)
# Conversion to 8-bit
vid_ref = core.fmtc.bitdepth(clip=vid_ref, bits=8)
# Matching another video format
vid_enc = core.resize.Bicubic(vid_enc, format=vid_ref.format.id, matrix_s=matrix)
anything wrong with that ?
WorBry
30th March 2019, 16:48
No, libvmaf doesn't convert the format. It's FFmpeg automatically doing the format conversion when the format of the two input clips differs, before sending frames into libvmaf. Because FFmpeg puts the distorted clip as the first argument and the reference clip as the second argument, hence FFmpeg converts the format of the second clip to that of the first clip when they differ.
Same as the ffmpeg SSIM and PSNR filters then.
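Given that behaviour, one way to avoid the automatic conversion of the reference is to request a common format for both inputs explicitly in the filter graph. The sketch below only builds the command line; the file names are hypothetical, yuv444p10le is picked because it was reported above as the highest format accepted, and the scaler still uses its default settings unless you add your own scale options.

```python
def build_vmaf_command(distorted, reference, pix_fmt="yuv444p10le"):
    """Build an ffmpeg command that converts both inputs to pix_fmt before libvmaf."""
    graph = (
        f"[0:v]format={pix_fmt}[dist];"   # first input: distorted clip
        f"[1:v]format={pix_fmt}[ref];"    # second input: reference clip
        "[dist][ref]libvmaf"              # libvmaf takes distorted first, reference second
    )
    return ["ffmpeg", "-i", distorted, "-i", reference,
            "-lavfi", graph, "-f", "null", "-"]

# Hypothetical file names, for illustration only.
cmd = build_vmaf_command("encode_yuv420p.mkv", "master_rgb48.mov")
print(" ".join(cmd))
```

With both branches already in the same format, FFmpeg has no mismatch to resolve, so the reference is never silently taken down to the format of the distorted clip.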
poisondeathray
30th March 2019, 18:10
so, you want a master codec that brings down file size as much as possible and still looks visually lossless... along the lines of WorBry's recent codec test w/ Prores etc...
Applying a test designed for lower delivery bitrates to qualify a high bitrate "master" doesn't seem right either.
Remember , the current VMAF was trained from CRF 22 to CRF 28 (they don't say whether that was x264 or x265, 8bit or 10bit) . But either way, this is certainly not "master" quality.
(Also , beware the quantizer scaling is different between x264, x265, 8bit and 10bit . The numbers don't mean the same thing)
in the CRF tests w/ crowdrun we saw that CRF 16 (or better) was the point that got you there... but that was 8-bit 420 on all stages...
in my last tests, one would assume from 10 bit on there is no difference (as the curves flat out), but I think this is only b/c most of these metrics require down-conversion to 8-bit...
But clearly the "results" are problematic . You can "see" the difference, say, at CRF 10 than CRF 16 . You can see more details, fewer artifacts. Or at least you should be able to if you are in that 4% . Production/Post-production people should be able to. You're not "Joe Public" . That early plateau indicates VMAF is not suitable for assessing "visually lossless" scenarios
I think it's more that the test is invalid for that bitrate range ~ CRF10 .
Iron_Mike
30th March 2019, 22:12
But clearly the "results" are problematic . You can "see" the difference between, say, CRF 10 and CRF 16 . You can see more details, fewer artifacts. Or at least you should be able to if you are in that 4% . Production/Post-production people should be able to. You're not "Joe Public" . That early plateau indicates VMAF is not suitable for assessing "visually lossless" scenarios
I think it's more that the test is invalid for that bitrate range ~ CRF10 .
yeah, I agree. hence u need to run various of these metrics and then look at trends/indication.
Also , beware the quantizer scaling is different between x264, x265, 8bit and 10bit . The numbers don't mean the same thing
can u elaborate on this ?
poisondeathray
30th March 2019, 22:51
can u elaborate on this ?
crf "x" is really a rate control method, but it can be used (often abused) as a rough estimate of "quality"
But the value "x" doesn't mean the same thing in 8 bit, 10bit , x264 or x265 . eg. You can't say CRF 16 with x264 8bit produces roughly the same thing as CRF 16 x265 8bit . It's really just an arbitrary number
A hint is the default value is 23 for x264 8bit but 28 for x265 8bit . The scale is 0-51 in 8bit, 0-63 in x264 10bit .
"0" is lossless for x264 (both 8,10bit) , but a qp of "4" is lossless for x265
In the past they've rescaled the crf calculations several times too. There were (-) crf values allowed at one point for higher bitdepths .
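Some background on why the raw numbers aren't comparable: in H.264 and HEVC the quantizer step size roughly doubles every 6 QP and is 1 at QP 4, and CRF is a rate-control target layered on top of that scale, so the QP actually used varies per frame and per encoder. A rough sketch of the QP-to-step relationship (the usual textbook approximation, not code taken from x264 or x265):

```python
def qstep(qp):
    """Approximate H.264/HEVC quantizer step size for a given QP.

    Uses the common approximation Qstep ~= 2 ** ((QP - 4) / 6);
    the standards define the exact values via tables.
    """
    return 2.0 ** ((qp - 4) / 6.0)

# Every +6 in QP doubles the step size
for qp in (4, 10, 16, 22, 28):
    print(qp, round(qstep(qp), 2))
```

This only describes the quantizer scale. It says nothing about which QP a given CRF value ends up producing, which is the part that differs between encoders and bit depths.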
Iron_Mike
30th March 2019, 23:42
crf "x" is really a rate control method, but it can be used (often abused) as a rough estimate of "quality"
But the value "x" doesn't mean the same thing in 8 bit, 10bit , x264 or x265 . eg. You can't say CRF 16 with x264 8bit produces roughly the same thing as CRF 16 x265 8bit . It's really just an arbitrary number
A hint is the default value is 23 for x264 8bit but 28 for x265 8bit . The scale is 0-51 in 8bit, 0-63 in x264 10bit .
"0" is lossless for x264 (both 8,10bit) , but a qp of "4" is lossless for x265
In the past they've rescaled the crf calculations several times too. There were (-) crf values allowed at one point for higher bitdepths .
makes sense. so what's ur take on (re my last pics uploaded here) that the bit rates for 12-bit were generally the lowest, and that 422 flavor in all bit-depths had the highest bit rate ?
the higher pixel formats did outscore 8-bit (in the metrics) but the bit rates were lower... I expected bit rates trends to be roughly the same for a given chroma flavor 420|422|444...
poisondeathray
31st March 2019, 00:03
so what's ur take on (re my last pics uploaded here) that the bit rates for 12-bit were generally the lowest, and that 422 flavor in all bit-depths had the highest bit rate ?
I don't know .
Interesting trend , but not sure it means anything . You have to be careful how you interpret anything done with CRF . As soon as you change 1 thing, all bets are off
the higher bit rates did outscore 8-bit (in the metrics) but the bit rates were lower... I expected bit rates trends to be roughly the same for a given chroma flavor 420|422|444...
Yes, I expect higher bitrates ,in general, to outscore lower bitrates
But I don't have expectation of bitrate at a given CRF using different settings (different pixel format is a different setting) . In the end, CRF is just a rate control method.
Iron_Mike
31st March 2019, 04:55
Yes, I expect higher bitrates ,in general, to outscore lower bitrates
But I don't have expectation of bitrate at a given CRF using different settings (different pixel format is a different setting) . In the end, CRF is just a rate control method.
sorry, typo on my end:
I meant to say the higher pixel formats (which ended up having lower bit rates than 8-bit) still outscored 8-bit in these metrics...
I don't care about the CRF setting at all, but the bit rate does indicate how much data is in the stream (and all encodes are using the same format here: x265), and lower bit rates did outscore higher bit rates...
so that is interesting.
and then the 422 trend - independent of bit-depth - having the highest bit rate...
maybe this has to do w/ how ffmpeg specifically encodes files...
poisondeathray
31st March 2019, 06:09
I meant to say the higher pixel formats (which ended up having lower bit rates than 8-bit) still outscored 8-bit in these metrics...
I don't care about the CRF setting at all, but the bit rate does indicate how much data is in the stream (and all encodes are using the same format here: x265), and lower bit rates did outscore higher bit rates...
so that is interesting.
In general, you'd expect 10bit encodes to score higher, against 10bit (or even scaled down against 8bit) in general for typical or high bitrate ranges. The compression efficiency and precision of 10bit vs 8bit is higher . This is very well established , many tests , some academic papers. It's the SSIM/MS-SSIM libvmaf dip that seems out of order .
At very low bitrate ranges, however, it reverses. 10bit will cost more, produce lower quality. It's worse with x265 than x264. (or another way to put it is x264 benefits more from 10bit vs. 8bit advantage)
12bit is less researched . And experience /usage is very low. I think the additional gains are low.
and then the 422 trend - independent of bit-depth - having the highest bit rate...
maybe this has to do w/ how ffmpeg specifically encodes files...
Not sure, it's only one test series, one source .
WorBry
31st March 2019, 16:10
and then the 422 trend - independent of bit-depth - having the highest bit rate...
maybe this has to do w/ how ffmpeg specifically encodes files...
Crowd Run shows the same pattern. Here is the 1080/50p 10-bit 422 version I used in the ProRes tests, converted to x265 420, 422 and 444 (8 and 10bit) at CRF=10.
Bitrates:
8bit 4:2:0 175 Mbps
8bit 4:2:2 214 Mbps
8bit 4:4:4 176 Mbps
10bit 4:2:0 174 Mbps
10bit 4:2:2 211 Mbps
10bit 4:4:4 175 Mbps
Haven't run any metrics on them. Been looking at other stuff.
It's the SSIM/MS-SSIM libvmaf dip that seems out of order
It does seem odd, especially when the aggregate VMAF scores show the opposite trend. Evidently there is sufficient bias in the other component metrics to swing it the other way.
That said, if the metric results had been presented with the scores matched for bits i.e. as a ratio of score/bitrate or score/(bits per pixel), which is probably more valid, we'd be seeing a different pattern with respect to 420 vs 422 vs 444.....judging from your data, the bit-matched metric scores for 422 would be lower than 420 and 444 at each bit depth.
BTW - you didn't state the resolution of your EXR source, and also the specific VMAF 'model' used in your tests - in turn dictated by whether you ran in CI=False or True mode. Was it the same in the VS and ffmpeg VMAF tests ?
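The bit-matching idea above can be sketched with the Crowd Run bitrates listed earlier in this post. This assumes 1080/50p means 1920x1080 at 50 fps and that Mbps means 10^6 bits per second. No metrics were run on these encodes, so score_per_bpp is only a hypothetical helper showing how a score would be normalized:

```python
# Bitrates (Mbps) from the x265 CRF=10 Crowd Run encodes listed above
BITRATES_MBPS = {
    ("8bit", "4:2:0"): 175, ("8bit", "4:2:2"): 214, ("8bit", "4:4:4"): 176,
    ("10bit", "4:2:0"): 174, ("10bit", "4:2:2"): 211, ("10bit", "4:4:4"): 175,
}

def bits_per_pixel(mbps, width=1920, height=1080, fps=50):
    """Average coded bits per luma pixel per frame."""
    return mbps * 1_000_000 / (width * height * fps)

def score_per_bpp(score, mbps, **kwargs):
    """Hypothetical normalization: metric score divided by bits per pixel."""
    return score / bits_per_pixel(mbps, **kwargs)

for (depth, chroma), mbps in BITRATES_MBPS.items():
    print(depth, chroma, round(bits_per_pixel(mbps), 3))
```

If the raw scores were equal, the 8bit 4:2:2 encode would come out about 18% lower per bit than 8bit 4:2:0 (175/214 is roughly 0.82), which is the kind of shift the score/bitrate view would show.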
WorBry
31st March 2019, 21:19
Been looking at other stuff.
Meanwhile, back at the ranch, I've been continuing where I left off:
...whilst Crowd Run serves as a good test reference for its hard-to-compress complex/colorful content, I wouldn't say it's an especially sharp image, at least by contemporary 4K standards; there's a fair bit of motion and pan blur going on there and it was downscaled (Spline36) to 1080p for these tests also.
......I'll look and see what other HQ sharp sources I can test.
So here I used some UHD (2160/29.970p) footage shot in XAVC-S (8bit 420) format on a Sony AX100.
I used TMPGenc SmartRenderer 5 to sample 4 scenes each of 4 GOP length. The cuts were made on key frames so there was no re-encoding. The resulting test clip was 8.5 sec duration (256 frames).
I encoded the clip to x264 and x265 but limited to CRF 1-5 range - I was more interested in seeing what fine differences the metrics could pick up in the near-lossless domain than quality efficiency at lower bitrates, and especially in light of the results obtained with Boulder's Black Sails clip, where x265 had the edge:
https://forum.doom9.org/showthread.php?p=1868499#post1868499
In this instance however I didn't encode 'All Intra':
ffmpeg -i {Path}:/AX100.mp4 -vcodec libx264 -preset slow -crf {Value} -pix_fmt yuv420p -r 30000/1001 -x264-params colorprim=bt709:transfer=bt709:colormatrix=bt709 {Path}:/AX100_x264_CRFx.mp4
ffmpeg -i {Path}:/AX100.mp4 -vcodec libx265 -preset slow -crf {Value} -pix_fmt yuv420p -r 30000/1001 -x265-params colorprim=1:transfer=1:colormatrix=1 {Path}:/AX100_x265_CRFx.mp4
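For anyone wanting to repeat the CRF 1-5 sweep, the two command lines above can be generated in a loop. A small sketch that only builds the argument lists and runs nothing; the source name and output prefix are placeholders standing in for the {Path} parts above:

```python
def crf_sweep(src, out_prefix, crfs=range(1, 6)):
    """Build (not run) ffmpeg commands for x264 and x265 over a CRF range."""
    encoders = {
        "x264": ("libx264", "-x264-params",
                 "colorprim=bt709:transfer=bt709:colormatrix=bt709"),
        "x265": ("libx265", "-x265-params",
                 "colorprim=1:transfer=1:colormatrix=1"),
    }
    cmds = []
    for name, (codec, flag, params) in encoders.items():
        for crf in crfs:
            cmds.append([
                "ffmpeg", "-i", src, "-vcodec", codec, "-preset", "slow",
                "-crf", str(crf), "-pix_fmt", "yuv420p", "-r", "30000/1001",
                flag, params, f"{out_prefix}_{name}_CRF{crf}.mp4",
            ])
    return cmds

# Placeholder paths, for illustration only
cmds = crf_sweep("AX100.mp4", "AX100")
for c in cmds:
    print(" ".join(c))
```

Each list could be passed to subprocess.run as-is; building them first makes it easy to check the settings are identical across the sweep apart from the encoder and CRF.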
Here are the muvsfunc SSIM, GMSD and MDSI test results, with and without downsampling. I used LWLibavSource for import.
http://i.imgur.com/TntEcu3l.png (https://imgur.com/TntEcu3)
Now isn't that interesting. In this case, x264 gives higher (bitrate-matched) scores than x265 for all three metrics. So I turned to the quality maps to see what more could be gleaned. For this I selected frames from the four scenes that I knew to be I frames i.e. the leading I frame at each scene change. The bitrates of the x264 CRF1 and x265 CRF2 encodes were very close (457.4 and 458 Mbps respectively), so I think it's valid to compare the equivalent frames from each.
Here are the quality maps obtained with no downsample applied. As before, I amplified the map traces with two Unsharp Mask passes in Gimp. Note that the original 2160p maps and source frames were downsized to 1080p for these composites and there's a fair bit of aliasing going on in the downsized source frames. So I've included grabs of the original 2160p frames also for reference and close scrutiny.
As usual, click on the image to enlarge, on the (+) cursor to enlarge further and right click 'Save Image As' to download at original resolution.
Frame #0:
http://i.imgur.com/g9mutYKl.jpg (https://imgur.com/g9mutYK)
http://i.imgur.com/eDNcD0El.jpg (https://imgur.com/eDNcD0E)
Frame #65:
http://i.imgur.com/2k252Qzl.jpg (https://imgur.com/2k252Qz)
http://i.imgur.com/6Dx6xZRl.jpg (https://imgur.com/6Dx6xZR)
Frame #129 (It was a different scene)
http://i.imgur.com/aur4XYel.jpg (https://imgur.com/aur4XYe)
http://i.imgur.com/5D6N4iGl.jpg (https://imgur.com/5D6N4iG)
Frame #193
http://i.imgur.com/laKV2yTl.jpg (https://imgur.com/laKV2yT)
http://i.imgur.com/DK8IisSl.jpg (https://imgur.com/DK8IisS)
Feel free to scrutinize and interpret.
In the Black Sails clip tests the higher metric scores with x265 were put down to:
Quite interesting that the maps show such an amount of difference at that bitrate. I think that there you can see the fundamental difference between x264 and x265, the first one has blocking/is more focused on enhancing the edges and higher frequencies by creating "fake detail" at default settings while the latter one likes to blur more.
....and:
I think you're right. Here are crops from another frame.
http://i.imgur.com/NR57nspm.png (https://imgur.com/NR57nsp)
Click image to enlarge and (+) cursor to enlarge further
The x265 image definitely has more blur than x264 (notably on the skin textures), which MDSI deems more acceptable.
With this source however that appears to work against x265, at least in terms of the metric scores. Which looks better from a psychovisual perspective is of course another matter that might be open to debate.
Iron_Mike
3rd April 2019, 04:24
BTW - you didn't state the resolution of your EXR source, and also the specific VMAF 'model' used in your tests - in turn dictated by whether you ran in CI=False or True mode. Was it the same in the VS and ffmpeg VMAF tests ?
source/ref resolution and encoding resolution was 1920x1080p.
Model was 0.6.1 as stated on the pics - I used the exact same model in both VS|ffmpeg tests, meaning: the exact same .pkl file.
Iron_Mike
3rd April 2019, 04:30
For the other Iron_Mike tests, they were Y only , right ? For example, not SSIM-U , or SSIM-V . Because you would expect 444 to score better than 420 in terms of color in general . Or what was the point of 444,422,420 testing ?
I don't think the netflix implementation can , but the muvsfunc can specify the plane processed
Could you elaborate on this ?
which SSIM implementation (libvmaf SSIM | libvmaf MS-SSIM | SSIM muvsfunc) processes Y plane only ?
poisondeathray
3rd April 2019, 05:03
I encoded the clip to x264 and x265 but limited to CRF 1-5 range - I was more interested in seeing what fine differences the metrics could pick up in the near-lossless domain than quality efficiency at lower bitrates, and especially in light of the results obtained with Boulder's Black Sails clip, where x265 had the edge:
Interesting CRF 1-5 results.
For lossless encoding, x264 compresses better than x265 by a few % ; at least for 8bit 4:2:0 . I've compared them head to head on about 30 sources over the last few years (different types of content), and 100% of the time x264 compresses better .
For moderate to high bitrates , in scenarios where you have adequate bandwidth to attempt to preserve details, I would disable SAO for x265 . It tends to be a detail killer, acting like a smoothing filter. I always do this for my own usage.
Could you elaborate on this ?
which SSIM implementation (libvmaf SSIM | libvmaf MS-SSIM | SSIM muvsfunc) processes Y plane only ?
for muvsfunc, you specify the plane= . You might be able to do all 3 with plane=[0,1,2]
Not sure about netflix libvmaf SSIM, but the earlier Netflix comments say color isn't looked at for VMAF. And you would expect the log should show U, V (or Cb,Cr) channel info - but it doesn't
ffmpeg ssim log will print out SSIM-U, SSIM-V, SSIM-Y , so you know all 3 channels are measured
Then, there are different weighting formulas for Y,U,V . ie. What is the "correct" way to weight the channels ? Simple mean? "Y" is much more important to human vision . You can optimize for some metrics by tweaking chroma offsets for encoders (stack the "Y" by underallocating U,V) . Since you know Netflix VMAF doesn't look at color, you could allocate more bitrate into Y for example, and score higher
Also I did 8bit/10bit 4:2:0,4:2:2,4:4:4 test on another source at CRF21 and the 4:2:2 was significantly larger , so that's another data point . Still not sure if it means anything, but still interesting
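To make the weighting question concrete, here is a small sketch of how the same per-plane scores give different combined numbers depending on the weights. The per-plane values are made up, and the weight sets are only examples: 4:1:1 is what you get if each plane is weighted by its sample count in 4:2:0, and it is not a claim about what any particular tool does:

```python
def combine(y, u, v, weights):
    """Weighted mean of per-plane scores; weights is (wy, wu, wv)."""
    wy, wu, wv = weights
    return (wy * y + wu * u + wv * v) / (wy + wu + wv)

# Made-up per-plane SSIM values, for illustration only
y, u, v = 0.95, 0.99, 0.99

simple_mean = combine(y, u, v, (1, 1, 1))        # all planes equal
by_plane_size_420 = combine(y, u, v, (4, 1, 1))  # weighted by 4:2:0 sample counts
luma_only = combine(y, u, v, (1, 0, 0))          # chroma ignored entirely

print(round(simple_mean, 4), round(by_plane_size_420, 4), round(luma_only, 4))
```

Because chroma planes tend to score higher than luma, the more weight chroma gets, the better the combined number looks, so combined scores from tools with different weightings can't be compared directly.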
Iron_Mike
3rd April 2019, 06:25
for muvsfunc, you specify the plane= . You might be able to do all 3 with plane=[0,1,2]
Zoptilib and muvsfunc do not unfortunately, u have to do all three planes individually
Also I did 8bit/10bit 4:2:0,4:2:2,4:4:4 test on another source at CRF21 and the 4:2:2 was significantly larger , so that's another data point . Still not sure if it means anything, but still interesting
as of right now, I would assume this has to do with the way the encoder app is handling things...
WorBry
3rd April 2019, 06:33
For moderate to high bitrates , in scenarios where you have adequate bandwidth to attempt to preserve details, I would disable SAO for x265 . It tends to be a detail killer, acting like a smoothing filter. I always do this for my own usage.
Thanks, I'll look at that.
for muvsfunc, you specify the plane= . You might be able to do all 3 with plane=[0,1,2]
Not sure. The function definition only gives:
plane: (int) Specify which plane to be processed. Default is None.
And looking at the function code, I can't see provision for other plane options. I got the impression that it's hard coded for luma only. Same goes for GMSD.
Not sure about netflix libvmaf SSIM, but the earlier Netflix comments say color isn't looked at for VMAF. And you would expect the log should show U, V (or Cb,Cr) channel info - but it doesn't
Simplest way to test is to convert the source and test clips to greyscale and see if it changes the scores. I did that in the Crowd Run 10bit 422 tests with ProRes and Cineform and the muvsfunc SSIM and GMSD scores were not affected at all. Didn't test VMAF on the greyscale clips in that instance, but I'm pretty sure all the component metrics are luma plane only.
ffmpeg ssim log will print out SSIM-U, SSIM-V, SSIM-Y , so you know all 3 channels are measured
Right. Zorr checked out the code for ffmpeg SSIM (AVISynth SSIM also) earlier in the thread. In essence, it's akin to the 'fast' SSIM metric in the MSU Quality Tool:
https://forum.doom9.org/showthread.php?p=1866030#post1866030
https://forum.doom9.org/showthread.php?p=1866162#post1866162
Couple of examples where I plotted out the Y,U,V channel scores.
http://i.imgur.com/ICbJOIMm.png (https://imgur.com/ICbJOIM)
http://i.imgur.com/WZ1HkDLm.png (https://imgur.com/WZ1HkDL)