VMAF - Video Multi-Method Assessment Fusion [Archive] - Page 2

WorBry

11th February 2019, 05:20

Good comparison to show that x265 really has no advantage over x264 on 1080p materials if we're going for transparent encoding.

Now we'll just have to wait for people with high end computer to do the 4K comparison.

Done. So I ran a parallel test series using the original 2160/50p Crowd Run clip (8bit 420 y4m) as source and reference for the metric tests. For VMAF tests I used the VapourSynth (v3) plugin with the vmaf_4k_v0.6.1.pkl model (model=1) which aims to "predict the subjective quality of video displayed on a 4KTV and viewed from the distance of 1.5 times the height of the display device (1.5H)"

The x264 results:

http://i.imgur.com/ajUApds.png (https://imgur.com/ajUApds)

http://i.imgur.com/QLT0D6W.png (https://imgur.com/QLT0D6W)

Interesting that the shape of ffmpeg-SSIM vs bitrate plot is quite different to that in the 1080/50p series and the differential between the VMAF and ffmpeg-SSIM scores is larger. The x265 encodes show the same behaviour:

http://i.imgur.com/On2NNyf.png (https://imgur.com/On2NNyf)

Again the VMAF scores deem that x265 has higher perceptual quality over the lower bitrate range.

As to whether there is an advantage over x264 for 'transparent' encoding; well, I looked more closely at the point at which the VMAF plots hit the maximum score of 100.

http://i.imgur.com/qHeRVnr.png (https://imgur.com/qHeRVnr)

For x264 it was at CRF=8 (1296 Mbps) and for x265 at CRF=10 (976 Mbps). So on that basis it could be concluded that x265 is significantly more efficient. That said, if you look at the per-frame VMAF scores, it is clear that the first frame skews the outcome somewhat.

Taking the x264 series first; going from CRF=9 to 16, all frames bar the first frame in each test scored VMAF=100. And in the x265 series also, going from CRF 11 to 17 only the first frame scored less than VMAF=100:

http://i.imgur.com/sVGXdAC.png (https://imgur.com/sVGXdAC)

So, if the aggregate VMAF scores are calculated with the first frame excluded (simple average across the remaining 499 frames), CRF=16 (392 Mbps) becomes the point at which VMAF=100 is reached in the x264 series, and CRF=17 (306 Mbps) in the x265 series:

http://i.imgur.com/s19XMYe.png (https://imgur.com/s19XMYe)

Makes quite a difference. x265 still has the edge on bit savings, but not by as much. I don't have time to calculate 'adjusted' aggregate VMAF scores for the other (lower bitrate) CRF data points. In the 1080/50p series an aggregate VMAF=100 score was never attained for precisely the same reason - the VMAF score of the first frame skewed the aggregate score.
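To make the first-frame effect concrete, here is a minimal sketch of the 'adjusted' aggregate described above (simple average with frame 0 excluded). The per-frame values are made up for illustration; only the frame count and the "499 frames at 100" pattern come from the post.

```python
from statistics import mean

# Hypothetical per-frame VMAF scores: 500 frames, only the first below 100.
# The first-frame value (97.0) is invented for illustration.
frame_scores = [97.0] + [100.0] * 499


def aggregate(scores, skip_first=False):
    """Arithmetic mean of per-frame scores, optionally ignoring frame 0."""
    used = scores[1:] if skip_first else scores
    return mean(used)


print(f"all 500 frames: {aggregate(frame_scores):.4f}")
print(f"frames 1..499 : {aggregate(frame_scores, skip_first=True):.4f}")
```

A single low frame is enough to keep the aggregate just short of 100, which is the pattern seen in both series.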

Here are the ffmpeg-SSIM scores obtained in the 2160/50p series at these 'significant' CRF points though:

http://i.imgur.com/0vb8SAu.png (https://imgur.com/0vb8SAu)

Edit:

Interesting that the shape of ffmpeg-SSIM vs bitrate plot is quite different to that in the 1080/50p series and the differential between the VMAF and ffmpeg-SSIM scores is larger...

I'll maybe see how the 2160/50p and 1080/50p series compare when plotted against bits/pixel.
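The thread does not state how bits/pixel was derived for the plots, so the following is only a sketch of two common normalisations, assuming 1 Mbps = 1,000,000 bits/s. The function name and the choice of example figures (392 Mbps at 3840x2160, 50 fps) are for illustration.

```python
def bits_per_pixel(bitrate_mbps, width, height, fps=None):
    """Normalise a bitrate by frame area, and by frame rate if fps is given.

    fps=None -> bits per pixel per second (bitrate / pixels per frame)
    fps=N    -> bits per pixel per frame  (bitrate / pixels per second)
    """
    bits_per_second = bitrate_mbps * 1_000_000  # assumes decimal megabits
    pixels = width * height
    if fps is None:
        return bits_per_second / pixels
    return bits_per_second / (pixels * fps)


# x264 CRF=16 data point from the 2160/50p series (392 Mbps)
print(bits_per_pixel(392, 3840, 2160))      # per pixel, per second
print(bits_per_pixel(392, 3840, 2160, 50))  # per pixel, per frame
```

Whichever normalisation is used, it has to be the same for the 1080p and 2160p series for the comparison to be meaningful.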

lansing

11th February 2019, 08:00

x265 still has the edge on bit savings, but not by as much.

I would say there's no advantage at all. The difference is like 0.03 between the two scores at crf 17. I thought it would be like a 5 or 6 point difference...

VS_Fan

11th February 2019, 17:49

So, if the aggregate VMAF scores are calculated with the first frame excluded (simple average across the remaining 499 frames), CRF=16 (392 Mbps) becomes the point at which VMAF=100 is reached in the x264 series, and CRF=17 (306 Mbps) in the x265 series ... x265 still has the edge on bit savings, but not by as much.

Considering your criteria to obtain a 'transparent' encode, with x265 you are reducing the required bitrate from 392 (with x264) to 306 Mbps; you get 28% savings in storage space and/or streaming bandwidth, that's a significant amount!

ChaosKing

11th February 2019, 18:04

You save 22%, not 28%, which is still very good.
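Both percentages come out of the same two bitrates; they just use a different baseline. A quick check (function names are only for illustration):

```python
def saving(old_mbps, new_mbps):
    """Fraction of the old bitrate that is saved by switching to the new one."""
    return (old_mbps - new_mbps) / old_mbps


def overhead(old_mbps, new_mbps):
    """Fraction by which the old bitrate exceeds the new one."""
    return (old_mbps - new_mbps) / new_mbps


x264_mbps, x265_mbps = 392, 306  # CRF=16 and CRF=17 figures quoted above

print(f"x265 saves {saving(x264_mbps, x265_mbps):.1%} relative to x264")
print(f"x264 needs {overhead(x264_mbps, x265_mbps):.1%} more than x265")
```

So 22% is the saving measured against the x264 bitrate, while 28% is how much more x264 needs measured against the x265 bitrate.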

WorBry

11th February 2019, 18:25

For better (more valid ?) interpretation of these fine differences at high bitrates I'm thinking now it might have been prudent to run these tests with the 'Confidence Interval' (ci) option.

That said, I see that the log generates ci95_high, ci95_low and stddev values for the individual frames but does not derive aggregate values as I was expecting - according to the VDK documentation, the command line tool reports aggregate values:

https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md

So is this a limitation of the libvmaf implementation ?

WorBry

12th February 2019, 07:17

I'll maybe see how the 2160/50p and 1080/50p series compare when plotted against bits/pixel.

The VMAF results:

http://i.imgur.com/RBzTFoB.png (https://imgur.com/RBzTFoB)

Perhaps not surprising, given that the vmaf_4k_v0.6.1 model predicts the subjective quality of video displayed on a 4KTV and viewed from the distance of 1.5 times the height of the screen whereas the vmaf_v0.6.1 model predicts the subjective quality of video displayed on a 1080p HDTV screen at a distance of 3 times the screen height.

In the 1080/50p series an aggregate VMAF=100 score was never attained for precisely the same reason - the VMAF score of the first frame skewed the aggregate score.

http://i.imgur.com/syJWJ0V.png (https://imgur.com/syJWJ0V)

As seen there, the maximum VMAF score achieved in the 1080/50p series was 99.947 with the lossless (crf0) x264 encode.

What intrigues me more are the FFMPEG-SSIM results:

http://i.imgur.com/cHsJnCW.png (https://imgur.com/cHsJnCW)

It's reasonable to assume that down-scaling of the original 2160/50p Crowd Run clip for the 1080/50p tests incurred some loss of fidelity in the 1080/50p source (and reference) clip, making it more 'compressible'. But why is the differential between the bit-matched 1080p and 2160p SSIM scores so much larger at 32-64 bits/pixel than it is down at around 6-8 bits/pixel ?

HolyWu

12th February 2019, 08:26

So is this a limitation of the libvmaf implementation ?

Yes. I just sent a pull request to improve that.

WorBry

12th February 2019, 17:47

Great. Thanks.

https://github.com/Netflix/vmaf/pull/304

WorBry

12th February 2019, 23:42

I did record the libvmaf and ffmpeg PSNR scores also, but they are not as interesting.

Actually they are quite interesting.

PSNR scores obtained with the VapourSynth VMAF filter:

http://i.imgur.com/sSFW1rQ.png (https://imgur.com/sSFW1rQ)

60 is the maximum score, achieved only with the lossless x264 CRF=0 encodes.

And the equivalent ffmpeg PSNR results:

http://i.imgur.com/iEq4pkP.png (https://imgur.com/iEq4pkP)

I excluded the CRF=0 encode results because ffmpeg-PSNR reports lossless as Infinity (Inf).

The libvmaf PSNR scores are in general a little lower than the ffmpeg PSNR scores but show the same overall pattern. The 1080/50p series encodes gave higher bit-matched SSIM scores than the 2160/50p series at the higher bit/pixel range but at the lower end (< 24bits/pixel), that is reversed. Interesting also that the libvmaf PSNR metric gives wider separation of the x264 and x265 scores in the higher bit/pixel domain.
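The different treatment of lossless encodes (60 vs Inf) is what you would expect if one tool clamps PSNR and the other does not. A minimal sketch, assuming 8-bit samples (peak 255) and the textbook PSNR formula; the clamp value of 60 is taken from the scores reported above, and the function is not lifted from either tool:

```python
import math


def psnr(mse, peak=255.0, cap=None):
    """PSNR in dB from mean squared error, optionally clamped to `cap`."""
    if mse == 0:
        # Identical frames: unbounded unless a cap is applied.
        return math.inf if cap is None else cap
    value = 10.0 * math.log10(peak * peak / mse)
    return value if cap is None else min(value, cap)


print(psnr(0.0))             # uncapped lossless case
print(psnr(0.0, cap=60.0))   # capped lossless case
print(psnr(1.0, cap=60.0))   # ordinary lossy case, well under the cap
```

With a cap in place, lossless results can be plotted alongside the lossy ones instead of having to be excluded.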

Iron_Mike

1st March 2019, 05:51

Update r2.

Scale 10-bit pixel values to 8-bit range for correct score calculation.
Use stricter linear frame request since VMAF score will change if frame order is different.
Report aggregate PSNR, SSIM, and MS-SSIM scores in addition to VMAF score.

could you elaborate on the first two points of this update ?

does that mean we have to scale both inputs to 8bit before calculating VMAF, or does VMAF do it automatically ?

I'm comparing a 16bit exr sequence to itself (as control) and I get a 98.2 VMAF score... I know that the FAQ states that this is normal, just making sure I do this correctly... most of the examples here use 8bit source/ref footage...

When I compare a 12bit 444 (yuv444p12le) CRF 10 x265 encode to the 16 bit exr ref footage I get a 96.4 VMAF score... little bit low considering the tests that WorBry has done...

Also, how do I use a "stricter linear frame request" ?

Thanks.

WorBry

1st March 2019, 15:19

could you elaborate on the first two points of this update ?

does that mean we have to scale both inputs to 8bit before calculating VMAF, or does VMAF do it automatically ?

.....Also, how do I use a "stricter linear frame request" ?

These are internal improvements that were made in update r2 - you don't need to do anything.

When I compare a 12bit 444 (yuv444p12le) CRF 10 x265 encode to the 16 bit exr ref footage I get a 96.4 VMAF score... little bit low considering the tests that WorBry has done...

Bear in mind though that all of those tests were done on a single source. Could be any number of factors weighing in there. What scores do you get if you test the ref and x265 clips against themselves ?

poisondeathray

1st March 2019, 16:43

When I compare a 12bit 444 (yuv444p12le) CRF 10 x265 encode to the 16 bit exr ref footage I get a 96.4 VMAF score... little bit low considering the tests that WorBry has done...

I'm not sure how valid that test would be. EXR is usually sRGB linear , and 16bit half float .

For any metric you usually need a common ground to compare. This means same pixel format (same colorspace, same bit depth, same chroma subsampling) . Otherwise you introduce other variables that are not controlled for. e.g. if one run uses one algorithm to scale (e.g. bicubic vs. bilinear, vs...) , or another dithers down using one algorithm, but another does not... or if you convert to RGB using different matrix, etc... there are many factors that invalidate your testing

WorBry

1st March 2019, 22:12

Not to mention the potential for frame shifts/misalignment when using different decoders for the reference and test clips, although the filter will report an error if the number of frames is different.

Also needs to be appreciated that the VMAF models are 'trained' for predicting perceptual quality at streaming bitrates primarily. Came across this quote from Netflix:

"VMAF has been trained using encodes spanning from CRF 22 @ 1080 (highest quality) to CRF 28 @ 240 (lowest quality). The former is mapped to score 100 and the latter is mapped to score 20. Anything in between is mapped in the middle (for example, SD encode at 480 is typically mapped to 40 ~ 70)."

https://streaminglearningcenter.com/codecs/finding-the-just-noticeable-difference-with-netflix-vmaf.html

I would assume a similar focus was applied in training the 4K model.

So at CRF 10 you are well into uncharted territory.

Personally, I'd be more inclined to look at other metrics available for VapourSynth that are (maybe) better attuned for VQA in the visually lossless domain - GMSD, MDSI and yes, SSIM....Butteraugli, possibly. My own journey of discovery in that vein continues:

https://forum.doom9.org/showthread.php?t=176101

Iron_Mike

1st March 2019, 22:42

Bear in mind though that all of those tests were done on a single source. Could be any number of factors weighing in there. What scores do you get if you test the ref and x265 clips against themselves ?

I ran all test w/ the ffmpeg libvmaf filter, but I assume since it uses the exact same VMAF models the result would be the same...

16bit EXR ref clip is 400 frames - control test to itself via 0.6.1 results in 98.2

x265 12bit 444 encode (from EXR), control tested to itself via 0.6.1 results in 98.08

x265 12bit 444 encode (from EXR) tested against its source (16 bit EXR) via 0.6.1 results in 96.4

interesting that the clip you tested had 399 out of 400 frames a perfect 100 in the control test...

Iron_Mike

1st March 2019, 22:51

I'm not sure how valid that test would be. EXR is usually sRGB linear , and 16bit half float .

For any metric you usually need a common ground to compare. This means same pixel format (same colorspace, same bit depth, same chroma subsampling) . Otherwise you introduce other variables that are not controlled for. e.g. if one run uses one algorithm to scale (e.g. bicubic vs. bilinear, vs...) , or another dithers down using one algorithm, but another does not... or if you convert to RGB using different matrix, etc... there are many factors that invalidate your testing

EXR contains whatever you put into it (has nothing to do with sRGB) - there are no assumptions here: this is the master of the movie in HD in lossless 16 bit EXR, I took 400 frames as a test sequence

for streaming the movie was encoded via ffmpeg and x265 from EXR (rgb48le) to x265 12bit 444 (yuv444p12le) - final result looks very good, we're using VMAF to compare various encodes (presets/CRF/etc) against each other... exact same as NF does it w/ VMAF... the master one delivers to NF is obviously also not 8bit 420, they encode from that (high quality) master for NF streaming...

so the "common ground" you state is the same movie in the same resolution, which is the only thing that NF states in their VMAF instructions...

the whole reason for comparison is different output bit depth w/ different output chroma subsampling, on top of different encoding settings, so I do not understand your point...

Iron_Mike

1st March 2019, 22:59

Not to mention the potential for frame shifts/misalignment when using different decoders for the reference and test clips, although the filter will report an error if the number of frames is different.

not sure what "different encoders" here relates to ?

ffmpeg reads an EXR frame and then reads a frame from the mp4 encode (that was done from the EXR via ffmpeg) and then passes the decoded frames to VMAF for comparison...

is there a setting to avoid frame shifts/misalignment ?

So at CRF 10 you are well into uncharted territory.

you yourself validated the test clip up to CRF 0 and results on the charts make sense... your encodes went up to VMAF 100... not sure I understand your concern for CRF 10 ?

Thanks

WorBry

1st March 2019, 23:09

..interesting that the clip you tested had 399 out of 400 frames a perfect 100 in the control test...

For that particular source, yes...actually it was 499 out of 500 frames that scored 100 - it was just that first frame with the motion2 score of 0 that skewed the aggregate score. But I've yet to test other sources.

Go through the list of per-frame VMAF scores from your 'self' tests and you'll be able to identify which frames are skewing the aggregate score.
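If the per-frame log is in JSON, picking out the offending frames takes only a few lines. This is a sketch under assumptions: the log structure and key names below ("frames", "frameNum", "metrics", "vmaf") are illustrative and should be checked against what your build actually writes, and the scores are made up.

```python
import json

# Hypothetical per-frame log; structure and values are for illustration only.
log_text = """
{"frames": [
  {"frameNum": 0, "metrics": {"motion2": 0.0, "vmaf": 97.4}},
  {"frameNum": 1, "metrics": {"motion2": 3.1, "vmaf": 100.0}},
  {"frameNum": 2, "metrics": {"motion2": 3.0, "vmaf": 100.0}},
  {"frameNum": 3, "metrics": {"motion2": 2.9, "vmaf": 99.1}}
]}
"""


def frames_below(log, threshold=100.0, key="vmaf"):
    """Return (frame number, score) for every frame scoring under threshold."""
    frames = json.loads(log)["frames"]
    return [
        (f["frameNum"], f["metrics"][key])
        for f in frames
        if f["metrics"][key] < threshold
    ]


for num, score in frames_below(log_text):
    print(f"frame {num}: VMAF {score}")
```

Lowering the threshold (say to 99) separates a handful of outliers from a clip where most frames sit below 100.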

poisondeathray

1st March 2019, 23:29

for streaming the movie was encoded via ffmpeg and x265 from EXR (rgb48le) to x265 12bit 444 (yuv444p12le) - final result looks very good, we're using VMAF to compare various encodes (presets/CRF/etc) against each other... exact same as NF does it w/ VMAF... the master one delivers to NF is obviously also not 8bit 420, they encode from that (high quality) master for NF streaming...

so the "common ground" you state is the same movie in the same resolution, which is the only thing that NF states in their VMAF instructions...

the whole reason for comparison is different output bit depth w/ different output chroma subsampling, on top of different encoding settings, so I do not understand your point...

You are posting in the vapoursynth vmaf thread. Only certain pixel formats are supported.

https://github.com/HomeOfVapourSynthEvolution/VapourSynth-VMAF/

Clips to calculate VMAF score. Only YUV420P8, YUV422P8, YUV444P8, YUV420P10, YUV422P10, and YUV444P10 are supported.

Those are the common pixel formats supported by vmaf.VMAF .

My point is strive to be more scientific. To eliminate all those confounding variables in a controlled environment. How you perform the various conversions will affect the results that are calculated.

But now it's clear you're using ffmpeg vmaf. Did you look at the ffmpeg log to see what other conversions were occurring ? There might be other stuff going on behind your back

ffmpeg reads an EXR frame and then reads a frame from the mp4 encode (that was done from the EXR via ffmpeg) and then passes the decoded frames to VMAF for comparison...

is there a setting to avoid frame shifts/misalignment ?

Sometimes ffmpeg can "mix" up frames, less often with I-frame formats. But if your x265 encode used long GOP, there is a higher chance of a mixup than if it used I-frames only. EXR sequence will be I-frame only

For the vapoursynth , the source filter can be indexed, and is more robust method for frame accuracy. For ffmpeg you can reset the PTS which might help
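For the ffmpeg route, resetting the PTS usually means putting setpts=PTS-STARTPTS on both inputs ahead of libvmaf. The sketch below only builds and prints such a command line; the file names are placeholders, and the libvmaf option names (log_path, log_fmt) are assumed to match your ffmpeg build, so check `ffmpeg -h filter=libvmaf` before relying on them.

```python
import shlex


def vmaf_command(distorted, reference, log_path="vmaf.json"):
    """Build an ffmpeg argument list that zeroes both timelines before libvmaf.

    The first input is the distorted clip, the second the reference.
    """
    graph = (
        "[0:v]setpts=PTS-STARTPTS[dist];"
        "[1:v]setpts=PTS-STARTPTS[ref];"
        f"[dist][ref]libvmaf=log_path={log_path}:log_fmt=json"
    )
    return [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", graph, "-f", "null", "-",
    ]


# Placeholder file names, not from the thread.
cmd = vmaf_command("encode.mp4", "reference.mkv")
print(shlex.join(cmd))
```

This does not guarantee frame alignment, it just removes differing start timestamps as one possible cause; comparing per-frame scores around scene cuts is still a worthwhile sanity check.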

you yourself validated the test clip up to CRF 0 and results on the charts make sense... your encodes went up to VMAF 100... not sure I understand your concern for CRF 10 ?

Look at the results WorBry has been posting . They all plateau off below crf 18 or so. crf16 has the "same quality" as crf 10 or crf 1 if you blindly believe VMAF. ie. Everything looks "the same" to VMAF at higher bitrate ranges. ie. It's not a useful metric for distinguishing between higher quality streams - only for streaming lowish bitrate delivery ranges

Iron_Mike

1st March 2019, 23:30

Go through the list of per-frame VMAF scores from your 'self' tests and you'll be able to identify which frames are skewing the aggregate score.

I did that, only a few frames out of the 400 are VMAF 100 - all others have scores in the 97-99 range (in the EXR control test), hence my question...

This all seems normal considering that NF themselves state that in their FAQs but when I saw you get a perfect 100 in 499/500 frames I thought maybe they've updated their model to make control tests perform close to 100...

WorBry

1st March 2019, 23:40

not sure what "different encoders" here relates to ?

I said 'decoders'.

Iron_Mike

1st March 2019, 23:42

Those are the common pixel formats supported by vmaf.VMAF.

is this for the VS flavor or for VMAF in general... where did you get this list ?

But now it's clear you're using ffmpeg vmaf. Did you look at the ffmpeg log to see what other conversions were occurring ? There might be other stuff going on behind your back

since I wasn't sure whether the bitrate was an issue (for the VMAF calculation), I converted the main and ref clips to yuv444p (8bit) before passing them into ffmpeg libvmaf (by specifying the -pix_fmt)... ffmpeg VMAF will tell the format it uses to compare in the console output, for my main/ref clips (16bit/12bit) it defaults to yuv444p10le, but once u pass clips in as 8bit it uses that format...

the VMAF score whether using original bit depth, 12 bit, 10 bit or 8 bit for the main/ref clips was always ~ 96.x (real test, not control)

Sometimes ffmpeg can "mix" up frames, less often with I-frame formats. But if your x265 encode used long GOP, there is a higher chance of a mixup than if it used I-frames only. EXR sequence will be I-frame only

GOP size on the encode is a fixed 48 frames, fps is 24

Look at the results WorBry has been posting . They all plateau off below crf 18 or so. crf16 has the same quality as crf 10 or crf 1 if you blindly believe VMAF. ie. Everything looks "the same" to VMAF at higher bitrate ranges. ie. It's not a useful metric for distinguishing higher quality - only for streaming lowish bitrate delivery ranges

well, or in other words:
those results could easily be interpreted that from a certain CRF on, the encode is perceptually identical, which is the whole point of VMAF...

their samples are based on humans reporting perceived quality differences...

ChaosKing

1st March 2019, 23:48

is this for the VS flavor or for VMAF in general... where did you get this list ?

Read the Readme: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-VMAF/#usage

poisondeathray

1st March 2019, 23:50

is this for the VS flavor or for VMAF in general... where did you get this list ?

vapoursynth vmaf,

https://github.com/HomeOfVapourSynthEvolution/VapourSynth-VMAF/

since I wasn't sure whether the bitrate was an issue (for the VMAF calculation), I converted the main and ref clips to yuv444p (8bit) before passing them into ffmpeg libvmaf (by specifying the -pix_fmt)... ffmpeg VMAF will tell the format it uses to compare in the console output, for my main/ref clips (16bit/12bit) it defaults to yuv444p10le, but once u pass clips in as 8bit it uses that format...

the VMAF score whether using original bit depth, 12 bit, 10 bit or 8 bit for the main/ref clips was always ~ 96.x (real test, not control)

VMAF is probably less picky about those sorts of things, but it makes a significant difference on other metrics. There are a bunch of uncontrolled variables and operations there that can cause wildly different results with other metrics - how it's scaled, dithering algo, etc...

well, or in other words:
those results could easily be interpreted that from a certain CRF on, the encode is perceptually identical, which is the whole point of VMAF...

their samples are based on humans reporting perceived quality differences...

Yes , that's a good way of phrasing it

I personally haven't used VMAF enough to be comfortable with it yet

I personally don't find that particularly useful. I guess it might be good enough for "joe public" , they might not be able to tell the difference. But you can bet people that deal frequently with encoding, codecs, compression ; ie. people that post here - they can tell the difference between say, a crf 10 vs. crf 18 encode.

Maybe a conspiracy theory, but it's almost like a Netflix scheme trying to justify their low delivery bitrate practices :devil:

Iron_Mike

2nd March 2019, 01:18

I personally don't find that particularly useful. I guess it might be good enough for "joe public" , they might not be able to tell the difference. But you can bet people that deal frequently with encoding, codecs, compression ; ie. people that post here - they can tell the difference between say, a crf 10 vs. crf 18 encode.

well "joe public" ultimately watches the content... I can tell the diff between CRF 10 and CRF 18 (frame per frame pixel peeping), but I also evaluate content on fully calibrated screens...

the problem w/ "scientific metrics" is that they often do not relate much to the HVS (Human Visual System), which is the only thing that matters when humans watch the streamed content...

VMAF attempts to address that with their sample data... the questions are always whether enough people were sampled, what kind of people (gender/age/race/ethnicity - diff between European and Asian samples etc) and whether the sample procedure was done as well as possible...

Maybe a conspiracy theory, but it's almost like a Netflix scheme trying to justify their low delivery bitrate practices :devil:

hah ! probably the reason to start the project.. :cool:

poisondeathray

2nd March 2019, 02:11

well "joe public" ultimately watches the content... I can tell the diff between CRF 10 and CRF 18 (frame per frame pixel peeping), but I also evaluate content on fully calibrated screens...

It's an assumption . Audience might be videophiles, or doom9ers, or you might be doing these tests for yourself

the problem w/ "scientific metrics" is that they often do not relate much to the HVS (Human Visual System), which is the only thing that matters when humans watch the streamed content...

Yes, pros/cons to every measure , but there are other HVS modelled metrics.

VMAF attempts to address that with their sample data... the questions are always whether enough people were sampled, what kind of people (gender/age/race/ethnicity - diff between European and Asian samples etc) and whether the sample procedure was done as well as possible...

It's just that the RD curve characteristics limit VMAF's usefulness in some situations , since it's trained on higher CRF ranges.

So another way to phrase it is that the data set is not valid at higher bitrates. You cannot apply VMAF at higher bitrates because it was trained at CRF 22-28

WorBry

2nd March 2019, 02:18

VMAF attempts to address that with their sample data... the questions are always whether enough people were sampled, what kind of people (gender/age/race/ethnicity - diff between European and Asian samples etc) and whether the sample procedure was done as well as possible...

And what biases were introduced into the model by the choice of video codecs used in the subjective DMOS testing. Hmmm :cool:

https://forum.doom9.org/showthread.php?p=1867137#post1867137

Makes me wonder.

https://www.reddit.com/r/netflix/comments/9r75hm/netflix_starting_to_use_hevc_codec_for_hd_titles/

HolyWu

2nd March 2019, 04:18

Update r4.

Update libvmaf to v1.3.14, which reports aggregate CI scores and fixes the empty model name in the log.

WorBry

2nd March 2019, 05:55

Cool. Should be interesting to see what statistical significance VMAF gives to those superfine score differences seen at the very high bitrates that I brought attention to earlier, which now, in light of the present discussion, I wish I hadn't ;)

https://forum.doom9.org/showthread.php?p=1865424#post1865424

Seeing that comment from Netflix changed my perspective somewhat:

"VMAF has been trained using encodes spanning from CRF 22 @ 1080 (highest quality) to CRF 28 @ 240 (lowest quality). The former is mapped to score 100 and the latter is mapped to score 20. Anything in between is mapped in the middle (for example, SD encode at 480 is typically mapped to 40 ~ 70)."

WorBry

4th March 2019, 07:08

Cool. Should be interesting to see what statistical significance VMAF gives to those superfine score differences seen at the very high bitrates that I brought attention to earlier....

I've tested the v4 update with the Crowd Run 1080/50p x264 (CRF 0 - 30) encodes I retained from the first tests with v3:

https://forum.doom9.org/showthread.php?p=1864770#post1864770

Here are the VMAF results, together with the aggregate 95% confidence interval (CI95_Low and CI95_High) scores i.e. the aggregate derived from the individual frame confidence intervals. I didn't generate the component SSIM, MS-SSIM and PSNR scores.

http://i.imgur.com/V2VKdry.png (https://imgur.com/V2VKdry)

http://i.imgur.com/MrYWJ8Q.png (https://imgur.com/MrYWJ8Q)

First thing to note is that the VMAF v4 scores are lower than the scores I obtained previously (with the exact same x264 encodes) with v3. The same default pool=1 (harmonic mean) setting was applied in both cases, so I can only assume this reflects changes in the VMAF model itself.

And homed in on the higher bitrate range.

http://i.imgur.com/SIR1U9O.png (https://imgur.com/SIR1U9O)

As noted in the v3 test series, the VMAF score for the lossless x264 CRF=0 encode (99.9954) didn't quite reach 100, and for the same reason - the component motion2=0 score for the first frame skewed an otherwise perfect 100 score for the other 499 frames.

I've yet to test the parallel x265 series with VMAF v4 but looking at the aggregate CI scores obtained with the x264 files I think I can confidently say that what minor differences were seen at the high bitrates in the first test series are not statistically significant. Seems odd though that the CI95_Low intervals for the CRF 22 - 30 encodes are actually smaller than those of CRF 20 despite being beyond the scope of the trained vmaf_v0.6.1.pkl model. Would have thought they would be larger. I suppose it depends on the content and quality of the source/reference video also.
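On the pool=1 (harmonic mean) setting mentioned above: the choice of pooling changes how strongly a few low frames pull the aggregate down. A small sketch comparing the two; the +1 offset in the harmonic mean (to stay safe with zero scores) is my assumption about how the pooling is done, and the per-frame values are made up.

```python
from statistics import mean


def pool_mean(scores):
    """Plain arithmetic mean of per-frame scores."""
    return mean(scores)


def pool_harmonic(scores):
    """Harmonic mean with a +1 offset so a frame score of 0 cannot divide by zero.

    The offset form is an assumption, not taken from the thread.
    """
    return 1.0 / mean([1.0 / (s + 1.0) for s in scores]) - 1.0


# Hypothetical clip: one poor frame among good ones.
scores = [60.0, 100.0, 100.0, 100.0]

print(f"mean         : {pool_mean(scores):.4f}")
print(f"harmonic mean: {pool_harmonic(scores):.4f}")
```

The harmonic mean penalises the low frame more than the arithmetic mean does, so aggregates from different pool settings should not be compared directly.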

Jamaika

4th March 2019, 07:48

I did my VMAF test on BPG files.
VMAF is already embedded in the SVT encoder, not as a JSON file tester.
Pictures for I frames are better because they have a larger size at the same QP values for different encoders. And so much on the topic of photos.
Ma should add codec X265 with VMAF metric.
http://forum.doom9.org/showthread.php?p=1867419#post1867419

WorBry

4th March 2019, 21:51

First thing to note is that the VMAF v4 scores are lower than the scores I obtained previously (with the exact same x264 encodes) with v3. The same default pool=1 (harmonic mean) setting was applied in both cases, so I can only assume this reflects changes in the VMAF model itself.

I see what the issue is now. When I ran the first series of tests with v3 I left it set for CI=False because it did not generate the aggregate CI scores (only the per-frame CI scores). Now that aggregate CI scores are available in v4 I set CI=True which switches from the 'vmaf_v0.6.1.pkl' model to 'vmaf_b_v0.6.3.pkl'. Testing the x264 series again with v4 and CI=False, the aggregate VMAF scores are exactly the same as those obtained previously with v3.

So actually the issue is that CI=True (model vmaf_b_v0.6.3.pkl) is giving lower aggregate VMAF scores than CI=False (model vmaf_v0.6.1.pkl) ! Why is that ? Surely they should be giving the same VMAF score ?

Edit: There was no mix-up of the 'model' folders , btw - when I updated to v4 I replaced the VMAF.dll and 'model folder' that came with it.

The above graphs amended accordingly:

http://i.imgur.com/qC0jFq2.png (https://imgur.com/qC0jFq2)

http://i.imgur.com/acrt2MK.png (https://imgur.com/acrt2MK)

http://i.imgur.com/tTaUoXu.png (https://imgur.com/tTaUoXu)

HolyWu

5th March 2019, 03:21

So actually the issue is that CI=True (model vmaf_b_v0.6.3.pkl) is giving lower aggregate VMAF scores than CI=False (model vmaf_v0.6.1.pkl) ! Why is that ? Surely they should be giving the same VMAF score ?

I don't remember seeing any official document mentioning that the VMAF score between non-CI model and CI model should be the same :confused:

WorBry

5th March 2019, 05:40

I guess the VMAF Confidence Interval doc does explain why there are differences:

https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md

Note that the CI=False VMAF scores are within or at the limits of the CI95_High interval.

Edit: btw, testing the parallel series of x265 encodes with v4 CI=True gives exactly the same pattern.
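For anyone unfamiliar with how a bootstrapped model yields an interval at all, here is a generic sketch: each bootstrap model predicts a score for the frame, and the mean, standard deviation and 2.5/97.5 percentiles are taken over those predictions. This is not the VDK code, the numbers are made up, and how the VDK combines the per-model predictions into its reported score is not something this sketch claims to reproduce.

```python
from statistics import mean, stdev

# Hypothetical predictions for one frame, one per bootstrap model (made-up numbers).
bootstrap_preds = [93.1, 94.0, 92.6, 93.8, 94.4, 92.9, 93.5, 94.1, 93.0, 93.7]


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


ensemble_mean = mean(bootstrap_preds)
spread = stdev(bootstrap_preds)
ci95_low = percentile(bootstrap_preds, 2.5)
ci95_high = percentile(bootstrap_preds, 97.5)

print(f"mean {ensemble_mean:.2f}, stddev {spread:.2f}, "
      f"95% interval [{ci95_low:.2f}, {ci95_high:.2f}]")
```

A score from a separately trained single model has no reason to coincide with the ensemble mean, only to land somewhere near the interval.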

HolyWu

6th March 2019, 02:52

I guess the VMAF Confidence Interval doc does explain why there are differences:

You are right. See #316 (https://github.com/Netflix/vmaf/issues/316).

WorBry

6th March 2019, 03:34

Thanks for raising the issue. I have a better understanding of what's going on now.:)

Iron_Mike

11th March 2019, 10:59

ran some control tests (same file validated to itself) via ffmpeg libvmaf and VS libvmaf to test whether both implementations return the same data.

since libvmaf only supports up to yuv444p10le, higher quality formats need to be down-converted - ffmpeg does that automatically.

4 sources used for the control tests: RGB48, yuv444p12le, yuv444p10le, yuv444p

VMAF SDK 1.3.14 - Model 0.6.1 - pool: mean

EXR RGB48          VMAF      Note
ffmpeg             98.2549   converts internally to yuv444p10le
VS (1)             97.747    converted to yuv444p10le via FMTC
VS (2)             97.7475   converted to yuv444p10le via resize.bicubic

MP4 yuv444p12le    VMAF      Note
ffmpeg             98.1216   converts internally to yuv444p10le
VS (1)             97.7106   converted to yuv444p10le via FMTC
VS (2)             97.7101   converted to yuv444p10le via resize.bicubic

MP4 yuv444p10le    VMAF      Note
ffmpeg             98.1044   no conversion needed
VS                 97.7144   no conversion needed

MP4 yuv444p        VMAF      Note
ffmpeg             97.7363   no conversion needed
VS                 97.7363   no conversion needed

While it can be expected that the 16bit and 12bit sources will not return the same VMAF scores (ffmpeg internal down-conversion may not match the chosen VS conversion method), it is interesting to see that only w/ the 8-bit src the VMAF scores match.

The VMAF scores of the 10-bit src (although no conversion being done) still differ.

what is the reason for that ?

HolyWu

11th March 2019, 16:32

The VMAF scores of the 10-bit src (although no conversion being done) still differ.

what is the reason for that ?

Because FFmpeg filter doesn't normalize 10 bit to 8 bit like what Netflix does for calculation, hence the inconsistency.
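As a rough illustration of what normalising to the 8-bit range means: the sample values are rescaled so that 10-bit data occupies the same numeric range as 8-bit data before the features are computed. The sketch below keeps the fractional part rather than rounding; whether the Netflix code does exactly that is not stated in the thread, so treat it as an assumption.

```python
def to_8bit_range(value, bit_depth):
    """Rescale an integer sample to the 8-bit numeric range, keeping fractions."""
    if bit_depth < 8:
        raise ValueError("bit_depth must be at least 8")
    return value / float(1 << (bit_depth - 8))


print(to_8bit_range(1023, 10))  # 10-bit full scale
print(to_8bit_range(512, 10))   # 10-bit mid grey
print(to_8bit_range(200, 8))    # 8-bit input passes through unchanged
```

Features computed on values in the 0-1023 range will not match features computed on the 0-255 range, which would be consistent with the two implementations agreeing only for the 8-bit source.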

Iron_Mike

11th March 2019, 22:14

Because FFmpeg filter doesn't normalize 10 bit to 8 bit like what Netflix does for calculation, hence the inconsistency.

which is the better approach

if a 10bit ref/src is provided. up-converting an encoded/distorted clip to 10bit does not lose precision, but down-converting a 10bit ref src to 8bit to then compare to the inferior 8-bit encode loses precision/accuracy...

NF does up-res a lower res encoded clip before comparing to the higher res ref/src (same logic), so this seems odd...

do you have a link to where they state that they downsample the master to 8bit ?

Thanks.

WorBry

13th March 2019, 04:48

Finally got around to re-testing the Crowd Run 2160/50p x264 series that I kept from earlier tests with v3:

https://forum.doom9.org/showthread.php?p=1865316#post1865316

So this was testing with VapourSynth VMAF v4 in Model=1 mode which uses vmaf_4k_v0.6.1 by default (CI=False) and vmaf_4k_rb_v0.6.2 when set to CI=True.

Now in this case CI=False and CI=True produced the exact same aggregate VMAF scores, which came as a surprise:

http://i.imgur.com/f2Py225l.png (https://imgur.com/f2Py225)

Now how is that ? The Confidence Interval doc doesn't mention 4K models specifically but I would assume the 'rb' in 'vmaf_4k_rb_v0.6.2' means 'residue bootstrapping', in which case why is residue bootstrapping used to derive CI scores for 4K video, whereas the CI model for HD/SD (vmaf_b_v0.6.3) uses plain bootstrapping ? All rather confusing.

HolyWu

16th March 2019, 04:12

do you have a link to where they state that they downsample the master to 8bit ?

Netflix doesn't explicitly mention that in the documentation. It's simply done this way in their source code.

Now in this case CI=False and CI=True produced the exact same aggregate VMAF scores, which came as a surprise:

I think the non-bootstrapping 4K model should have been named v0.6.2 rather than v0.6.1, as the model was released after VMAF algorithm v0.6.2. And the VMAF score won't be different between residue bootstrapping and plain bootstrapping. Only the CI-related scores will be affected.

WorBry

16th March 2019, 05:02

... And the VMAF score won't be different between residue bootstrapping and plain bootstrapping. Only the CI-related scores will be affected.

OK, but still - why in the 4K (2160/50p) tests does CI=True (vmaf_4k_rb_v0.6.2) give exactly the same aggregate VMAF scores as CI=False (vmaf_4k_v0.6.1), when in the 1080/50p tests CI=True (vmaf_b_v0.6.3.pkl) gave consistently lower aggregate VMAF scores than CI=False (vmaf_v0.6.1.pkl) ?

HolyWu

16th March 2019, 05:22

OK, but still - why in the 4K (2160/50p) tests does CI=True (vmaf_4k_rb_v0.6.2) give exactly the same aggregate VMAF scores as CI=False (vmaf_4k_v0.6.1), when in the 1080/50p tests CI=True (vmaf_b_v0.6.3.pkl) gave consistently lower aggregate VMAF scores than CI=False (vmaf_v0.6.1.pkl) ?

If vmaf_4k_v0.6.1 is actually trained with the same underlying environment as vmaf_4k_rb_v0.6.2, then they are expected to have the same VMAF scores. vmaf_v0.6.1.pkl was trained with a different underlying environment compared to vmaf_b_v0.6.3.pkl, hence they don't give the same VMAF scores.

WorBry

16th March 2019, 06:03

So why don't they update the 'classic' non-bootstrapping HD/SD model, trained in the same environment as vmaf_b_v0.6.3, so that CI=False and CI=True produce the same aggregate VMAF scores as well? Surely it's important to have consistent outcomes ?

HolyWu

16th March 2019, 06:11

WorBry

16th March 2019, 06:38

Fair enough ;)

Iron_Mike

30th March 2019, 02:35

already posted this in another thread (https://forum.doom9.org/showpost.php?p=1870245&postcount=269), but thought I'd post VMAF results here as well

from a 16-bit EXR source (15 secs, 360 frames), made nine (9) x265 encodes, all CRF 10, in these formats (using Wolfberry ffmpeg build): yuv420p, yuv422p, yuv444p, yuv420p10le, yuv422p10le, yuv444p10le, yuv420p12le, yuv422p12le, yuv444p12le

VS VMAF results (sources were down-converted to yuv444p10, if higher, since that is the highest input format supported)

https://i.imgur.com/o5b4UNC.png

FFMPEG VMAF results (internally converts to yuv444p10, if higher source)

https://i.imgur.com/lK1bCMg.png

as you can see VMAF score indication is the same in both, but the SSIM/MS-SSIM differ... now besides that FFMPEG has that odd dip (scoring 8-bit higher than 10/12-bit), the VS VMAF results are almost flat...

so does VS VMAF internally convert everything to 8-bit (although it supports up to yuv444p10 input format) ?

poisondeathray

30th March 2019, 03:29

so does VS VMAF internally convert everything to 8-bit (although it supports up to yuv444p10 input format) ?

That's what HolyWu said, above - as per Netflix's source code

Netflix doesn't explicitly mention that in the documentation. It's simply done this way in their source code.

And the difference is that ffmpeg's vmaf implementation does not do that conversion.

HolyWu

30th March 2019, 03:33

as you can see VMAF score indication is the same in both, but the SSIM/MS-SSIM differ... now besides that FFMPEG has that odd dip (scoring 8-bit higher than 10/12-bit), the VS VMAF results are almost flat...

so does VS VMAF internally convert everything to 8-bit (although it supports up to yuv444p10 input format) ?

Yes. vmafossexec (the CLI of libvmaf) also does this normalization for 10-bit input. The normalized values are stored in floating-point, hence you needn't worry about precision loss. If you enable PSNR calculation in both VS libvmaf and FFmpeg libvmaf as well, you'll probably see a bigger difference.
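A minimal Python sketch of why a floating-point normalization does not throw information away. It assumes the normalization is a plain division by 4 (the ratio between the 10-bit and 8-bit code ranges); the function name is hypothetical and the actual libvmaf code may differ in detail.

```python
def normalize_10bit_to_8bit_scale(sample: int) -> float:
    """Map a 10-bit code value (0..1023) onto the 8-bit scale as a float.

    Assumption: the normalization is a plain division by 4.
    """
    if not 0 <= sample <= 1023:
        raise ValueError("expected a 10-bit code value in 0..1023")
    return sample / 4.0


def distinct_normalized_levels() -> int:
    """Count how many distinct float values the 1024 input levels map to."""
    return len({normalize_10bit_to_8bit_scale(v) for v in range(1024)})


def is_exactly_reversible() -> bool:
    """Check that multiplying back by 4 recovers every original code value."""
    return all(normalize_10bit_to_8bit_scale(v) * 4.0 == v for v in range(1024))


if __name__ == "__main__":
    print(normalize_10bit_to_8bit_scale(1), normalize_10bit_to_8bit_scale(1023))
    print("distinct levels:", distinct_normalized_levels())
    print("reversible:", is_exactly_reversible())
```

Because the in-between steps are kept as fractions (0.25, 0.5, 0.75) instead of being rounded to integers, all 1024 input levels stay distinct on the 8-bit scale.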

Iron_Mike

30th March 2019, 07:49

@HolyWu:

since everything gets down-converted to 8-bit internally, why are you guys not making 8-bit input mandatory in VS VMAF ?

I mean the 10-bit input support is pointless, and VS VMAF already requires same format for ref/dist, so the user is already required to convert in most cases before calling VS VMAF...

and btw, I mentioned this in the other thread:

when I use yuv444p (8-bit) as input format (coming from RGB48le) in VS VMAF compared to using yuv444p10 (10-bit), the range of VMAF/SSIM/MS-SSIM values is compressed (closer together)... since everything gets converted to 8-bit internally anyways, the range of values should pretty much be the same... unless the result of the filters I use to down-convert to 8-bit is substantially different to what VMAF uses internally... (I use fmtc or vs.resize)

alongside the other VMAF results, this is the result if I use 8-bit input w/ VS VMAF:

VS VMAF results (sources were down-converted to yuv444p)
https://i.imgur.com/KlXZtPd.png