Difference in PSNR and SSIM calculations between x264 and MSU
ramprasad85
8th October 2009, 08:19
E:\raw_videos>x264 -B 512 foreman_176x144_300.yuv -o NUL --dump-yuv fore512.yuv --psnr --ssim
x264 [info]: 176x144 (given by file name) @ 25.00 fps
x264 [info]: using cpu capabilities: MMX2 SSE2 Cache64
x264 [info]: profile High, level 2.0
x264 [info]: slice I:2 Avg QP:16.95 size: 9926 PSNR Mean Y:45.68 U:48.21 V:49.22 Avg:46.43 Global:44.69
x264 [info]: slice P:165 Avg QP:15.08 size: 3882 PSNR Mean Y:46.12 U:48.03 V:48.93 Avg:46.76 Global:46.46
x264 [info]: slice B:133 Avg QP:19.38 size: 724 PSNR Mean Y:43.91 U:47.19 V:48.01 Avg:44.82 Global:44.67
x264 [info]: consecutive B-frames: 16.1% 68.5% 14.1% 1.3%
x264 [info]: mb I I16..4: 0.0% 24.2% 75.8%
x264 [info]: mb P I16..4: 0.1% 2.0% 2.1% P16..4: 31.0% 30.4% 33.6% 0.0% 0.0% skip: 0.8%
x264 [info]: mb B I16..4: 0.0% 0.1% 0.2% B16..8: 30.7% 8.3% 12.7% direct:11.3% skip:36.8% L0:21.9% L1:27.1% BI:50.9%
x264 [info]: final ratefactor: 11.85
x264 [info]: 8x8 transform intra:41.9% inter:44.6%
x264 [info]: coded y,uvDC,uvAC intra:98.0% 99.4% 93.6% inter:59.0% 60.8% 30.0%
x264 [info]: ref P L0 81.6% 11.3% 7.2%
x264 [info]: ref B L0 90.3% 9.7%
x264 [info]: SSIM Mean Y:0.9931860
x264 [info]: PSNR Mean Y:45.139 U:47.656 V:48.523 Avg:45.898 Global:45.559 kb/s:504.48
encoded 300 frames, 88.89 fps, 504.91 kb/s
Using "MSU video Quality Measurement Tool" version 2.01 Beta, Average YPSNR was "44.73495" and Average YSSIM(precise) was 0.99192
MSU's PSNR formula can be found here. http://www.compression.ru/video/quality_measure/info_en.html#psnr
How come the values are different?
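For reference, a minimal sketch of the usual PSNR definition for 8-bit samples, plus one possible source of small differences between tools: averaging per-frame PSNR versus computing a single PSNR from the pooled error. The frame data and the helper names `psnr` and `mse` are hypothetical, and whether this averaging choice explains the gap reported above is an assumption, not something the thread establishes.

```python
import math

def psnr(mse_value, peak=255.0):
    """PSNR in dB for a given mean squared error (8-bit samples by default)."""
    if mse_value == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse_value)

def mse(ref, dist):
    """Mean squared error between two equally sized flat sample lists."""
    return sum((a - b) ** 2 for a, b in zip(ref, dist)) / len(ref)

# Two hypothetical 4-sample "frames": one lightly distorted, one more heavily.
ref_frames = [[10, 20, 30, 40], [10, 20, 30, 40]]
dist_frames = [[11, 20, 30, 40], [14, 24, 34, 44]]

frame_mse = [mse(r, d) for r, d in zip(ref_frames, dist_frames)]

# Average of the per-frame PSNR values.
mean_psnr = sum(psnr(m) for m in frame_mse) / len(frame_mse)

# One PSNR computed from the error pooled over all frames.
global_psnr = psnr(sum(frame_mse) / len(frame_mse))

print(f"mean of per-frame PSNR: {mean_psnr:.2f} dB")
print(f"PSNR of pooled MSE:     {global_psnr:.2f} dB")
```

The per-frame mean is never lower than the pooled figure, which matches the ordering of the Avg and Global values in the log above.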
Dark Shikari
8th October 2009, 08:33
x264 uses a method similar to MSU's "fast" SSIM calculation mode. The blocks are overlapped differently specifically to avoid the pitfall of the SSIM calculation blocks matching x264's DCT blocks.
For PSNR, there will be a slight discrepancy as x264 doesn't deblock unreferenced frames (e.g. B-frames) unless you use --dump-yuv. However, you are using --dump-yuv, so there is most likely something wrong with your decoding somewhere.
Also, if you're trying to optimize for PSNR or SSIM, use --tune psnr or --tune ssim, respectively. By default, x264's psy optimizations will greatly decrease both.
ramprasad85
8th October 2009, 10:31
OK, when encoding with "-b 0" and "--keyint 1", the PSNR and SSIM values are quite close (though not equal).
One observation:
In all-I-frames mode the PSNR and SSIM values follow a strange pattern (the "insert image" function is not working, so I have attached it). It looks like a sawtooth! What is the reason for this behavior?
E:\raw_videos>x264 -o NUL --dump-yuv fore_512_intra.yuv foreman_176x144_300.yuv --keyint 1 --psnr --ssim
x264 [info]: 176x144 (given by file name) @ 25.00 fps
x264 [info]: using cpu capabilities: MMX2 SSE2 Cache64
x264 [info]: profile High, level 1.1
x264 [info]: slice I:300 Avg QP:28.81 size: 2828 PSNR Mean Y:35.74 U:41.24 V:42.30 Avg:36.97 Global:36.76
x264 [info]: mb I I16..4: 2.7% 50.4% 46.9%
x264 [info]: 8x8 transform intra:50.4% inter:-1.$%
x264 [info]: coded y,uvDC,uvAC intra:88.2% 80.6% 48.5% inter:-1.$% -1.$% -1.$%
x264 [info]: SSIM Mean Y:0.9571695
x264 [info]: PSNR Mean Y:35.736 U:41.238 V:42.300 Avg:36.973 Global:36.764 kb/s:565.59
encoded 300 frames, 146.56 fps, 573.36 kb/s
http://img527.imageshack.us/img527/1732/foreman512intra.jpg
Dark Shikari
8th October 2009, 10:35
To begin with, you should update to the latest x264... and host your images on an external site (e.g. imageshack) so you don't have to wait for approval on your uploaded images.
ramprasad85
8th October 2009, 10:50
Sorry about that,
With the latest exe there is no such pattern. It works perfectly.
RunningSkittle
8th October 2009, 18:24
Also no point in using --ssim AND --psnr, pick one or the other!
jakor
9th October 2009, 01:59
"x264 uses a method similar to MSU's 'fast' SSIM calculation mode. The blocks are overlapped differently specifically to avoid the pitfall of the SSIM calculation blocks matching x264's DCT blocks."
Is there still a need to do "fast" SSIM now?
I did a full-picture SSIM calculation in pure C and it takes less than 1 ms. Spend one more hour adding SSE and the calculation time drops by an order of magnitude.
Dark Shikari
9th October 2009, 02:04
"Is there still a need to do 'fast' SSIM now?
I did a full-picture SSIM calculation in pure C and it takes less than 1 ms. Spend one more hour adding SSE and the calculation time drops by an order of magnitude."
You likely didn't do the "precise" method.
"Precise" SSIM calculation involves performing analysis in a Gaussian-weighted window around every single pixel separately.
Let us say that we use a 10x10 window (and zero everything beyond that). That's 10*10*1920*1080*8 floating point operations, or just over 1.65 gigaflops for one 1080p frame.
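The estimate above can be checked with a few lines of arithmetic. The factor of 8 operations per window sample is the figure assumed in the post; the variable names are illustrative only.

```python
width, height = 1920, 1080
window_samples = 10 * 10   # samples in the window around each pixel
ops_per_sample = 8         # per-sample operation count assumed in the post

naive_ops = window_samples * width * height * ops_per_sample
print(f"{naive_ops:,} operations per 1080p frame")  # 1,658,880,000
```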
jakor
9th October 2009, 02:39
I did it as described at http://en.wikipedia.org/wiki/Structural_similarity, taking a window the size of the whole picture.
The result looked plausible.
And how does one go from per-frame SSIM to a per-stream value? Is this measure additive?
"Let us say that we use a 10x10 window (and zero everything beyond that). That's 10*10*1920*1080*8 floating point operations, or just over 1.65 gigaflops for one 1080p frame."
Compared to that, how many flops does x264 take to encode one frame? I assume at least 50-100 times more. What do you think?
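A minimal sketch of the whole-picture variant described in this post: a single SSIM evaluation with one window covering the entire picture, rather than a Gaussian-weighted window around every pixel. The constants follow the standard SSIM definition (K1 = 0.01, K2 = 0.03, peak 255); the function name `global_ssim`, the use of population variance, and the sample data are assumptions for illustration, not the poster's actual code.

```python
def global_ssim(x, y, peak=255.0):
    """SSIM of two equally sized flat sample lists, using one window that
    spans the whole picture (no local windows, no Gaussian weighting)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variance and covariance (an assumption; some
    # implementations divide by n - 1 instead).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * peak) ** 2   # stabilising constants from the SSIM definition
    c2 = (0.03 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

reference = [10, 20, 30, 40, 50, 60]   # hypothetical samples
distorted = [12, 18, 33, 41, 47, 62]
print(global_ssim(reference, reference))
print(global_ssim(reference, distorted))
```

This needs only a handful of sums per picture, which is why it runs so quickly, but it is not the same measurement as the per-pixel windowed method discussed elsewhere in the thread.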
Dark Shikari
9th October 2009, 02:47
"Let us say that we use a 10x10 window (and zero everything beyond that). That's 10*10*1920*1080*8 floating point operations, or just over 1.65 gigaflops for one 1080p frame."
"Compared to that, how many flops does x264 take to encode one frame? I assume at least 50-100 times more. What do you think?"
First of all, x264 doesn't even use "flops" at all, as video encoding is nearly entirely integer math. But if we consider clock cycles, x264 uses up to 10 times fewer clock cycles per frame than that.
cogman
9th October 2009, 05:06
"First of all, x264 doesn't even use 'flops' at all, as video encoding is nearly entirely integer math. But if we consider clock cycles, x264 uses up to 10 times fewer clock cycles per frame than that."
Interesting. This is off topic, but I've always thought (wrongly, apparently) that video encoding was mainly floating point code.
I guess I arrived at that assumption because I thought SSE/SSE2 instructions were used quite heavily. Aren't SSE instructions primarily floating point instructions? (I'm not challenging you, as I know you know what you're talking about.)
jakor
9th October 2009, 05:49
Interesting... given that a 1080p picture has 8100 macroblocks and the overall clock cycle count is about 165 million, that is 20 thousand clock cycles per macroblock.
This number seems doubtful to me, considering motion estimation, DCT, IDCT, quantization, deblocking and all the other computationally intensive functions.
I think it is at least 10 times bigger. It's easy to check with rdtsc (and it also depends highly on the preset).
Also take into account that the floating point SSIM calculations can be done effectively with SSE floating point support, and faster than ordinary integer math (not using SSE), which can significantly reduce the number of flops needed.
jakor
9th October 2009, 05:50
"Interesting. This is off topic, but I've always thought (wrongly, apparently) that video encoding was mainly floating point code.
I guess I arrived at that assumption because I thought SSE/SSE2 instructions were used quite heavily. Aren't SSE instructions primarily floating point instructions? (I'm not challenging you, as I know you know what you're talking about.)"
SSE/SSE2 etc. work both ways: either packed float (or double) or packed integer.
Dark Shikari
9th October 2009, 06:18
"Interesting. This is off topic, but I've always thought (wrongly, apparently) that video encoding was mainly floating point code.
I guess I arrived at that assumption because I thought SSE/SSE2 instructions were used quite heavily. Aren't SSE instructions primarily floating point instructions? (I'm not challenging you, as I know you know what you're talking about.)"
Remember here, we're processing 8-bit input samples.
So if you need to do a motion search, you have to search those 8-bit integer samples. No float.
Need to perform motion interpolation? Well, you're working on those 8-bit samples. No float.
Need to transform the residual? Well, the differences between predicted and actual samples are 9-bit signed values. Still not float.
(repeat ad infinitum)
Avoiding float is good, because integer is faster. There's generally no need for the kind of range that float offers; even if one needs fractional precision, fixed point usually does the same job better.
"Interesting... given that a 1080p picture has 8100 macroblocks and the overall clock cycle count is about 165 million, that is 20 thousand clock cycles per macroblock.
This number seems doubtful to me, considering motion estimation, DCT, IDCT, quantization, deblocking and all the other computationally intensive functions.
I think it is at least 10 times bigger. It's easy to check with rdtsc (and it also depends highly on the preset).
Also take into account that the floating point SSIM calculations can be done effectively with SSE floating point support, and faster than ordinary integer math (not using SSE), which can significantly reduce the number of flops needed."
Well, 1.65 billion cycles per frame is about 200,000 cycles per macroblock. This is (from memory; I don't have exact numbers on me) more than enough to run the x264 default settings, which include multiref motion estimation, RD optimization, trellis quantization, and so forth.
This probably scales down to under 20,000 cycles for --preset ultrafast.
(It also scales up to over 2 million in --preset placebo).
SSE does not make floating point faster than integer, since you can fit 16 8-bit ints (or 8 16-bit ints) in an SSE register, while you can only fit 4 floats. And of course the latency and throughput of floating point instructions are much worse than those of integer instructions.
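The per-macroblock figure and the register-packing argument above can be reproduced as follows. This is only a sketch of the arithmetic: the 1.65 billion cycle budget is the number from the thread, 16x16 is the H.264 macroblock size, and 128 bits is the width of an SSE register.

```python
import math

width, height = 1920, 1080

# Macroblock counts for a 1080p frame (16x16 macroblocks).
mb_exact = (width / 16) * (height / 16)                     # 8100.0, the figure used in the thread
mb_padded = math.ceil(width / 16) * math.ceil(height / 16)  # 8160, with the height padded to 1088

cycles_per_frame = 1.65e9
cycles_per_mb = cycles_per_frame / mb_exact
print(f"{cycles_per_mb:,.0f} cycles per macroblock")        # roughly 200,000

# Elements that fit in one 128-bit SSE register, by element width in bits.
lanes = {bits: 128 // bits for bits in (8, 16, 32)}
print(lanes)  # 8-bit ints, 16-bit ints, 32-bit floats
```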
akupenguin
9th October 2009, 06:29
"Let us say that we use a 10x10 window (and zero everything beyond that). That's 10*10*1920*1080*8 floating point operations, or just over 1.65 gigaflops for one 1080p frame."
I calculate (assuming you have a multiply-accumulate instruction, and a fast approximate division (i.e. rcpss counts as one flop), and ignoring the fact that only a few of these are actually floating-point):
3 flops per pixel to do the multiplies in the inner loop
4*2*9 flops per pixel to do the separable 10x10 gaussian blur on the 4 intermediate sums
14 flops per pixel to combine the above into a per-pixel ssim score
1 flop per pixel to sum over all pixels
= 195 Mflops per 1080p frame
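Summing the per-pixel items listed above gives a quick sanity check of the total. The dictionary keys paraphrase the post, and the frame size of 1920x1080 is an assumption about what the total was based on; a straight sum comes out at roughly 187 million, in the same range as the 195 Mflops quoted, and the thread does not say where the small difference comes from.

```python
per_pixel_flops = {
    "multiplies in the inner loop": 3,
    "separable 10x10 blur: 4 sums x 2 passes x 9": 4 * 2 * 9,
    "combine into a per-pixel SSIM score": 14,
    "sum over all pixels": 1,
}

flops_per_pixel = sum(per_pixel_flops.values())
total_flops = flops_per_pixel * 1920 * 1080

print(flops_per_pixel)                 # 90
print(f"{total_flops / 1e6:.1f} Mflops per 1080p frame")
```

Either way, the separable blur cuts the cost by roughly a factor of nine compared with the naive 10*10*8 operations per pixel.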
Dark Shikari
9th October 2009, 06:33
"I calculate (assuming you have a multiply-accumulate instruction, and a fast approximate division (i.e. rcpss counts as one flop), and ignoring the fact that only a few of these are actually floating-point):
3 flops per pixel to do the multiplies in the inner loop
4*2*9 flops per pixel to do the separable 10x10 gaussian blur on the 4 intermediate sums
14 flops per pixel to combine the above into a per-pixel ssim score
1 flop per pixel to sum over all pixels
= 195 Mflops per 1080p frame"
OK, so I was overestimating the cost (I was also assuming a naive C-only implementation...).
But the point stands: it's very non-trivial compared to the actual time spent encoding.
jakor
9th October 2009, 07:25
"SSE does not make floating point faster than integer, since you can fit 16 8-bit ints (or 8 16-bit ints) in an SSE register, while you can only fit 4 floats. And of course the latency and throughput on floating point instructions is much worse than integer."
Comparing apples to apples: let's assume x264 takes 200K clock cycles per macroblock, which are ALREADY a mix of SSE and plain integer arithmetic. You cannot pack any more integers into SSE registers than it does now.
And we have 30K flops per macroblock for SSIM, which is easily reduced by a factor of 4 using SSE, to 7.5K.
That is 4% of the default-settings mode, more or less, if the input is correct.
"OK, so I was overestimating the cost"
Very nice overestimating approach, BTW: overestimate your opponent's numbers by a factor of 10, and underestimate the numbers for your own point by a factor of 10 as well.
Dark Shikari
9th October 2009, 07:29
"SSE does not make floating point faster than integer, since you can fit 16 8-bit ints (or 8 16-bit ints) in an SSE register, while you can only fit 4 floats. And of course the latency and throughput on floating point instructions is much worse than integer."
"Comparing apples to apples: let's assume x264 takes 200K clock cycles per macroblock, which are ALREADY a mix of SSE and plain integer arithmetic. You cannot pack any more integers into SSE registers than it does now.
And we have 30K flops per macroblock for SSIM, which is easily reduced by a factor of 4 using SSE, to 7.5K.
That is 4% of the default-settings mode, more or less, if the input is correct."
SSE instructions are not magically free even if you're not using SSE registers. Modern x86 chips only have three arithmetic units (at least, only three that can be used simultaneously).
Akupenguin's numbers were already calculated with the assumption of SIMD.
Furthermore, 30kflops per MB is enough time to run half a dozen RD calls. We're not going to waste that on SSIM calculation.
"'OK, so I was overestimating the cost'
Very nice overestimating approach, BTW: overestimate your opponent's numbers by a factor of 10, and underestimate the numbers for your own point by a factor of 10 as well."
If I am now your "opponent", this discussion is over and you are now welcome to my ignore list.
jakor
9th October 2009, 07:36
DS, I suggest you don't take this as a personal offense; I'm just pointing out an obvious mistake in your words, which shows a certain tendency ;-)
"Akupenguin's numbers were already calculated with the assumption of SIMD"
"I calculate (assuming you have a multiply-accumulate instruction, and a fast approximate division (i.e. rcpss counts as one flop), and ignoring the fact that only a few of these are actually floating-point):
3 flops per pixel to do the multiplies in the inner loop
4*2*9 flops per pixel to do the separable 10x10 gaussian blur on the 4 intermediate sums
14 flops per pixel to combine the above into a per-pixel ssim score
1 flop per pixel to sum over all pixels"
rcpSS is a scalar single-precision instruction; it is not a packed instruction.
Every line contains the words PER PIXEL; where in this post do you see any reference to SIMD?
Am I reading something wrong?
P.S. When someone sets his point or opinion against another's, he is called an opponent. My point was that a full SSIM calculation is just a small fraction of the whole x264 encoding process; your opinion is that it is not. This is what is usually called opposing.
Sorry for the off-topic.