View Full Version : What is current status for hardware H.265 encoding.


CruNcher
11th February 2017, 22:29
Yeah, passively cooling the higher-end 14nm chips is a no-go: with a base TDP rating of 15W they can surely top out at 25W.

This is the first time we see HEVC B-frames coming out of a GPU-implemented encoder :D

Could you try a run without B-frames and also one with --avsw? And can the fastest HEVC preset reach the FPS you got with the best AVC preset?

Also, don't forget to disable any power-saving functions, though you will be throttled quickly no matter what with that fanless design.

PS: The banding result of your balanced encode is very nice, that's visible immediately: much less perceptible banding in the sky scene :)

Overall the decoding complexity is higher than for my encode: Nvidia's CUDA decoder drops below 60 FPS a lot more often.


I wonder if that banding result difference comes from a more efficient 10->8 bit Hardware Dithering conversion compared to the default --avsw in NVENCC

Input Info avsw: hevc(yv12(10bit))->nv12 [SSE2], 3840x2160, 60000/1001 fps

also indeed it's rather strange that NVENCC shows here yv12(10bit) instead of p010(10bit)

Does rigaya wrongly do p010->yv12->nv12 ?

easyfab
11th February 2017, 23:57
HEVC 8bit fastest preset without Bframes :

QSVEncC (x64) 2.62 (r1192) by rigaya, Jan 8 2017 23:11:24 (VC 1900/Win/avx2)
OS Windows 10 (x64)
CPU Info Intel Core m3-7Y30 @ 1.00GHz [TB: 1.61GHz] (2C/4T) <Skylake>
GPU Info Intel HD Graphics 615 (24EU) 300-900MHz [4W] (21.20.16.4589)
Media SDK QuickSyncVideo (hardware encoder) PG, 1st GPU, API v1.22
Async Depth 6 frames
Buffer Memory d3d9, 1 input buffer, 31 work buffer
Input Info avqsv video: HEVC, 3840x2160, 60000/1001 fps
VPP Enabled ColorFmtConvertion: nv12(10bit) -> nv12
Output HEVC main @ Level auto
3840x2160p 1:1 59.940fps (60000/1001fps)
avwriter: hevc => matroska
Target usage 7 - fastest
Encode Mode Bitrate Mode - VBR
Bitrate 23500 kbps
Max Bitrate 30000 kbps
QP Limit min: none, max: none
Trellis Auto
Ref frames 4 frames
Bframes none
Max GOP Length 600 frames
Scene Change off
Ext. Features PerMBRC

encoded 6642 frames, 38.97 fps, 22514.68 kbps, 297.41 MB
encode time 0:02:50, CPULoad: 26.93%
frame type IDR 1
frame type I 12, total size 3.07 MB
frame type P 6630, total size 294.34 MB

Quality is not that good, and encoding is slower without B-frames: 38.97 fps vs 46.71 fps with B-frames.

https://www.sendspace.com/file/mm5lj0

For --avsw: I stopped after about 2 min, but it ran at ~16 fps.

encoded 2187 frames, 15.90 fps, 55032.13 kbps, 239.36 MB
encode time 0:02:18, CPULoad: 80.07%

For the fastest HEVC encoding on my little CPU (with B-frames):

QSVEncC (x64) 2.62 (r1192) by rigaya, Jan 8 2017 23:11:24 (VC 1900/Win/avx2)
OS Windows 10 (x64)
CPU Info Intel Core m3-7Y30 @ 1.00GHz [TB: 1.49GHz] (2C/4T) <Skylake>
GPU Info Intel HD Graphics 615 (24EU) 300-900MHz [4W] (21.20.16.4589)
Media SDK QuickSyncVideo (hardware encoder) PG, 1st GPU, API v1.22
Async Depth 6 frames
Buffer Memory d3d9, 1 input buffer, 34 work buffer
Input Info avqsv video: HEVC, 3840x2160, 60000/1001 fps
VPP Enabled ColorFmtConvertion: nv12(10bit) -> nv12
Output HEVC main @ Level auto
3840x2160p 1:1 59.940fps (60000/1001fps)
avwriter: hevc => matroska
Target usage 7 - fastest
Encode Mode Bitrate Mode - VBR
Bitrate 24000 kbps
Max Bitrate 30000 kbps
QP Limit min: none, max: none
Trellis Auto
Ref frames 2 frames
Bframes 3 frames, B-pyramid: off
Max GOP Length 600 frames
Scene Change off
Ext. Features PerMBRC

encoded 6642 frames, 46.71 fps, 22262.31 kbps, 294.08 MB
encode time 0:02:22, CPULoad: 27.58%

With HEVC 10bit it's slower: ~35 fps.
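
For reference, the fps figures in these logs can be related to the 60000/1001 fps source as a realtime factor. The Python lines below are only a sketch for illustration; the helper names are made up here and are not part of QSVEncC.

```python
from fractions import Fraction

# Source frame rate as reported in the logs above (60000/1001 fps).
SOURCE_FPS = Fraction(60000, 1001)


def realtime_factor(encode_fps, source_fps=SOURCE_FPS):
    """Encoding speed as a multiple of realtime (1.0 = realtime)."""
    return encode_fps / float(source_fps)


def encode_seconds(frames, encode_fps):
    """Wall-clock seconds needed to encode `frames` at `encode_fps`."""
    return frames / encode_fps


# fps values copied from the logs in this post
runs = {
    "TU7 8bit, no B-frames": 38.97,
    "TU7 8bit, 3 B-frames": 46.71,
    "--avsw (stopped early)": 15.90,
}

for name, fps in runs.items():
    print(f"{name}: {realtime_factor(fps):.2f}x realtime")
```

With these numbers, 38.97 fps is about 0.65x realtime and 46.71 fps about 0.78x, and 6642 frames at 38.97 fps take roughly 170 s, which agrees with the logged encode time of 0:02:50.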

CruNcher
12th February 2017, 00:08
Ohh, your encode is 10 bit; I'm surprised it plays on the CUDA decoder.

Ahh OK, it didn't: playback switched to avcodec, which explains the complexity difference ;)

So no 8 bit conversion was done at all, which also explains the difference in banding.

FPS wise a nice result for such a Power Target :)

Yeah, the CPU decoding of that bitstream lowers the overall encoding performance significantly.


Decoding the bitstream of that fastest result is extremely hard for Nvidia's CUDA decoder: throughout the whole bitstream I reach only 30 fps at best, where I often reach 60 with my encode.

Could you try to reach 30 fps encoding one preset up, TU 6 (fast), with no B-frames, but with the 8 bit version?

And make a balanced TU 4 version of that 8 bit encode too; of course one with the best preset would be a nice comparison point as well :)

So 8 bit VBR version of

TU 1 = ??
TU 4 = ??
TU 6 = ??
TU 7 = 38.97 fps (available) https://forum.doom9.org/showpost.php?p=1796814&postcount=202

without B-frames and the FPS you achieve on your m3

JohnLai
12th February 2017, 03:41
Wait....KabyLake finally adds P-frame support for its HEVC?

EDIT:
easyfab,
Could you provide a small 1080p sample with TU1, ICQ=23 (maybe test with ICQ-LA too; who knows whether Kaby Lake supports it?), 10bit, HEVC, B-frames=6 or 8, Ref=6, B-pyramid=on (just pass the option; if it doesn't work, it will fall back to off), Trellis=all?

CruNcher
12th February 2017, 05:36
That TU 7 encode has heavier visual issues (failures) than my Nvidia encode and is overall more complex to decode. TU 7 does not seem to be a good basis for comparison at all; fades in particular fail.

Let's see what the SSIM difference is and how those failing fades affect it.

Easyfabs TU 7 8 Bit Encode
SSIM

Y:0.939540 (12.185293)
U:0.959835 (13.961549)
V:0.955864 (13.552082)
All:0.945643 (12.647447)

Easyfabs TU 4 10 Bit Encode with B-frames

SSIM

Y:0.949048 (12.928351)
U:0.970515 (15.304042)
V:0.967422 (14.870695)
All:0.955688 (13.534776)

This shows up clearly as better visual stability and less banding, and all that at a lower overall bitrate.
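
A note on reading these figures: the values in parentheses appear to be the SSIM scores on a decibel scale, -10*log10(1 - SSIM), and the "All" score is consistent with weighting Y, U and V as 4:1:1, which matches the plane sizes of 4:2:0 video. That is an inference from the posted numbers, not something stated in the thread, and the function names below are made up for this sketch.

```python
import math


def ssim_to_db(ssim):
    """Map an SSIM score in [0, 1) to decibels: -10 * log10(1 - SSIM)."""
    return -10.0 * math.log10(1.0 - ssim)


def ssim_all_420(y, u, v):
    """Combined score for 4:2:0 video, assuming the luma plane carries
    4x the samples of each chroma plane (weights 4:1:1)."""
    return (4.0 * y + u + v) / 6.0


# Per-plane SSIM values copied from the two results above
results = {
    "TU7 8bit": (0.939540, 0.959835, 0.955864),
    "TU4 10bit, B-frames": (0.949048, 0.970515, 0.967422),
}

for name, (y, u, v) in results.items():
    combined = ssim_all_420(y, u, v)
    print(f"{name}: All={combined:.6f} ({ssim_to_db(combined):.3f} dB)")
```

On the decibel scale the gap between the two encodes is a little under 0.9 dB, which is easier to read than a difference in the second decimal place of the raw SSIM.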

easyfab
12th February 2017, 10:52
@JohnLai

la-icq -> not supported

2k sample (crowd_run) with your settings 55MB https://www.sendspace.com/file/dsswn6

B pyramid is not supported on current platform, disabled.
trellis is not supported on current platform, disabled.
cop.PicTimingSEI value changed off -> auto by driver
cop3.DirectBiasAdjustment value changed off -> auto by driver
cop3.GlobalMotionBiasAdjustment value changed off -> auto by driver
QSVEncC (x64) 2.62 (r1192) by rigaya, Jan 8 2017 23:11:24 (VC 1900/Win/avx2)
OS Windows 10 (x64)
CPU Info Intel Core m3-7Y30 @ 1.00GHz [TB: 1.61GHz] (2C/4T) <Skylake>
GPU Info Intel HD Graphics 615 (24EU) 300-900MHz [4W] (21.20.16.4589)
Media SDK QuickSyncVideo (hardware encoder) PG, 1st GPU, API v1.22
Async Depth 5 frames
Buffer Memory d3d9, 3 input buffer, 31 work buffer
Input Info y4m: yv12->nv12[AVX2], 1920x1080, 50/1 fps
VPP Enabled ColorFmtConvertion: nv12 -> nv12(10bit)
Output HEVC main10 @ Level auto
1920x1080p 1:1 50.000fps (50/1fps)
avwriter: hevc => matroska
Target usage 1 - best
Encode Mode ICQ (Intelligent Const. Quality)
ICQ Quality 23
QP Limit min: none, max: none
Trellis Auto
Ref frames 6 frames
Bframes 8 frames, B-pyramid: off
Max GOP Length 500 frames
Scene Change off

encoded 500 frames, 3.73 fps, 46430.36 kbps, 55.35 MB
encode time 0:02:14, CPULoad: 4.02%
frame type IDR 1
frame type I 1, total size 0.53 MB
frame type P 56, total size 12.43 MB
frame type B 443, total size 42.39 MB

Don't look at the speed: I used wifi to access the source.



@CruNcher

There are only 3 speed presets, because faster=fastest ....

I encoded the first 500 frames of Samsung_journey with :
- HEVC 8bit no Bframes preset fastest
- HEVC 8bit no Bframes preset balanced
- HEVC 8bit no Bframes preset best
- HEVC 10bit no Bframes preset best

I hope it's enough for you to test

https://www.sendspace.com/file/nex2x8

CruNcher
12th February 2017, 11:57
500 frames is a bit too few for the number of scenes and encoder reactions it captures, especially for those very visible fade problems with TU7.

The first 500 frames are mostly black-to-color fades and straight cuts only, but better than nothing, thx anyway :)

TU7 though shows heavy problems with transitional blend fades.

And in all encodes so far the seeking behavior is a catastrophe; you should really enable scene change detection.

What you can see immediately in the first 500 frames is that every preset decides on smaller partition block sizes than Nvidia's encoder does.

Overall, though, it's not quite fair to compare without working scene change detection. But yes, finer details, in luminance as well as chrominance, seem to be better preserved than in my currently uploaded encode, even by the fastest TU7 with its greater visual stability problems that push down the SSIM :)

CruNcher NVENCC (Maxwell)

http://i1.sendpic.org/t/kU/kU7pkD21hUTBaYLLdDJHEGixINX.jpg (http://sendpic.org/view/1/i/eTqrGhgqONmfMVY9C57mzv3Bd69.png)

Easyfab QSVENCC (Kabylake) TU7

http://i1.sendpic.org/t/3n/3nmp7l4SL8h8jRe6n7PliZtVrC2.jpg (http://sendpic.org/view/1/i/i7DpsljkPThhDasEaCcaJE2SzbN.png)

CruNcher NVENCC (Maxwell)

http://i1.sendpic.org/t/mD/mDVVPTgDDp2TF3mZsIpfcJcpk58.jpg (http://sendpic.org/view/1/i/avhEQtZcrUhVSxQPavR5jsqCBx5.png)

Easyfab QSVENCC (Kabylake) TU7

http://i1.sendpic.org/t/97/97q7dLFwjIu7UBycTdqk1nnNH2A.jpg (http://sendpic.org/view/1/i/czbAGVYrQW6PmCkrWljQgG0LvbL.png)

CruNcher NVENCC (Maxwell)

http://i1.sendpic.org/t/jK/jKjWVTHAnOjz1ePtCpXQsW5EAA9.jpg (http://sendpic.org/view/1/i/7d2hW3SiPP2IjCgoJqqi8YDpFdM.png)

Easyfab QSVENCC (Kabylake) TU7

http://i1.sendpic.org/t/hq/hqMCakYj8kQCdh5SW6F20ICe2Xm.jpg (http://sendpic.org/view/1/i/7ka8kz9EJDaoT2pcCwIARAsxTHv.png)


Though somehow it looks as if the 10->8 bit conversion was also done more efficiently inside QSVEncC than with --avsw.

There is a really big perceptual difference in the luminance, and as a result in the chrominance too, which gives shadow scenes a completely different lighting.

Shadow areas become brighter in easyfab's Intel output, as if they were indirectly lit.

JohnLai
12th February 2017, 14:18
Analysis....
Contrary to the P-frame details in the log.......there is no P-frame in the bitstream, still GPB style as usual....unless one reads those P-frame entries in the log as generalized B-frames.


https://k55.imgup.net/ActualB-Frf7cb.PNG
There are 8 actual B-frames in the bitstream. However, only 4 active frames are referenced by the currently displayed B-frame. I guess the limit for refs is 4. (Please have a look at the DPB "Used by current" entries.)


https://b06.imgup.net/PGeneralisab20.PNG
Next, for the P-frame (a B-frame that is treated as a P-frame in generalized B-frame mode): this P-frame can only refer to up to 3 reference frames.


EDIT:
No SAO support.
Intra PU sizes
4x4 8x8 16x16 32x32
Inter PU sizes
4x8 8x4 8x8 8x16 16x8 16x16 32x32

GrandPa
14th February 2017, 16:27
Here is check-features: qsvencc 2.62 shows API 1.19 but it's 1.22, so I don't know if all the features are correct.
10 bit Depth has an X but I can encode in 10bit with --profile main10.
Does qsvencc need an update to show new api features?



Quite strange results. On my machine it reports <Kabylake> for my Skylake CPU:

QSVEncC (x64) 2.62 (r1192) by rigaya, Jan 8 2017 23:11:24 (VC 1900/Win/avx2)
reader: raw, avi, avs, vpy, avqsv [H.264/AVC, HEVC, MPEG2, VC-1, VP8, VP9]
Environment Info
OS : Windows 10 (x64)
CPU: Intel Core i7-6700K @ 4.00GHz [TB: 4.19GHz] (4C/8T) <Kabylake>
RAM: Used 2400 MB, Total 16261 MB
GPU: Intel HD Graphics 530 (24EU) 1150MHz (21.20.16.4590)

Media SDK Version: Hardware API v1.19


But besides that, there is just one single difference in features compared to your post. It reports 10bit capability for my HD530:

Codec: HEVC
             CBR  VBR  AVBR  QVBR  CQP  VQP  LA  LAHRD  ICQ  LAICQ  VCM
10bit depth  o    o    x     x     o    o    x   x      o    x      o


The rest is identical. Hmmm, maybe wait for a QSVEncC update to see the real feature list of KabyLake?

(I just updated the display driver to v.4590, with older versions, feature list differs a little more)

NikosD
15th February 2017, 19:21
We got an explanation from AMD of what VBAQ means.

"VBAQ stands for “Variance Based Adaptive Quantization”.

The basic idea of VBAQ:
Human visual system is typically less sensitive to artifacts in highly textured area.

In VBAQ mode, we use pixel variance to indicate the complexity of spatial texture.

This allows us to allocate more bits to smoother areas.
Enabling such feature leads to improvements in subjective visual quality with some content."
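
To make the idea in that description concrete, here is a toy Python sketch of variance-based adaptive quantization. It is not AMD's implementation: the block size, the log2 energy measure, the strength factor and the clamping range are all assumptions chosen for illustration.

```python
import math
from statistics import pvariance


def block_variances(frame, block=4):
    """Pixel variance of each block x block tile of a luma frame
    (frame is a list of rows of integer sample values)."""
    height, width = len(frame), len(frame[0])
    rows = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            pixels = [frame[y][x]
                      for y in range(by, min(by + block, height))
                      for x in range(bx, min(bx + block, width))]
            row.append(pvariance(pixels))
        rows.append(row)
    return rows


def qp_offsets(variances, strength=1.0, max_offset=6):
    """Per-block QP offsets: blocks smoother than the frame average get a
    negative offset (more bits), textured blocks a positive one (fewer)."""
    energies = [[math.log2(v + 1.0) for v in row] for row in variances]
    flat = [e for row in energies for e in row]
    mean_energy = sum(flat) / len(flat)
    return [[max(-max_offset, min(max_offset,
                                  round(strength * (e - mean_energy))))
             for e in row]
            for row in energies]


# Hypothetical 4x8 frame: flat left block, strongly textured right block
frame = [[100] * 4 + [0, 200, 0, 200] for _ in range(4)]
print(qp_offsets(block_variances(frame)))
```

In this sketch the flat block ends up with a lower QP than the textured one, which is the "more bits to smoother areas" behavior the description talks about.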

From tests, I have seen that there is no speed impact using this feature.

Also, new VCEENC v3.05 is out fixing HEVC CBR/VBR mode, so I will upload Samsung Journey HEVC encoding by Polaris 8bit HEVC encoder.

NikosD
15th February 2017, 20:22
The latest VCEEnc v3.05 has only partially fixed the HEVC VBR bug: you still can't go above 20000 kbps.

So, my Samsung Journey HEVC encoding using VBAQ and VBR 20000, Balanced quality, pre-analysis is here:
https://www.sendspace.com/file/clckxm

The file is ~258MB, below the 300MB target, due to the VBR maximum of 20000 kbps.
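
The relation between bitrate and file size can be checked with a few lines of Python. This is a sketch under assumptions: it takes the clip to be 6642 frames at 60000/1001 fps, as in the QSVEncC logs earlier in the thread, reads "MB" as MiB (2^20 bytes), and ignores container overhead. The function names are made up for illustration.

```python
from fractions import Fraction

# Assumed source frame rate, taken from the QSVEncC logs in this thread
SOURCE_FPS = Fraction(60000, 1001)


def duration_seconds(frames, fps=SOURCE_FPS):
    """Clip duration in seconds."""
    return float(frames / fps)


def size_mib(kbps, seconds):
    """Stream size in MiB for an average bitrate in kbit/s."""
    return kbps * 1000 / 8 * seconds / 2**20


def kbps_for_size(mib, seconds):
    """Average bitrate in kbit/s needed to reach a size in MiB."""
    return mib * 2**20 * 8 / 1000 / seconds


secs = duration_seconds(6642)
print(f"duration: {secs:.2f} s")
print(f"size at 20000 kbps: {size_mib(20000, secs):.1f} MiB")
print(f"bitrate for 299 MiB: {kbps_for_size(299, secs):.0f} kbps")
```

Under these assumptions, a stream held to 20000 kbps on average comes out at roughly 264 MiB, so a file of ~258MB is about what that cap allows, and reaching the ~299 MB target would need an average of roughly 22600 kbps.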

CruNcher
16th February 2017, 11:54
The size is far off the 299 MB target we basically agreed upon, so I would see your encode as handicapped for not reaching that goal ;)
Hmm, not sure if it's the VBAQ, but perceptually the scene changes are very problematic in your encode: they are quantized too heavily before they become stable, even on fast cuts.

So at first sight in motion, the overall perceptual stability of your encode currently fluctuates the most, but it has to be seen as handicapped compared to easyfab's or my result, which hit the target bitrate and have a clear bit distribution advantage.

Y:0.933980 (11.803249)
U:0.961267 (14.119180)
V:0.957400 (13.705949)
All:0.942431 (12.398135)

NikosD
16th February 2017, 12:49
Picture/video quality is sometimes very subjective, and because I can't see any of what you describe, can you point me to the min:sec positions of all the low quality parts of the video?

CruNcher
16th February 2017, 12:52
I think we won't need that, because the bitstream here can be very nicely separated into scenes with naming conventions :)

I would say the first scenes are pretty uniform across all encoders if we don't count perceived sharpness at all.

The first scene where visually impactful differences become visible is "Walking on the Edge": in the sky you might see banding or not, though all of our encodes band more or less in 8 bit :)
Then you have the transition to "CGI Chrome Window with the Sun", which is visually very problematic in your encode, followed by the transition to "Standing forest(jungle) looking into the Sky", which is also problematic: it takes too long to go from blurry to sharp, which gives the impression of a DOF change, but there is no DOF to begin with.
Then the transition from "Walking on the Green Grass" to "Walking in the Rosefield" also shows problems, and the following "Big Rose to Face Closeup" transition shows really heavy prediction problems in the blend; both are strongly perceptible.
Then "Rose bloom flying out of the CGI Chrome Window" shows the "DOF" effect in its transition again.

Kaby Lake's TU7 also has problems with those blend transitions, but not nearly as heavy as in your encode; still, as I said, I count your encode as handicapped for now.


What is also really interesting: your encode shows the same chroma/luminance levels as mine; only easyfab's Kaby Lake QSVEnc result differs heavily in its overall perceived brightness.
And in that regard our banding results are also nearly identical, while easyfab's differs significantly as well.

"Walking on the Edge" = 00:20.000
"CGI Chrome Window Sun" = 00:27.000
"Standing forest(jungle) looking into the Sky" = 00:29.000
"Walking on the Green Grass" = 00:57.000
"Walking in the Rosefield" = 01:00.000
"Big Rose to Face Closeup" = 01:08.000
"Rose bloom flying out of the CGI Chrome Window" = 01:34.000

I didn't analyze it frame by frame, but it looks like there are some heavy prediction errors, maybe chroma related, that I don't perceive as anywhere near as bad in easyfab's TU7 Kaby Lake or my Nvidia Maxwell encode, including that blurry->sharp "DOF" effect on some cuts and transitions.

easyfab
16th February 2017, 18:39
To complete the collection

Encode with :
HEVC 8bit
BF 0
Preset T4
Scene change ( I needed to use --avsw for that ) Is it better ?

https://www.sendspace.com/file/8ttalh

CruNcher
16th February 2017, 20:01
At first sight yours and mine are slowly becoming indistinguishable. There is still the banding and brightness difference, and on a closer look I still find my transitional blend results better (more stable), but without going into the details, your TU4 result is really close in terms of perceptible stability.

Seeking-wise, yes, much better: the seeking results practically match now. Your seek time to the final picture still seems a tad higher, but you are using mkv and I am using mp4, so that could be a reason, together with the slightly higher overall decoding complexity.

A typical Blend Transition difference between our encode results

Easyfab QSVENCC TU4
http://i1.sendpic.org/t/df/dfKSEG8CwXTncpwuerZmaPIYgXP.jpg (http://sendpic.org/view/1/i/a8oVdqXQqvQHRwdo6D4Xy8vTQGU.png)
CruNcher NVENCC
http://i1.sendpic.org/t/uV/uVPDfUMLvFgBVz7MphNRjFHPLyo.jpg (http://sendpic.org/view/1/i/irRdKXG8GVp8ErcyoX3XsFjcaKt.png)

After the transition

Easyfab QSVENCC TU4
http://i1.sendpic.org/t/if/ifqIZMAG4DC6K28qxyCYahtLmBD.jpg (http://sendpic.org/view/1/i/8sajVi2fA1qgx4zqB3Tc7fnBI9u.png)
CruNcher NVENCC
http://i1.sendpic.org/t/sn/snhwI1GaypuGl0FcHSkVYYMKa18.jpg (http://sendpic.org/view/1/i/83vTNgJdDXRNTgp6nSmfnP3tBSh.png)

Beyond TU4 I guess your result is going to beat Nvidia Maxwell overall and become the reference, like your best-preset result pretty much did before in 10 bit.

And as soon as that happens we need a Pascal user. I am still trying to squeeze out a little more at the cost of overall encoding speed, so I am now waiting for the result from you that completely destroys mine, though overall you are already beating my system to death with that power efficiency, doing this on your fanless laptop ;)

For completeness

Easyfab QSVENCC TU4

SSIM

Y:0.941868 (12.355859)
U:0.963671 (14.397485)
V:0.960298 (14.011911)
All:0.948574 (12.888148)


Let's compare that again with NikosD's AMD VCEEncC encode, which I personally also find perceptually inferior overall in its current state, bordering on unacceptable in terms of stability.

Y:0.933980 (11.803249)
U:0.961267 (14.119180)
V:0.957400 (13.705949)
All:0.942431 (12.398135)

CruNcher
17th February 2017, 18:29
@NikosD
Did you try re-transcoding your current VCEEnc result without that VBAQ?

easyfab
17th February 2017, 18:49
@CruNcher

Thanks for the test

QSVENCC TU4 Bframes 0
All:0.948574 (12.888148)

QSVENCC TU 4 10 Bit Encode with B-frames
All:0.955688 (13.534776)

As 10bit alone shouldn't give so much improvement, B-frames are a real advantage for Intel vs Nvidia and AMD.

Now I hope that one day some more features like look-ahead will be added.

NikosD
17th February 2017, 19:34
@NikosD
Did you tried a retranscode of your current VCENCC result without that VBAQ ?

Here you are:
https://www.sendspace.com/file/g52pd0

CruNcher
17th February 2017, 22:44
I guess Nvidia's spatial AQ and AMD's VBAQ are pretty much the same as Fiona's spatial VAQ.

A small loss in the metric overall; most issues stay the same as in the VBAQ-enabled result, so at least VBAQ isn't responsible for those stability problems.

NOVBAQ

SSIM

Y:0.932798 (11.726161)
U:0.961508 (14.146268)
V:0.957671 (13.733606)
All:0.941728 (12.345423)

JohnLai
18th February 2017, 05:39
@CruNcher
AMD VBAQ still doesn't work.......nor is preanalysis......

NikosD
18th February 2017, 09:05
Sorry, but I can't see differences between NVIDIA, AMD and the original source.

I have already posted a free link before for JohnLai, where you can upload and compare still images by hovering your mouse on them.

Maybe it would be better if we could compare still images that way, in order to find out differences that I can't see.

(the sun is very bright here, probably I have to test again in complete dark room with screen light only)

CruNcher
18th February 2017, 11:09
Interesting. There are actually even more issues, but I don't count them because they look the same in all 3 encodes; "Walking on the Edge" = 00:20.000 is one of those, even if it comes out very differently in easyfab's result than in our encodes.

OK, I'll show you the heaviest stability issues in your encode, the ones I immediately perceive as disturbing because they break the visual flow in motion, compared to easyfab's encode, my encode and the source. Here we go.

"CGI Chrome Window Sun" = 00:27.000
http://i1.sendpic.org/t/il/ilvPQNzOEpg5uAg3Wkc0XccOaVB.jpg (http://sendpic.org/view/1/i/nqUtq00xH1bozeHpYJP3K23ve5A.png)

Looks pretty much like the keyframe itself is a total mess and the whole thing recovers slowly

"Standing forest(jungle) looking into the Sky" = 00:29.000
http://i1.sendpic.org/t/42/42ADsUCfvvXJAbml2Z9jwiCvKD4.jpg (http://sendpic.org/view/1/i/kXCSb3zGRKw8ikwnYzbgdgJ0wif.png)


This one is actually the hardest: I feel I would have big problems explaining to someone why it plays out badly in motion when at first look they just see a super nice spatial result. It is the way the dissolve itself plays out at its peak, creating strange darker-looking blocks inside the transition.
Though I'm rather surprised you didn't also perceive it as uncomfortable compared to the other results, and even as very different.

"Walking on the Green Grass" = 00:57.000
"Walking in the Rosefield" = 01:00.000
http://i1.sendpic.org/t/mz/mzcbTD5cWgJlzPeFoFGC2sUxChs.jpg (http://sendpic.org/view/1/i/gVcIA69guQ2ohCHFW8n5zNLPuzA.png)

This is hopefully pretty obvious, and I'm really surprised you didn't perceive it as uncomfortable, again compared to the other results.

"Big Rose to Face Closeup" = 01:08.000
http://i1.sendpic.org/t/hE/hEjLV2UPkbv2jFfHyhmU1Lh1DZm.jpg (http://sendpic.org/view/1/i/4hNNgYyxxoknBFcWilmd6ChodEF.png)

Also again Keyframe messy and recovering slowly with prediction errors visible

"Rose bloom flying out of the CGI Chrome Window" = 01:34.000
http://i1.sendpic.org/t/yy/yy4Atj0hkop0lgEfm4EZ6e7PF33.jpg (http://sendpic.org/view/1/i/1VnEMB2SeMspUPyDITOy3Leccfl.png)


These are the things that make me feel there is a lot going wrong in your encode in stability on the encoder side and im not sure if even more bits would help really.

NikosD
19th February 2017, 16:16
@JohnLai and CruNcher

What is the difference in terms of quality only, between Maxwell 2nd generation HEVC encoder vs Maxwell GM206 vs Pascal ?

HEVC quality only, not speed at all.

CruNcher
19th February 2017, 22:09
Between 2nd gen Maxwell and GM206: virtually zero. Between those two and Pascal: it's SAO, and with it I guess the encoder could beat Intel's, which does without it ;)

I am now trying to get closer to easyfab's TU1 result, to see whether I can somehow stay within the 300 MB limit and how that comes out perceptually, while hoping to stay in the 0.5x speed range ;)

But so far Nvidia isn't doing badly; staying close to that balanced TU4 in stability, without B-frames, is a nice achievement :)

Though if Intel further optimizes that will be gone fast as well ;)

JohnLai
20th February 2017, 04:04
Exactly the same as CruNcher said ~ Smooth All Object ~

Between maxwell and pascal, pascal encode quality is highly dependent on scene type due to SAO. (SAO can't be turned off)
For example, with SAO, blue sky is smooth and looks 'natural'. But SAO makes seawater waves (or any fluid-based movement) somewhat blurry.......

NikosD
21st February 2017, 21:30
"CGI Chrome Window Sun" = 00:27.000
http://i1.sendpic.org/t/il/ilvPQNzOEpg5uAg3Wkc0XccOaVB.jpg (http://sendpic.org/view/1/i/nqUtq00xH1bozeHpYJP3K23ve5A.png)

Looks pretty much like the keyframe itself is a total mess and the whole thing recovers slowly



No problem here.


"Standing forest(jungle) looking into the Sky" = 00:29.000
http://i1.sendpic.org/t/42/42ADsUCfvvXJAbml2Z9jwiCvKD4.jpg (http://sendpic.org/view/1/i/kXCSb3zGRKw8ikwnYzbgdgJ0wif.png)


This one is actually the hardest: I feel I would have big problems explaining to someone why it plays out badly in motion when at first look they just see a super nice spatial result. It is the way the dissolve itself plays out at its peak, creating strange darker-looking blocks inside the transition.
Though I'm rather surprised you didn't also perceive it as uncomfortable compared to the other results, and even as very different.


No problem here.


"Walking on the Green Grass" = 00:57.000


No problem here.


"Walking in the Rosefield" = 01:00.000
http://i1.sendpic.org/t/mz/mzcbTD5cWgJlzPeFoFGC2sUxChs.jpg (http://sendpic.org/view/1/i/gVcIA69guQ2ohCHFW8n5zNLPuzA.png)

This is hopefully pretty obvious, and I'm really surprised you didn't perceive it as uncomfortable, again compared to the other results.



Yes, this is the only one where I can see a difference, but if you hadn't mentioned it, I wouldn't have seen it at all.


"Big Rose to Face Closeup" = 01:08.000
http://i1.sendpic.org/t/hE/hEjLV2UPkbv2jFfHyhmU1Lh1DZm.jpg (http://sendpic.org/view/1/i/4hNNgYyxxoknBFcWilmd6ChodEF.png)

Also again Keyframe messy and recovering slowly with prediction errors visible


No problem here.


"Rose bloom flying out of the CGI Chrome Window" = 01:34.000
http://i1.sendpic.org/t/yy/yy4Atj0hkop0lgEfm4EZ6e7PF33.jpg (http://sendpic.org/view/1/i/1VnEMB2SeMspUPyDITOy3Leccfl.png)


No problem here.

CruNcher
22nd February 2017, 22:07
The text always refers to the lower image in each screenshot.

But it is interesting: if the decoding is different on your side and you don't see those prediction errors, especially the colored blocks, even when stepping frame by frame through the "Rose bloom flying out of the CGI Chrome Window" scene, then something is going wrong.

I'm also pretty sure by now that rigaya made some seriously bad decisions in his default encoder tuning for NVENC; on perceptual grounds I find his decisions questionable, though this sample doesn't show that as strongly as others will.

Even if the metrics want to tell me otherwise.

But I'm still looking into that and trying to understand why rigaya does what he does in NVEncC ;)

CruNcher
24th February 2017, 18:18
My highest transcoding speed so far for 4K 60 FPS HEVC at 23 Mbit/s, 8 bit, decoding (CPU) / encoding (GPU/SIP):

0.64x

NikosD
25th February 2017, 16:04
This is the same "Samsung Journey" file transcoded to HEVC 8 bit by VCEEnc v3.05v2, but this time using Avisynth LWLibav for dithering/demuxing/decoding inside latest StaxRip.

I don't know if it has any difference from CLI VCEEnc v3.05v2 using ffmpeg SW demuxer/decoder.

I'm a little blind regarding to those small differences :)

https://www.sendspace.com/file/bubcxt

CruNcher
26th February 2017, 08:49
Subjective indeed, but a colored block that doesn't belong in the scene at all is no small difference for me personally, and a quantization that fluctuates so widely and recovers so slowly can't be rated a "small difference" either, imho.

Small artifacts here and there are overall acceptable but not such complete fails.

NikosD
26th February 2017, 10:00
My highest 23 Mbits Transcoding speed result so far of 60 FPS Hevc 4K Transcoding 8 bit Decoding(CPU)/Encoding(GPU/SIP)

0.64x

Using latest VCEEnc v3.05v2 as a benchmark tool, which leverages AMF API for HW decoding/ encoding, I did some tests today on the HW H.264 decoder/encoder of Polaris RX 470 card.

It seems that the HW H.264 encoder doesn't change its encoding speed by using different rate modes (CBR/VBR, CQP) or different bitrate for CBR/VBR (maximum bitrate for VBR/CBR H.264 encoding is 100Mbps and above 700Mbps for CQP)

The one and only parameter that affects HW encoding speed is quality.

So, for 1080p H.264 clip the encoding speed is:

fast -> ~140 fps

balanced -> ~ 100 fps

slow -> ~ 55 fps

The HW H.264 decoder can keep up with the fast quality setting (~140 fps at 1080p) with clips of up to 60 Mbps.

Beyond that limit the HW H.264 decoder slows down, dropping to ~95 fps on a 1080p 100 Mbps H.264 clip.

My Core i5 2400 can reach 140 fps at 1080p when the bitrate is 20 Mbps.

So, although the Polaris HW H.264 decoder seems a little slow, it still sustains the same 140 fps at 3 times the bitrate of the Core i5 2400.
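To spell out where the factor of 3 comes from: both decoders are quoted at about 140 fps for 1080p, but at different stream bitrates. The small sketch below only restates that arithmetic. Treating "bitrate sustained at the same frame rate" as the measure of decoder speed is an assumption of this sketch, and the function name is made up for illustration.

```python
# Figures taken from the post: both decoders sustain ~140 fps at 1080p,
# but at different stream bitrates.
POLARIS_MBPS_AT_140FPS = 60.0   # Polaris HW H.264 decoder
I5_2400_MBPS_AT_140FPS = 20.0   # Core i5 2400, software decoding


def bitrate_ratio(a_mbps: float, b_mbps: float) -> float:
    """Return how many times more bitrate decoder A sustains than decoder B
    at the same frame rate (hypothetical helper, assumption of this sketch)."""
    if b_mbps <= 0:
        raise ValueError("reference bitrate must be positive")
    return a_mbps / b_mbps


ratio = bitrate_ratio(POLARIS_MBPS_AT_140FPS, I5_2400_MBPS_AT_140FPS)
print(f"Polaris sustains {ratio:.1f}x the bitrate of the i5 2400 at ~140 fps")
```

This says nothing about the ~95 fps figure at 100 Mbps, which would need the frame rate of the source clip to compare fairly.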

Can you test your Maxwell 2nd gen HW H.264 decoder/encoder that way ?

Similar HEVC tests are not feasible at the moment, as the quality parameter for HEVC is still not working (it has the same speed for all settings slow/balanced/fast)

CruNcher
26th February 2017, 19:10
i'll look into it after the HEVC part

Btw could you test something ?

Could you show me your system's decoder/render jitter response when the wave is crashing down, once with AMD DXVA and once with the CPU (decoder of your choice)?

http://i1.sendpic.org/t/yh/yhO3mo0GcHIhbqaaOXmaeo7QV5p.jpg (http://sendpic.org/view/1/i/roFjjM2d36Uedz2nyZFXWrlw9Jk.png)

https://www.sendspace.com/file/cr4n8n


PS: That StaxRip transcode didn't come out any better than the direct VCEEnc one; it suffers from the same encoder issues.


I want to slowly leave The Journey ;)

Sony A7

Sony XAVC->Nvidia HEVC

https://www.sendspace.com/file/5l4m3v


This should also play very well on GM204, hybrid accelerated via CUVID and DXVA Copy-Back, but not so well via DXVA (native). Maybe even there if you use the new "Optimize for Compute" option in the newer drivers. The same goes for Cyberlink HAM, though it is better than DXVA native ;)
So if you have too many dropped frames, switch to DXVA Copy-Back, CUVID or HAM and/or try that new Optimize for Compute option.

CruNcher
28th February 2017, 10:16
One of the very first public encoder streams, re-transcoded at half the bitrate

DivX/Mainconcept->Nvidia Hevc

Speed = 1.7x

Perceptually it is partly really interesting; it should be 2nd or 3rd generation by now

https://www.sendspace.com/file/050xe5

PSNR

y:42.833979
u:46.333830
v:47.580143
average:43.819431
min:18.154593
max:73.970364


SSIM

Y:0.987446 (19.012031)
U:0.985141 (18.280104)
V:0.990760 (20.343423)
All:0.987614 (19.070676)
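For anyone wondering how the bracketed SSIM values relate to the raw scores: they are consistent with the usual decibel conversion, -10 * log10(1 - SSIM). The sketch below reproduces them from the numbers above. That the "All" score is a plane-size weighted mean (4:1:1 for Y:U:V with 4:2:0 chroma) is an assumption here; it happens to match these numbers, but the tool that produced them is not named in the post.

```python
import math


def ssim_to_db(ssim: float) -> float:
    """Convert an SSIM score in [0, 1) to decibels: -10 * log10(1 - SSIM)."""
    if not 0.0 <= ssim < 1.0:
        raise ValueError("SSIM must be in [0, 1)")
    return -10.0 * math.log10(1.0 - ssim)


# Per-plane scores copied from the post.
ssim = {"Y": 0.987446, "U": 0.985141, "V": 0.990760}

# Assumption: 4:2:0 chroma, so luma has 4x the samples of each chroma plane.
weights = {"Y": 4.0, "U": 1.0, "V": 1.0}
ssim_all = sum(ssim[p] * weights[p] for p in ssim) / sum(weights.values())

for plane, score in ssim.items():
    print(f"{plane}: {score:.6f} ({ssim_to_db(score):.3f} dB)")
print(f"All: {ssim_all:.6f} ({ssim_to_db(ssim_all):.3f} dB)")
```

Because the inputs are rounded to six decimals, the recomputed dB values agree with the posted ones only to about three decimal places.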

NikosD
28th February 2017, 14:26
i'll look into it after the HEVC part



Reading Anandtech's R9 285 H.264 HW decoder results here http://www.anandtech.com/show/8460/amd-radeon-r9-285-review/4, I can see that it's a little faster than 750 Ti.

I think that the HW H.264 decoder of the R9 285 is the same as the one in my RX 470, and that the 750 Ti has the same PureVideo decoder (VP6) as your 2nd-generation Maxwell.

So, I updated my table of H.264 benchmarks here https://forum.doom9.org/showthread.php?p=1712350#post1712350 with some useful/strange comments here https://forum.doom9.org/showthread.php?p=1798958#post1798958 regarding the Polaris HW H.264 decoder.

Is it possible to test your VP6 HW H.264 decoder on those clips ?

You can find all samples here ftp://helpedia.com/pub/multimedia/x264/testvideos/2011%20-%2002%20-%20H.264%20CPU%20DXVA%20codec%20comparison%20-%20Core2Duo%20vs%20UVD%202.2/ and here ftp://helpedia.com/pub/multimedia/x264/testvideos/2012%20-%2001%20-%20QuickSync%20vs%20UVD%202.2%20vs%20VP4/

JohnLai
28th February 2017, 14:32
This should also play very well on GM204, hybrid accelerated via CUVID and DXVA Copy-Back, but not so well via DXVA (native). Maybe even there if you use the new "Optimize for Compute" option in the newer drivers. The same goes for Cyberlink HAM, though it is better than DXVA native ;)
So if you have too many dropped frames, switch to DXVA Copy-Back, CUVID or HAM and/or try that new Optimize for Compute option.

Strange.:confused:
There is no hybrid HEVC CUVID decoding as far as I know.
CUVID is designed for pure fixed function hardware utilization.

NikosD
28th February 2017, 14:39
CUVID uses DXVA2, so if the DXVA2 implementation is hybrid then CUVID is hybrid too.

If it's pure fixed-function, then it's fixed-function.

It depends on DXVA2.

JohnLai
28th February 2017, 15:35
CUVID uses DXVA2, so if the DXVA2 implementation is hybrid then CUVID is hybrid too.

If it's pure fixed-function, then it's fixed-function.

It depends on DXVA2.

:confused:
Nope, I don't think CUVID uses DXVA2.
CUVID is supposed to be a different implementation from DXVA2.
Besides, for all HEVC 8bit samples with GM204, LAVFilter displays "active decoder : avcodec"

If it worked, LAVFilter should display "active decoder : cuvid"

NikosD
28th February 2017, 15:42
I'm under the impression that nevcairiel, the developer of LAV Filters, once told me that NVCUVID is like a wrapper around DXVA and has nothing to do with CUDA.

It's been years since then.

I could be remembering wrongly.

nevcairiel
28th February 2017, 16:00
CUVID can work in two modes, either direct hardware access or through DXVA. In DXVA mode, CUVID can access the hybrid DXVA decoders, but that mode doesn't work on Windows 10, only the pure CUVID mode works on 10, which only has fixed-function decoders.

JohnLai
28th February 2017, 16:05
CUVID can work in two modes, either direct hardware access or through DXVA. In DXVA mode, CUVID can access the hybrid DXVA decoders, but that mode doesn't work on Windows 10, only the pure CUVID mode works on 10, which only has fixed-function decoders.

Thanks for the clarification.
I am using Win10, which is why I only see "avcodec" when decoding 8bit HEVC using GM204.
:goodpost:

CruNcher
28th February 2017, 22:46
Overall it also makes no sense: the hybrid decoder will die, and in a certain complexity range CPU multithreading alone will be more efficient (especially on a relatively low-power CPU). To be system-efficient with it you would need to switch decoders dynamically based on the overall bitstream complexity, or get a nice async flow with the lowest CPU overhead. ;)

In that regard HEVC is a tad more parallelizable, but not by so much that it could threaten the CPU or ASIC on efficiency yet.

NikosD
1st March 2017, 04:20
Vega has a Virtualized Encode feature, which according to Anandtech means:


AMD has implemented an optimized video encoding path for virtualized environments on their GPUs.

A game streaming service requires that the contents of upwards of several virtual machines be encoded quickly, so this would be the logical next step for AMD’s on-board video encoder (VCE) by making it efficiently work with virtualization.


There is also a session with Mikhail Mironov at GDC
http://schedule.gdconf.com/session/true-audio-next-and-multimedia-amd-apis-in-games-and-vr-applications-development-presented-by-amd

CruNcher
1st March 2017, 10:33
We will also discuss research projects based on Advanced Media Framework (AMF). The key focus of this presentation is integration with game engines for better VR immersion.

It makes sense to virtualize it for content protection and security reasons as well. Most of it can already run pretty nicely async on GCN; virtualized, it should practically never interfere with the game threads and cause lock situations, as still happens too often currently.

It also fits the virtualization of the new Shader Model 6 running via LLVM.

And overall it fits the security model that has been on the roadmap since 2006 ;)

JohnLai
1st March 2017, 12:27
Virtualized Encode, eh....
Sounds good. But can AMD actually make sure its VCE software stack works?
Both rigaya and xaymar are having headaches with AMD's AMF implementation.

NikosD
1st March 2017, 12:31
That Mikhail Mironov guy, who is a lead developer of the AMF API and replies on GitHub, looks like he is doing a good job.

AMF is far better than the previous AMD MediaSDK.

The next version will iron out bugs and make features work, I think.

CruNcher
1st March 2017, 23:40
Though surprisingly he's not from the MSU ;)

CruNcher
4th March 2017, 14:54
Using the hardware decoder and encoder at the same time, with both NVENC engines (cores) at 50% (half their maximum utilization)

Before the CUDA Maxwell driver scheduler optimization (3D tasks Aero/DWM and Firefox)

0.931x (57 FPS)

Almost reached a stable 60 FPS for UHD now @ 20 Mbit/s, x264->Nvidia HEVC (~-3 FPS)

http://i1.sendpic.org/i/bP/bPW89uDcWkoLwoDorkuSi10Zg0V.png

~50W idle incl. peripherals (USB3/2 ...)
~120W encoding

Overall ~70W for NVDEC/NVENC + CUDA + CPU + memory (system/storage) together

Maximum possible system power draw is somewhere around ~350-380W
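As a rough cross-check of the ~70W figure, the sketch below subtracts idle from encoding power and divides by the reported frame rate to get an energy cost per encoded frame. The joules-per-frame number is derived here and is not something measured in the post; it also assumes the whole difference to idle is attributable to the transcode.

```python
# Wall power figures from the post (approximate readings).
IDLE_W = 50.0        # system idle, including peripherals
ENCODING_W = 120.0   # system while transcoding
ENCODE_FPS = 57.0    # reported transcoding speed


def transcode_power_delta(idle_w: float, load_w: float) -> float:
    """Power attributed to the transcode: load minus idle, in watts."""
    return load_w - idle_w


def joules_per_frame(delta_w: float, fps: float) -> float:
    """Energy per encoded frame in joules (watts divided by frames/second)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return delta_w / fps


delta_w = transcode_power_delta(IDLE_W, ENCODING_W)
print(f"transcode overhead: ~{delta_w:.0f} W")
print(f"energy per frame:   ~{joules_per_frame(delta_w, ENCODE_FPS):.2f} J")
```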


Node 0 = 3D/Shader Compiler
Node 1 = Cuda/Compute
Node 2 = ?
Node 3 = NVDEC
Node 4 = ?
Node 5 = ?
Node 6 = Copy
Node 7 = NVENC
Node 8 = NVENC
Node 9 - 14 = ?

Almost stable 60 FPS decoding (only intra frame latency issues persist) on the hybrid (GM204) CUDA decoder (I guess most problems will be solved with the new driver compute scheduler mode)

8 Bit Decoding Complexity is around 3 Sandy Bridge Cores and around 1.5 with the GM204 CUDA Decoder (so halved).

http://i1.sendpic.org/i/2T/2TBEHSsEIOg6tAvjgdZDGbncDtR.png

Zetto
4th March 2017, 18:33
Does anyone expect the 1080 Ti to have an updated NVENC engine?

sneaker_ger
4th March 2017, 18:45
Weren't video capabilities in the past tied to the code(name) of the chip? GTX 1080 Ti is supposed to be GP102 like Titan X, so I would not expect new capabilities. Nor VP9 10 bit (like GTX 1050 (Ti) aka GP107) for that matter. GTX 980 Ti also had fewer capabilities than GTX 960 despite the later release date.