View Full Version : Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing


STaRGaZeR
15th February 2012, 20:52
After all, the important part is playback (at least for me), and if i can benchmark it at 300 fps, doing 24 fps in playback (or even 60 fps) is done at nearly no load at all.

Maximum speed is important for seeking too. Seeking with ATI's DXVA can be a pain (max. is only ~2x playback speed), while it's fast as hell with QS or software.

CruNcher
15th February 2012, 21:41
@ Egur
That MFX driver thing is definitely not UDA style like im used to hehe
So you have different directories, each with a .dll of the same size, and which one gets installed depends on hardware detection in the installer :D So the driver will get pretty big over time by always cloning the same components for different hardware revisions. With Haswell there will be 3 by then, though ILK is not in that driver release anymore.

That leaked Driver also seems to have OpenCL 1.1 support :)

Though what is strange is that there are actually 2 different libmfxhw revisions for the two platforms. I wonder how the installer decides which one to copy into the Common Files folder for general application use, and why there are 2 revisions. I guess one of the two fixes the VC-1 interlaced decoding issue ;)

Current libmfxhw 18 Oct 2011

2.11.10.18
2.0.556.36397


New Driver libmfxhw 1 16 Dec 2011

3.11.12.16
3.0.253.38506

New Driver libmfxhw 2 19 Jan 2012

3.12.1.19
3.0.255.38772

both seem api version 1.03
the old one api version 1.01

RBG
15th February 2012, 22:49
Because it's still used widely in TV broadcasting and for DVDs (you know those optical discs that still outsell Blu-Ray by quite a large margin)?

So what? What percentage of people do you think are using their PC for DVD playback? Compared to the huge number of DVDs sold, they are an absolute minority. And the idea of watching DTV on a computer sounds just ridiculous to me. :p


You ask why people want to benchmark a format that is still widely used for video?

Well, on PC it is not that widely used nowadays compared to H.264 or VC-1. Moreover, MPEG-2 itself is not a power-hungry format; even an entry-level SB CPU can handle it easily without any kind of hardware-accelerated decoding.

CruNcher
15th February 2012, 23:15
I got it (pretty windows standard actually) ;)
Seems Asus as vendor was not in the list; i added my Subsys Hardware ID and the installer accepts it ;)

Yep, seems to be an OEM ISV driver

SLAOEMISV1/RBK/01-21-00 <- Software License Agreement OEM ISV

Voila installed ;)

http://img851.imageshack.us/img851/1314/leakdriverinstall.png

New Driver libmfxhw 2 19 Jan 2012

3.12.1.19
3.0.255.38772

was chosen by the installer


New Optimal 3D mode

http://img11.imageshack.us/img11/7656/optimal3dmode.png

could be something like Catalyst AI or some Optimization for Specific Engines (Games,Applications)

Design Order of the PP functions changed a bit

http://img195.imageshack.us/img195/363/designorder.png

2622 WPI :

http://img580.imageshack.us/img580/4057/2622.png

2626 WPI :

http://img836.imageshack.us/img836/8617/2626.png

GPU-Z still a no go

http://img27.imageshack.us/img27/1636/gpuzcapsviewer.png


Fixed as expected:

http://www.mediafire.com/?0ljab8afkp1jm47

Mixer73
16th February 2012, 01:29
So what? What percentage of people do you think are using their PC for DVD playback? Compared to the huge number of DVDs sold, they are an absolute minority. And the idea of watching DTV on a computer sounds just ridiculous to me. :p

Maybe I'm weird but I watch MPEG2 on my computer every day. We still have MPEG2 DVB-T broadcast and I have dual head so I can watch TV on one screen and do other stuff on the other in between watching. More comfortable than using a laptop on the couch.

Well, on PC it is not that widely used nowadays compared to H.264 or VC-1. Moreover, MPEG-2 itself is not a power-hungry format; even an entry-level SB CPU can handle it easily without any kind of hardware-accelerated decoding.

See above. However, I get Nev's comment that MPEG2's simplicity makes it not the best tool for comparative performance benchmarking. Still, it's a very important and oft-used format.

NikosD
16th February 2012, 20:52
@ Egur

That leaked Driver also seems to have OpenCL 1.1 support :)





GPU-Z still a no go

http://img27.imageshack.us/img27/1636/gpuzcapsviewer.png



I have told you before, but you didn't get it.

There is no such thing as HW OpenCL support for SandyBridge GPU.

It's only software support (CPU).

Did you actually test HW OpenCL support with the new driver ?

ramicio
16th February 2012, 20:53
I had to turn off "full floating point processing" in the renderer settings, and it works now.

CruNcher
16th February 2012, 21:01
I have told you before, but you didn't get it.

There is no such thing as HW OpenCL support for SandyBridge GPU.

It's only software support (CPU).

Did you actually test HW OpenCL support with the new driver ?

I said its OpenCL nothing more :) though it's anyway surprising don't you think that they provide the OpenCL support for the CPU with the GFX driver ;)

NikosD
16th February 2012, 21:25
OpenCL support was there from the beginning.

What is the change with the new driver ?

OpenCL usually is paired with GPU because it's a lot faster than CPU (for the specific tasks)

And because Ivy will support HW OpenCL 1.1, they probably put it in GPU drivers from the beginning.

As a matter of fact, where else could they put it ?

GPU is inside CPU.

CPU and GPU drivers, go together.

CruNcher
17th February 2012, 05:35
Going way too off-topic

https://forum.doom9.org/showthread.php?p=1559021

RBG
17th February 2012, 06:38
We still have MPEG2 DVB-T broadcast and I have dual head so I can watch TV on one screen and do other stuff on the other in between watching.

I've got a hardware decoder for that stuff, and it is way better than a PC in terms of power consumption and reliability.;)


See above. However, I get Nev's comment that MPEG2's simplicity makes it not the best tool for comparative performance benchmarking. Still, it's a very important and oft-used format.

As for me, there should be more VC-1 testing, as it really requires hw acceleration more than other formats, mostly because it lacks decent software decoders.

NikosD
17th February 2012, 12:12
"According to Digitimes, a weak global economy has caused a build-up of Sandy Bridge inventory both at Intel and OEMs.

If Intel went ahead and mass released Ivy Bridge in April, these Sandy Bridge parts would have to be thrown away or sold at much lower prices.

Now the plan is to release some Ivy Bridge chips in April, but postpone mass shipments (presumably of consumer-oriented parts) until after June.

System builders should still be able to get their hands on some mid- and high-end Ivy Bridge chips in April."

nevcairiel
17th February 2012, 12:24
All that means is that the low to mid-end dual core chips will be delayed. I can still get me a quadcore to build my high-end HTPC. :D

CruNcher
17th February 2012, 16:02
Nah, i wouldn't buy Ivy Bridge. I'll wait for Haswell; the amazement will be much greater by then. Even if having the first Tri-Gate processor is a big change, i want the more advanced one, not the first one :D ;)

Though since the early days i was always jumping back and forth between AMD and Intel (with a short stay with Cyrix), and i guess not much will change. Currently i'm in the Intel stage again, though i want to see how Fusion develops too, so it could happen that the next Bulldozer generation falls in between the Haswell shift ;)

STaRGaZeR
17th February 2012, 19:37
Eric, take a look at this post (http://forum.doom9.org/showpost.php?p=1559102&postcount=9154). Forcing nOutputQueueLength in the QS config to 8 seems to fix it, but maybe there's more to it, related to your recent changes.

EDIT: after some time there is stutter here and there even with 8, just like with 16 but sporadic.

nevcairiel
17th February 2012, 20:04
Hey Eric,

it seems like the QS decoder isn't particularly happy with H264 in Annex B format (MEDIASUBTYPE_H264). I've been trying to improve rtp/rtsp streaming, and i figured with such a volatile stream it might be beneficial to let the splitter keep it in its original form (which is H264 AnnexB), however when i feed that to your QS decoder through LAV Video, i get nothing (only black screen, it doesn't seem to output any decoded frame)
Any ideas? I can try to upload a test build that does this so you can try to reproduce.

egur
17th February 2012, 23:00
Eric, take a look at this post (http://forum.doom9.org/showpost.php?p=1559102&postcount=9154). Forcing nOutputQueueLength in the QS config to 8 seems to fix it, but maybe there's more to it, related to your recent changes.

EDIT: after some time there is stutter here and there even with 8, just like with 16 but sporadic.
I can't test live TV, but Nev's point on the high delay during playback affecting live TV makes sense. Long queues provide good performance and good frame rate calculations. The latter isn't important for LAV as he disables the feature.
In order to reduce latency in live playback, the queues need to be much shorter. Since my decoder doesn't know the context, maybe the fix should be in the DirectShow decoder filter (LAV, ffdshow).
@Nev, any suggestions?

Hey Eric,

it seems like the QS decoder isn't particularly happy with H264 in Annex B format (MEDIASUBTYPE_H264). I've been trying to improve rtp/rtsp streaming, and i figured with such a volatile stream it might be beneficial to let the splitter keep it in its original form (which is H264 AnnexB), however when i feed that to your QS decoder through LAV Video, i get nothing (only black screen, it doesn't seem to output any decoded frame)
Any ideas? I can try to upload a test build that does this so you can try to reproduce.
I have a single clip with this subtype (MEDIASUBTYPE_H264), so my testing so far was limited (works fine BTW).
The HW decoder only accepts this sort of stream. For AVC1 I have to convert it to such a stream.
I can try to reproduce, sure. Please supply a problematic clip. I don't think I need your build unless you change the stream somehow.
Also, does it work with another splitter (Haali)?
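
The AVC1 conversion mentioned above can be illustrated with a minimal sketch. It assumes the usual AVC1 layout, where each NAL unit is preceded by a big-endian length field (4 bytes here; a real stream declares the size in its configuration record), and rewrites it with Annex B start codes. The function name `avcc_to_annexb` and the sample bytes are hypothetical and are not taken from the QS decoder source.

```python
START_CODE = b"\x00\x00\x00\x01"  # 4-byte Annex B start code


def avcc_to_annexb(sample: bytes, length_size: int = 4) -> bytes:
    """Rewrite length-prefixed NAL units as start-code-prefixed ones."""
    out = bytearray()
    pos = 0
    while pos < len(sample):
        if pos + length_size > len(sample):
            raise ValueError("truncated NAL length field")
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + nal_len > len(sample):
            raise ValueError("NAL unit runs past end of sample")
        out += START_CODE + sample[pos:pos + nal_len]
        pos += nal_len
    return bytes(out)


# Two tiny fake NAL units (payload bytes are made up for illustration).
sample = b"\x00\x00\x00\x02\x65\xaa" + b"\x00\x00\x00\x01\x41"
annexb = avcc_to_annexb(sample)
```

A stream that already arrives as MEDIASUBTYPE_H264 would skip this step entirely, which is why the two input paths can behave differently.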

nevcairiel
17th February 2012, 23:01
The HW decoder only accepts this sort of stream. For AVC1 I have to convert it to such a stream.
I can try to reproduce, sure. Please supply a problematic clip. I don't think I need your build unless you change the stream somehow.
Also, does it work with another splitter (Haali)?

It's a streaming source, i cannot make it into a clip (and Haali doesn't support streaming protocols)

nevcairiel
17th February 2012, 23:11
I can't test live TV, but Nev's point on the high delay during playback affecting live TV makes sense. Long queues provide good performance and good frame rate calculations. The latter isn't important for LAV as he disables the feature.
In order to reduce latency in live playback, the queues need to be much shorter. Since my decoder doesn't know the context, maybe the fix should be in the DirectShow decoder filter (LAV, ffdshow).
@Nev, any suggestions?

Well how do your queues work? Does it not output anything until they are full?
Is it maybe possible to decrease the delay without cutting into the performance so drastically? (Using queues at 0 is quite the performance impact right now)

Specifying a context would require somehow listing all sorts of TV applications, which would be an impossible task.

egur
18th February 2012, 10:35
Without being able to debug, I can't root cause the decode problem (H264 fourcc). If there's a streaming setup I can run using a web source and not a live TV source then please specify the full setup and I'll debug it.

Regarding queue length and performance, I'm not sure why lengthening the queues removes such a big bottleneck. It took me a couple of hours of tweaking with v0.22, which had fewer features than 0.26 (0.22 didn't have async decode or MT copy). The only way I managed to keep 0.22's performance high was to always use queues, even if they are not needed functionally (no need to calculate time stamps). Applying that to 0.26 with some other fine-tuning created v0.27.
I'll try to further root cause the bottleneck and hopefully use shorter queues.

If all else fails, we can go the GPU driver way - use profiles (e.g. cheat :) ). High queues for benchmarks and low/zero queues for the rest.

nevcairiel
18th February 2012, 10:37
I don't really care for benchmark values, but higher speed also helps with transcoding tasks or simply with seeking. Granted, when seeking there is probably no big difference between 200 or 300 fps, as long as it's quite a bit faster than realtime speed.

egur
18th February 2012, 11:00
I don't really care for benchmark values, but higher speed also helps with transcoding tasks or simply with seeking. Granted, when seeking there is probably no big difference between 200 or 300 fps, as long as it's quite a bit faster than realtime speed.

Seeking speed is determined by how fast the current decode is aborted and how fast the first frame from the new segment is output. I can somewhat improve on the former (abort stage), and the latter should improve if not using queues (less work is done before outputting a frame).
So if you don't care about benchmarks, just kill the queues via config or set them to a length of 1.
It might be possible for you to expose an interface to the player and have him set an option for live/offline playback. Maybe the various player writers can comment on this.

nevcairiel
18th February 2012, 11:29
Seeking speed is determined by how fast the current decode is aborted and how fast the first frame from the new segment is output.

How fast the first frame is output depends on decoding speed. Seeking usually does not end up exactly on a keyframe, which means you need to start decoding at the previous keyframe, and there can be quite a lot of frames between that keyframe and the seek target. The faster you decode those "pre-roll" frames, the faster the seeking is done. That means that decoding speed is directly proportional to seeking speed.
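
A rough worked example of the pre-roll argument, as a minimal sketch. It assumes a fixed keyframe interval, which real streams do not guarantee; the function name and the numbers are hypothetical.

```python
def seek_decode_time(target_frame: int, keyframe_interval: int,
                     decode_fps: float) -> float:
    """Seconds spent decoding from the previous keyframe up to the target."""
    preroll = target_frame % keyframe_interval  # frames after the last keyframe
    return (preroll + 1) / decode_fps           # pre-roll frames plus the target


# Worst case for a 250-frame keyframe interval: the target sits just
# before the next keyframe, so 250 frames have to be decoded.
slow = seek_decode_time(1249, 250, 50.0)    # roughly 2x realtime for 24 fps
fast = seek_decode_time(1249, 250, 250.0)   # a fast HW or SW decoder
print(slow, fast)
```

With these made-up numbers the same seek costs five times longer on the slow decoder, which is the proportionality described above.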

@STaRGaZeR:
Can you test if buffers set to 0 run flawlessly for you?
Maybe i can add a "low latency" checkbox that would trigger this (and possibly also activate it by default). Should probably benchmark the difference to decide if i turn it on by default.


Edit:
I finished benchmarking in "low latency" mode with Queue = 0
https://docs.google.com/spreadsheet/ccc?key=0Ajo8vvjNtaZ5dC1abjBSeVlmcnZXSjYwampfamk3ZWc

The results are odd. On some clips, i see increased speeds, other clips remain the same - only that weird Samsung clip is significantly slower (apparently it has 16 ref-frames)
Maybe that decision should be based on the number of refframes a clip has? :d

egur
18th February 2012, 12:26
How fast the first frame is output depends on decoding speed. Seeking usually does not end up exactly on a keyframe, which means you need to start decoding at the previous keyframe, and there can be quite a lot of frames between that keyframe and the seek target. The faster you decode those "pre-roll" frames, the faster the seeking is done. That means that decoding speed is directly proportional to seeking speed.

@STaRGaZeR:
Can you test if buffers set to 0 run flawlessly for you?
Maybe i can add a "low latency" checkbox that would trigger this (and possibly also activate it by default). Should probably benchmark the difference to decide if i turn it on by default.

Edit:
I finished benchmarking in "low latency" mode with Queue = 0
https://docs.google.com/spreadsheet/ccc?key=0Ajo8vvjNtaZ5dC1abjBSeVlmcnZXSjYwampfamk3ZWc

The results are odd. On some clips, i see increased speeds, other clips remain the same - only that weird Samsung clip is significantly slower (apparently it has 16 ref-frames)
Maybe that decision should be based on the number of refframes a clip has? :d

I'm using the Samsung clip for testing. When queue len is 0, it runs at single threaded speeds (aside from MT copying). Meaning some of the MT code is being serialized somehow. I'm working on it. My next release will also include some tweaks to abort decode faster. I'll report again when I have something useful.

Edit:
The decode speed shown in benchmarks is the average decode speed across the entire length. When queue length is zero, the QS decoder will output the first frames ASAP. Otherwise it will queue 16 decoded frames and then start to output them. If the HW decoder has to work hard (very high bitrate), the seek time should be lower as it will not have to decode 15 extra frames.
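
To put numbers on the queue effect, here is a minimal sketch. It assumes the decoder holds back output until the queue is full, and that on a live source frames arrive no faster than the stream frame rate. Both function names and the example rates are hypothetical.

```python
def first_output_delay(queue_len: int, decode_fps: float) -> float:
    """Seconds until the first frame leaves the decoder after a seek."""
    frames_before_output = max(queue_len, 1)   # at least the frame itself
    return frames_before_output / decode_fps


def live_latency(queue_len: int, stream_fps: float) -> float:
    """Extra delay on a live source, where frames arrive in real time."""
    return queue_len / stream_fps


for qlen in (0, 8, 16):
    print(qlen, first_output_delay(qlen, 200.0), live_latency(qlen, 25.0))
```

On file playback the queue only costs a few frame-decode times, but on a 25 fps live stream a 16-frame queue would hold back more than half a second of video under these assumptions.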

For ffdshow, I must keep queues at least 8 long for proper time stamp correction...

nevcairiel
18th February 2012, 13:02
In the meantime, i built a check based on ref frames. MPEG2 and VC1 are limited to two ref frames, so they always get 0, and for H264 i parse the value from the SPS if it's available and set the value accordingly.
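
A check like this could look roughly as follows. This is only a sketch: the post does not say which queue length is chosen for which ref-frame count, so `low_ref_limit`, `long_queue` and the fallback for a missing SPS are hypothetical placeholders, not LAV Video's actual values.

```python
from typing import Optional


def pick_queue_length(codec: str,
                      ref_frames: Optional[int] = None,
                      low_ref_limit: int = 4,
                      long_queue: int = 8) -> int:
    """Return 0 (low latency) or a longer queue, based on reference frames."""
    codec = codec.upper()
    if codec in ("MPEG2", "VC1"):
        return 0                      # at most two ref frames, per the post
    if codec == "H264":
        if ref_frames is None:        # assumption: no SPS value, keep the queue
            return long_queue
        return 0 if ref_frames <= low_ref_limit else long_queue
    return long_queue                 # assumption: unknown codec keeps the queue


print(pick_queue_length("VC1"), pick_queue_length("H264", 16))
```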

pururin
21st February 2012, 08:43
Greetings Eric. I've had a Sandy Bridge since early 2011 (using the Microsoft decoder before, and you don't know how happy I was when your decoder was born!)

There's a thing I'd like to ask.
In this sample: http://www.wupload.com/file/2651705652

Around 0.46-0.47 sec, when the color goes black, there are artifacts on the lower half of the screen.
This happens only when HW acceleration is active, on any decoder I've used. When inactive the color is pure black as it should be.
:thanks:

NikosD
21st February 2012, 09:38
Eric,

I'm not finished updating my page with benchmark results.
I just did the tests for my system (signature).

What is the best way to update an existing Intel driver ?

Overwrite the old driver with the new one ?

Uninstall manually from Control Panel (or other program) the old driver and then install the new one ?

Automatically install from the on-line tool of Intel's site the new driver ?

I want to benchmark the Intel system with the new driver but I don't want to mess things up because the Intel system is not mine and with present driver v2559, it works very well.

egur
21st February 2012, 09:51
Eric,

...
What is the best way to update an existing Intel driver ?

Download the Intel driver from Intel's download center.
Here are links to v2622: 32 bit (http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=3231&DwnldID=20840&ProductFamily=Graphics&ProductLine=Desktop+graphics+controllers&ProductProduct=Intel%c2%ae+HD+Graphics&lang=eng) 64 bit (http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=3231&DwnldID=20842&ProductFamily=Graphics&ProductLine=Desktop+graphics+controllers&ProductProduct=Intel%c2%ae+HD+Graphics&lang=eng)

No need to uninstall. Just install, reboot, run benchmarks, reinstall old driver and tell installer to overwrite the new driver with the old.
I jumped back and forth with driver versions a few times without problems.

egur
21st February 2012, 15:57
Greetings Eric. I've had a Sandy Bridge since early 2011 (using the Microsoft decoder before, and you don't know how happy I was when your decoder was born!)

There's a thing I'd like to ask.
In this sample: http://www.wupload.com/file/2651705652

Around 0.46-0.47 sec, when the color goes black, there are artifacts on the lower half of the screen.
This happens only when HW acceleration is active, on any decoder I've used. When inactive the color is pure black as it should be.
:thanks:

I'll take a look. Unfortunately decode errors occur from time to time. Hopefully a future driver will fix them. BTW, what splitter was used?

pururin
21st February 2012, 17:56
I tried with both Haali and LAV splitter. Just now it occurred to me to test the other HW mode in LAV,
and I found that DXVA2 gives much the same result, so I wonder if the problem is at a lower level or maybe something about the clip itself?
(in software decoding mode all is fine though)

CruNcher
22nd February 2012, 10:24
yep like the mysterious decodingerror.ts which btw wasn't fixed with the Driver update

egur
22nd February 2012, 10:28
yep like the mysterious decodingerror.ts which btw wasn't fixed with the Driver update

Hi Cruncher, can you repost the url for the clip

NikosD
22nd February 2012, 10:28
Eric,

I have finally completed my survey of HW and SW decoders at my thread for Intel and ATI HW.

The results are odd.

For clips 1 to 6 the new QS decoder 0.28, although it is heavily optimized and multi-threaded, is A LOT SLOWER than QS 0.20 (more than 20%) for the FFDShow implementation.
LAV QS is faster, but not fast.

The same goes for VC-1 clip, too.

On the other hand FFDshow is very fast, even faster than native DXVA on clips 7 to 10.

Take a look here:
http://forum.doom9.org/showthread.php?t=163110

egur
22nd February 2012, 12:02
Eric,

I have finally completed my survey of HW and SW decoders at my thread for Intel and ATI HW.

The results are odd.

For clips 1 to 6 the new QS decoder 0.28, although it is heavily optimized and multi-threaded, is A LOT SLOWER than QS 0.20 (more than 20%) for the FFDShow implementation.
LAV QS is faster, but not fast.

The same goes for VC-1 clip, too.

On the other hand FFDshow is very fast, even faster than native DXVA on clips 7 to 10.

Take a look here:
http://forum.doom9.org/showthread.php?t=163110

Very odd indeed.
I don't use DXVAChecker for testing, only GraphStudioNext. Nev's benchmarks (https://docs.google.com/spreadsheet/ccc?key=0Ajo8vvjNtaZ5dC1abjBSeVlmcnZXSjYwampfamk3ZWc#gid=0), which are very similar to what I see at home, have significantly better results.
I'll try running it today with DXVA checker.

Can other users run at least one of the benchmark's clips and report what they got?
Maybe the code is not optimal for the i3 processor (Nev and I have i7).

Also please change "QS LAV" to LAV, it's confusing.

Correction:
You used an i5 system, which should provide similar performance to my i7.

CruNcher
22nd February 2012, 12:48
@ egur
you already have that stream anyway: http://www.mediafire.com/?rld8gnlh52f03ud If you seek there are no problems; if you let it play, 1 frame will be corrupted on QuickSync. Nothing changed.

NikosD
22nd February 2012, 13:51
Also please change "QS LAV" to LAV, it's confusing.



QS-LAV-QS means QS HW video processor-LAV video decoder-QuickSync (QS) mode

egur
22nd February 2012, 22:24
@NikosD
I ran a few tests at home and reproduced your results.
I don't know why the slowness in this scenario occurs or why ffdshow is a little faster.
What troubles me is that it doesn't work with at least one of the clips in LAV (basketball clip). In my PC it didn't crash, it was in some kind of infinite loop (taking a lot of CPU).
I ran DXVAChecker under a debugger and paused it during the freeze. LAV filter wasn't even loaded so the problem is somewhere within DXVAChecker.

It seems the copy back method used slows things down for the first 6 clips. For the remaining 4 clips, the decoder is slower than copying so there's no impact - they occur in parallel.
EVR also copies the image, so that's another overhead on top of the benchmarks in GraphStudioNext.

BTW, I removed the copy function call from my code and it didn't change the results that much (less than 3%) so something is definitely wrong here.

It seems that when EVR is present, the system is very sensitive to how many D3D9 surfaces I use within the decoder. Fewer surfaces give a performance boost.

nevcairiel
23rd February 2012, 06:29
The first step should be to figure out which parts actually require more time when running with EVR. Somehow it seems odd that the decoder itself just runs that much slower just because EVR is loaded. Otherwise the same problem would occur if you just open a EVR while doing a GraphStudio benchmark, wouldn't it?
I can even watch a movie in one window (with EVR) and benchmark in another window, without any performance loss.

The odd thing is, DXVAChecker even runs slower when its EVR is running on my NVIDIA with QuickSync decoding on the IGP. This doesn't make sense.
I have some ideas that might explain this, but i need to investigate a bit when i'm back home before i can comment on those.

ryrynz
23rd February 2012, 07:05
Nev, does it set to defaults? or perhaps keep previous settings? You could try removing the dGPU and try changing the setting then reinstalling it, I have a feeling it's probably setting defaults however.

NikosD
23rd February 2012, 07:21
I've finished my benchmark survey.

All platforms included (ATI, Nvidia, Intel) and the main decoders:

Microsoft DirectShow,
Microsoft MediaFoundation,
LAV Video all modes (NVCUVID, QuickSync decoder, DXVA2 Copy-back, DXVA2 Native),
CoreAVC (NVCUVID, DXVA2),
FFDShow QuickSync decoder.

Results here:
http://forum.doom9.org/showthread.php?t=163110

NikosD
23rd February 2012, 07:52
The first step should be to figure out which parts actually require more time when running with EVR. Somehow it seems odd that the decoder itself just runs that much slower just because EVR is loaded. Otherwise the same problem would occur if you just open a EVR while doing a GraphStudio benchmark, wouldn't it?
I can even watch a movie in one window (with EVR) and benchmark in another window, without any performance loss.

The odd thing is, DXVAChecker even runs slower when its EVR is running on my NVIDIA with QuickSync decoding on the IGP. This doesn't make sense.
I'm inclined to call tool failure. :p

Too bad GraphStudio can't benchmark with EVR, it has VSYNC enabled and tops out at 60fps :(

I have some more ideas that might explain this, but i need to investigate a bit when i'm back home before i can comment on those.

GraphStudioNext is missing a lot of benchmark information.

It doesn't have GPU load, CPU load and, most important, it doesn't have min and max values for FPS counting.

It displays only Avg values for FPS.

The benchmark mode of DXVA native, using EVR, is completely wrong.

It pushes CPU load to max 100%, which is incorrect behavior.
CPU has nothing to do with DXVA native.

You can benchmark DXVA native with DXVA Checker only, which has all of the above information, too.

With that tool - DXVA Checker - you can benchmark everything (CPU, DXVA (all modes), NVCUVID, QuickSync decoder)

Even for LAV copy-back modes I would definitely trust more DXVA Checker tool, than GraphStudioNext.

The latter is useful only for pure CPU mode.

nevcairiel
23rd February 2012, 08:00
And DXVAChecker cannot benchmark pure decoding performance without the overhead a renderer adds, quite a serious flaw if you want to benchmark anything that is not native DXVA.
No tool is ever perfect. :p

egur
23rd February 2012, 10:24
@Nev,
Please check the following with DXVa checker:
* Does LAV decoder connect to EVR using NV12?
* Basketball clip - does it refuse to connect to EVR? Is LAV decoder even instantiated?
* What splitter is used for the Basketball clip?

You can't benchmark EVR in GraphStudio - it maxes out at 60 fps (any clip). GraphStudio probably doesn't configure EVR for full speed.

When the renderer is on a dGPU, I see a performance boost but it's still very far from GraphStudio with NULL-renderer results.

nevcairiel
23rd February 2012, 18:06
I had some time to think about this, and to analyze some processing flows, and i have an idea that makes somewhat sense.

With all the multi-threading, the decoder usually works like this:

- Input Buffer
- Input Buffer
- Input Buffer
- Input Buffer
- Output Frame
- Output Frame
- Output Frame
- Output Frame
- Input Buffer
- Input Buffer
- Input Buffer
- Input Buffer
- ... and output again, and repeat

For three input buffers, no frame is output, but for the last one, it outputs 4 frames at once.
Now what i think the problem is that in the time it takes to actually render those 4 frames, the DirectShow filter cannot supply new data, so the QS decoder basically runs dry.
This is not a problem with the NULL renderer, because the rendering operation is instant (a no-op)

Eric, would it maybe be possible to spread out the output of those frames, so it would basically go like this:

- Input Buffer
- Input Buffer
- Input Buffer
- Input Buffer
- Output Frame
- Input Buffer
- Output Frame
- Input Buffer
- Output Frame
- Input Buffer
- Output Frame
- ... etc

Of course the initial delay will remain, but after that, try to interleave input and output events so that the decoder gets new data sooner? Of course there should be a threshold, so that when too many frames are queued it sometimes pushes out more than one, but 4-5 seemed to be a pattern i observed.
Am i making sense here? :)

PS:
Output Queue Length does not influence the behavior, but turning off MT causes it to be 90% properly interleaved.

egur
23rd February 2012, 20:14
The MT is different from what you described.
Ti represents thread i.

* T1 (Receive thread): Every compressed sample is sent asynchronously to the decoder if it's idle; MSDK returns a wait handle. If copied samples are ready, they are sent to the owning DS filter (e.g. LAV).
* T2 (async decode thread): Waits for a thread message. A message includes a D3D surface and a wait handle from T1. T2 waits on the handle (frame decode complete) and adds the ready D3D surface to a queue. T2 then returns to the message loop.
* T3 (post process/frame copy thread): Waits on its message loop. A message contains a D3D surface from T2. T3 pushes the D3D surface to an output queue. If the output queue is long enough, it processes the frame (mostly a copy to a system buffer). The result buffer is taken from a fixed-size free frame queue; T3 waits until a frame is available. After copying, the result is saved into a processed frame queue, ready to be sent to the DS filter. Frame copying is done in MT using threads T4 and T5; T3 waits until they are done. The free frames and processed frames queues are short (size 2) in order to keep the L3 cache hot. This is not a bottleneck.

During the run of T1, it will query the processed frames queue several times and see if a frame is ready to be sent. So a frame is sent out very quickly.
If for some reason the DS decoder (LAV) doesn't get Receive calls no frames will be outputted. Maybe there's a way to increase EVR's frames queue to overcome the problem.
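
The hand-off between the threads can be pictured with a much simplified sketch. It is not the decoder's code: the stage functions are hypothetical stand-ins, the output queue delay and the T4/T5 copy helpers are left out, and the caller (playing the role of T1) submits everything first and collects frames afterwards instead of interleaving the two.

```python
import queue
import threading

STOP = object()  # sentinel that shuts a stage down


def decode_stage(inbox, outbox):
    # Stand-in for T2: wait for a submitted sample, mark it decoded.
    while (item := inbox.get()) is not STOP:
        outbox.put(("surface", item))
    outbox.put(STOP)


def copy_stage(inbox, outbox):
    # Stand-in for T3: copy each surface into a "system memory" frame.
    while (item := inbox.get()) is not STOP:
        outbox.put(("frame", item[1]))
    outbox.put(STOP)


def run_pipeline(samples):
    to_decode = queue.Queue()
    to_copy = queue.Queue()
    processed = queue.Queue(maxsize=2)  # short queue, as in the description
    workers = [
        threading.Thread(target=decode_stage, args=(to_decode, to_copy)),
        threading.Thread(target=copy_stage, args=(to_copy, processed)),
    ]
    for w in workers:
        w.start()
    for s in samples:            # T1 role: hand every sample to the decoder
        to_decode.put(s)
    to_decode.put(STOP)
    delivered = []
    while (frame := processed.get()) is not STOP:
        delivered.append(frame)  # T1 role: deliver finished frames downstream
    for w in workers:
        w.join()
    return delivered


frames = run_pipeline(["s0", "s1", "s2", "s3"])
```

Because the last queue is bounded, the copy stage stalls whenever the consumer stops collecting frames, which mirrors the point that no frames come out if Receive is not being called.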

Update:
Disabling MT improves performance by more than 60%! --> Update: False results. I was running tests too late at night :(
Nev, even if your theory is correct, I don't see how to force it. Allow only 1 output frame per Decode call? I need to think about it.

nevcairiel
23rd February 2012, 21:01
Nev, even if your theory is correct, I don't see how to force it. Allow only 1 output frame per Decode call? I need to think about it.

I didn't even make any assumptions about how the MT works internally, i just added some logging on every call to Decode and every call of my Frame Callback. That's what i saw: usually 4-5 Decode calls, and then it outputs 4-5 frames in one go.
Empirical evidence ftw. :)
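
A burst pattern like this is easy to spot once the log is reduced to run lengths. The snippet below is only an illustration: the `D`/`F` event string is a made-up stand-in for "Decode call" and "Frame Callback" log lines, not real log output.

```python
from itertools import groupby


def run_lengths(events):
    """Collapse a sequence of event markers into (marker, run length) pairs."""
    return [(kind, len(list(group))) for kind, group in groupby(events)]


# Hypothetical log: 'D' = Decode call, 'F' = Frame Callback.
bursty_log = "DDDDFFFFDDDDDFFFFF"
print(run_lengths(bursty_log))
```

A balanced decoder would show runs of length 1 (`DFDFDF...`); long alternating runs are the burst behaviour described above.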

I think my theory makes sense, and I think it shouldn't be too hard to test.
Anyway, yeah, output one frame per Decode call, unless the output queue gets too long; then either output all the ones that are "too much" for the queue, or just output two for a while until it's balanced out again.
This will ensure the decoder is always fed with new data to decode, and generally operates smoother and not in such bursts.

The fact that turning off MT actually improves performance supports my theory, because in single-threaded mode the calls to Decode and to the Frame Callback are usually pretty balanced.
I tried checking your code, but you have like 5 places where surfaces are delivered and only a queue of 2 frames, so this whole design would need to be adjusted a bit to allow for smarter handling of the frame output.

I would probably go with a queue size of maybe 6. If you're in the middle of the decoding process and you need more space, deliver just enough to make room, and then deliver one at the end of the decode process.
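
A rough sketch of that delivery policy, to make the idea concrete. Everything here is hypothetical (`OutputQueue`, `simulate`, and the burst pattern fed into it); it is not code from either decoder, just the rule "deliver only what is needed to make room, plus one frame at the end of each Decode call" with a queue of 6.

```python
from collections import deque


class OutputQueue:
    """Holds decoded frames and releases them smoothly instead of in bursts."""

    def __init__(self, capacity=6):
        self.capacity = capacity
        self.frames = deque()

    def push(self, frame):
        """Queue a new frame; return the frames that had to go out to make room."""
        forced = []
        while len(self.frames) >= self.capacity:
            forced.append(self.frames.popleft())
        self.frames.append(frame)
        return forced

    def end_of_decode(self):
        """At the end of a Decode call, deliver at most one queued frame."""
        return [self.frames.popleft()] if self.frames else []


def simulate(frames_per_call, capacity=6):
    """frames_per_call[i] = frames the decoder emits during Decode call i."""
    out = OutputQueue(capacity)
    next_id = 0
    delivered_per_call = []
    for produced in frames_per_call:
        delivered = []
        for _ in range(produced):
            delivered += out.push(next_id)
            next_id += 1
        delivered += out.end_of_decode()
        delivered_per_call.append(delivered)
    return delivered_per_call, list(out.frames)


# Assumed burst pattern: four Decode calls with no output, then five frames at once.
per_call, still_queued = simulate([0, 0, 0, 0, 5] * 3)
print([len(d) for d in per_call], still_queued)
```

With this input the first burst still has to arrive before anything can be delivered, but after that every Decode call hands exactly one frame downstream. The frames left in the queue at the end would need a flush at end of stream or on seek, which the sketch leaves out.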
I don't feel confident enough with your whole MT trickery to try to adjust it myself, though, as I don't understand the reasons for all the scattered DeliverSurface calls and don't want to cause a deadlock.

I can probably even do that myself by queueing up frames internally (copy onto a media sample, but don't deliver yet). I would try that tomorrow or over the weekend, unless you try inside the Intel decoder first.

nevcairiel
23rd February 2012, 22:06
Also, now that I think about it again, I wonder if it's really worth spending much time on. Even in the DXVAChecker benchmark, I get nearly 300 fps (granted, rendering on my NVIDIA GPU and with faster RAM).

If it's easy enough for you to test, I wouldn't mind seeing results, but if it's not, I don't care all that much.

If you have some other plausible ideas, I'm all ears.

egur
24th February 2012, 08:16
Using a dGPU really improved results.
Disabling MT hurt results. The results from last night were completely wrong, I had a build w/o actual copying.
I've managed to tweak the code a little but no big improvement when EVR is on the iGPU. I'll do more tests after the weekend.

CruNcher
24th February 2012, 10:51
So this performance issue is only a problem if you use QuickSync while rendering on the iGPU at the same time?
It makes sense, though: the MFX uses the EUs, and if the EUs are under pressure there should be a performance impact. Using EVR puts pressure on the EUs, the same as Aero (dwm) does, and with deinterlacing at the same time I guess the pressure should be even higher (you can actually measure the overhead; it's small, but it's there).
I'm pretty sure that with encoding (H.264; here it's even official that the MFX uses the EUs for motion estimation) and rendering directly out (EVR) you're going to see the same effect (most probably any PP in the Intel Control Panel would stress it even more).

nevcairiel
24th February 2012, 11:43
I tested software mode in DXVAChecker and GraphStudio to get a baseline comparison, and I can basically see the same results.

Twin Peaks sample
Software: ~550 fps in GraphStudio, ~330 fps in DXVAChecker (EVR on both Intel and NVIDIA, didn't seem to make a difference).
QuickSync: ~400 fps in GraphStudio; in DXVAChecker, ~280 fps on NVIDIA and ~190 fps on Intel.

I'm inclined to say that EVR is just the limiting factor and uploading the frames in the renderer just takes that much more time.
Maybe it can still be optimized, but I'm really wondering if it's worth any effort. It wouldn't help CPU usage during playback; the only thing it would change is the benchmarking numbers.
It does show that not copying the frame back and forth is of course far more efficient, but in the end, during normal playback, the difference is still minimal.
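
As a rough sanity check, the throughput figures can be turned into per-frame times. The calculation below uses only the approximate numbers quoted in this post; the assumption (not verified here) is that the gap between the GraphStudio and DXVAChecker figures is the extra per-frame cost of the DXVAChecker/EVR path. The helper names are made up for the example.

```python
# Approximate throughput figures (fps) quoted above for the Twin Peaks sample.
FPS = {
    "software": {"graphstudio": 550, "dxvachecker": 330},
    "quicksync_nvidia": {"graphstudio": 400, "dxvachecker": 280},
    "quicksync_intel": {"graphstudio": 400, "dxvachecker": 190},
}


def frame_time_ms(fps):
    """Average time spent per frame at a given throughput."""
    return 1000.0 / fps


def extra_ms_per_frame(graphstudio_fps, dxvachecker_fps):
    """Per-frame time added when going from the GraphStudio run to DXVAChecker.

    Assumption: the whole difference is attributed to the render path.
    """
    return frame_time_ms(dxvachecker_fps) - frame_time_ms(graphstudio_fps)


for name, rates in FPS.items():
    extra = extra_ms_per_frame(rates["graphstudio"], rates["dxvachecker"])
    print(f"{name}: {extra:.2f} ms extra per frame")

# Real-time budget per frame for comparison.
print(f"24 fps budget: {frame_time_ms(24):.1f} ms, 60 fps budget: {frame_time_ms(60):.1f} ms")
```

Under that assumption the extra cost is on the order of 1 to 3 ms per frame, against a budget of roughly 42 ms per frame at 24 fps, which fits the point that the difference matters for benchmarks far more than for normal playback.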

The only time you really need everything the decoder can give you is either during benchmarking, or maybe for transcoding. The decoder is basically only limited by the consumer in this case, be it a renderer or an encoder.

In my opinion, just leave the speed be, and start thinking about deinterlacing. :)