View Full Version : Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing
nevcairiel
15th July 2012, 14:15
The problem with GPU scaling is that it's pretty limited in which pixel formats it can scale, which limits its usability in madVR. When I asked madshi about this, he told me that the GPU can only really be relied upon to scale NV12 properly; GPUs usually fall back to other (cheaper) algorithms on other content.
Also, I personally prefer slightly softer scaling than Lanczos4. On HD content you don't notice the difference, but on SD content it helps hide some of the source artifacts.
There is no one scaling algorithm that is good for everyone (at least none that is fast enough for real-time processing).
CruNcher
15th July 2012, 14:31
Yep, madshi integrating your scaler via Media SDK would rock. Being in that multi-GPU environment on NT 6 myself now, though, it feels odd. A lot of devs also seem to have issues handling it right: I have already found several issues in applications where encoding, for example, stops working even if the encoder is correctly selected (though using a command-line encoder works). Then there is the whole trouble with what is the primary device and the primary driver and what is the secondary. It gets really funny with apps that hook into WDM ;)
I hope Win 8 will ease that a little :D
Having control of all inputs in such a user-friendly way is powerful, though I still wonder why Intel limits the Media SDK encoder to a physical output. What is the reason for that? :)
I hope Virtu soon gets replaced by Nvidia's Synergy. Looking at Optimus and how advanced it is in the driver by now, it would be much better to have everything from one vendor that works at that low level all the time and can optimize in the driver stack, rather than just manipulating things on LucidLogix's end.
On the other hand, the idle power factor is slowly vanishing, seeing where AMD has been heading the whole time and where Nvidia is finally heading too. Kepler was about time, and they did nice work imitating AMD/Intel ;)
And Intel is heading in the other direction, gaining more performance. That is perfect for end consumers: lots of choice with acceptable compromises, so everyone can survive :)
Now, if the interaction layers between all of those were better tuned, it would be huge, and in those terms I have to say LucidLogix has its place ;)
So yes, of course HyperPerformance is a dream: combining hardware from different vendors for one result. In terms of easy end-user management, though, I would say that for D3D at least we aren't that far away from it anymore, and LucidLogix does a very good job here ;)
egur
15th July 2012, 19:13
HW scaling doesn't have to go through Media SDK - EVR doesn't use it. Media SDK has a few limitations that make it hard to use properly in a renderer; it is better to use the D3D/DXVA API for that.
The MSDK encoder is an example - people can integrate the relevant parts (sources are open) into their application. The encoder API outputs to memory, not disk.
malmsteen81
15th July 2012, 19:43
Hi guys, if I buy a Z68 motherboard with an i7 2600K, can I use QuickSync for decoding video with ffdshow and use an Nvidia GTS 450 for SVP interpolation and madVR? I know that QuickSync is disabled with P67.
thanks
egur
15th July 2012, 20:22
Hi guys, if I buy a Z68 motherboard with an i7 2600K, can I use QuickSync for decoding video with ffdshow and use an Nvidia GTS 450 for SVP interpolation and madVR? I know that QuickSync is disabled with P67.
thanks
Yes you can do that, but you'll need to enable the Intel GPU by extending the desktop to it. See here (http://forum.doom9.org/showthread.php?p=1532786#post1532786)
malmsteen81
16th July 2012, 07:55
OK thanks. Another question: is the sharpness filter a normal edge enhancement or a detail enhancement?
egur
16th July 2012, 08:07
OK thanks. Another question: is the sharpness filter a normal edge enhancement or a detail enhancement?
No, it's much more sophisticated than a simple filter. Best quality is gained by setting the strength around 50%. Don't use max settings.
CiNcH
16th July 2012, 21:19
egur, how many 1080i streams is the media sampler fixed function unit able to deinterlace and scale? There is always talk about how many 1080p H.264 streams the fixed function decoder can handle, but none about 1080i at all. Do you have any idea?
egur
17th July 2012, 09:02
egur, how many 1080i streams is the media sampler fixed function unit able to deinterlace and scale? There is always talk about how many 1080p H.264 streams the fixed function decoder can handle, but none about 1080i at all. Do you have any idea?
I haven't benchmarked deinterlacing yet.
Actual performance is affected by memory speed, bitrate (decoder performance) and what the renderer does, among other things. iGPU clock speed is also relevant - it may not run at turbo clocks when the CPU is very busy.
It also matters whether you output 60p or 30p from the deinterlacer.
CiNcH
17th July 2012, 09:56
What I am talking about is PiP for live broadcasts, so H.264 1080i (MBAFF) at bitrates of about 15 Mbps. And of course I want the highest quality deinterlacing with frame doubling (50p for Europe). Let's assume we use standard EVR, memory is standard DDR3-1600 and everything runs at stock speed. Let's also assume an otherwise idle CPU and GPU. Do you think that both the HD 2500 and the HD 4000 can handle 2 such streams in parallel for PiP? Just your gut feeling... I have no feel for that at all.
egur
17th July 2012, 10:30
On the system you specified, if DI is performed in EVR, it should work for 2 streams maybe even more.
I just tested this on an IvyBridge laptop (HD 4000) with 1600MHz ram.
Used a 12 Mbps MBAFF 1080i (60i) stream.
Was borderline working with 5 streams. 4 streams were perfect.
Used the LAV decoder, as ffdshow doesn't like to appear more than once in the graph.
Used GraphStudioNext with:
* LAV splitter source
* LAV video decoder (QuickSync)
* EVR
CiNcH
17th July 2012, 10:37
Thanks for your test!
Used a 12 Mbps MBAFF 1080i (60i) stream.
Was borderline working with 5 streams. 4 streams were perfect.
And all of them went through the FF deinterlacer with 60p output, right?
I just tested this on an IvyBridge laptop (HD 4000) with 1600MHz ram.
Do you think that the throughput is significantly lower with an HD 2500? How heavily were the EUs occupied? I guess the FF blocks are the same for HD 2500 and HD 4000 based IVB processors?
On the system you specified, if DI is performed in EVR, it should work for 2 streams maybe even more.
Doesn't EVR also trigger the FF deinterlacer in DXVA mode? Is it so much different from the result with the Quick Sync API then?
egur
17th July 2012, 11:38
And all of them went through the FF deinterlacer with 60p output, right?
yes, output was 60p on all concurrent streams - this is the default for EVR.
Do you think that the throughput is significantly lower with an HD 2500? How heavily were the EUs occupied? I guess the FF blocks are the same for HD 2500 and HD 4000 based IVB processors?
Shouldn't be much lower, most of the work is done in FF which is identical between HD 2500 and 4000.
Doesn't EVR also trigger the FF deinterlacer in DXVA mode? Is it so much different from the result with the Quick Sync API then?
Same HW is used by both.
Note that I used EVR to deinterlace the streams. Using ffdshow's deinterlacing uses more system resources (and ffdshow usually doesn't work in multiple instances in a graph)
I also forgot to mention that the laptop was a dual core (4 threads) i5 IvyBridge CPU.
CiNcH
17th July 2012, 11:43
Pretty cool results! Thanks!
egur
17th July 2012, 11:51
It was better than I expected.
Probably a quad core could do better since the dual core (35W, 2.4GHz when stressed) hit 100% utilization with 5 streams.
CiNcH
17th July 2012, 12:02
Probably a quad core could do better since the dual core (35W, 2.4GHz when stressed) hit 100% utilization with 5 streams.
Do you have a clue what is causing this much CPU utilization?
egur
17th July 2012, 12:07
Do you have a clue what is causing this much CPU utilization?
Memory copies from GPU RAM to system RAM (and from system RAM to system RAM as well). This wasn't a DXVA test; DXVA will behave much better.
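To put rough numbers on the copy-back cost described above, here is a back-of-the-envelope sketch. It assumes the decoder hands out NV12 frames (12 bits per pixel) at 1920x1080 and 60 frames per second per stream, which the thread does not state explicitly. The function names are made up for illustration, and real surfaces are usually padded (for example to a height of 1088), so actual figures would be somewhat higher.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one NV12 frame: a full-resolution luma plane plus an interleaved
// chroma plane subsampled 2x2, i.e. 1.5 bytes per pixel.
// Assumption: no row padding or alignment is modelled.
inline std::uint64_t nv12FrameBytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // NV12 needs even dimensions
    return width * height * 3 / 2;
}

// Bytes per second moved by ONE copy pass (e.g. GPU RAM -> system RAM)
// for `streams` concurrent streams at `fps` output frames per second.
inline std::uint64_t copyBackBytesPerSecond(std::uint64_t width,
                                            std::uint64_t height,
                                            std::uint64_t fps,
                                            std::uint64_t streams) {
    return nv12FrameBytes(width, height) * fps * streams;
}
```

Under these assumptions, copyBackBytesPerSecond(1920, 1080, 60, 5) gives 933,120,000 bytes, roughly 0.93 GB/s for a single copy pass. Each additional system-RAM-to-system-RAM copy of the same frames adds the same amount again.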
CiNcH
17th July 2012, 12:13
Ah, I see. Didn't know that Quick Sync always "copies-back".
egur
17th July 2012, 12:20
Ah, I see. Didn't know that Quick Sync always "copies-back".
Right now it's the only option and one of its main features - it allows existing SW flows to work easily.
Pure DXVA is the most efficient but has many limitations. DXVA output is my next feature; there is no ETA for when it will be delivered, and probably only LAV will support it. LAV's developer (nevcairiel) must agree to do it before I even start designing this feature...
In ffdshow it will be extremely complex due to its internal video processing pipeline, which will not work with DXVA surfaces. ffdshow has a DXVA mode which is not very stable.
nevcairiel
17th July 2012, 13:20
I could support DXVA output; the question is what the interface would look like.
I don't know if you have read up on the subject yet, but I need to be able to provide a sample allocator that associates outgoing media samples with DXVA surfaces. For the "normal" DXVA codec this is done by allocating the surfaces inside this allocator and then telling the codec to decode onto those surfaces, so there is always a fixed link between the samples and the surfaces. It's important to know that the renderer may buffer a few of these surfaces, so they can only be reused once the renderer has released them.
Only you can tell me if you have such fine-grained control over the decoding process.
An "easy" solution would be to keep a number of delivery surfaces internally (say 4), and copy the video frame from the decoder's surface onto one delivery surface. I'm not sure how much of a performance hit that would be, but it would also allow blending subtitles onto the surface without the typical corruption ffdshow creates when doing that (because it does it on the decoder's surfaces, which are still used as reference frames). In any case it should be significantly faster than copying to system RAM.
In this model, your QS decoder would just tell me about the decoder's surface, and I would copy it onto my delivery surface and send it to the renderer. All communication with the renderer is handled by me, and you only have to output the surfaces - and you can let the MSDK manage the surfaces in any way it wants.
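The "easy" solution above, a small fixed set of delivery surfaces that are only reused after the renderer lets go of them, could be tracked roughly as follows. This is a plain C++ sketch with no D3D calls: DeliveryPool, acquire, release and freeCount are hypothetical names, and a real implementation would hold actual D3D surfaces and be driven by the sample allocator rather than by direct calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for a fixed set of delivery surfaces.
// Each slot stands in for one surface that frames are copied onto.
class DeliveryPool {
public:
    explicit DeliveryPool(std::size_t count) : inUse_(count, false) {}

    // Returns the index of a free delivery surface and marks it busy,
    // or -1 if the renderer is still holding every surface.
    int acquire() {
        for (std::size_t i = 0; i < inUse_.size(); ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Called once the renderer has released the sample tied to this surface.
    void release(int index) {
        assert(index >= 0 && static_cast<std::size_t>(index) < inUse_.size());
        assert(inUse_[static_cast<std::size_t>(index)]);
        inUse_[static_cast<std::size_t>(index)] = false;
    }

    // Number of surfaces currently available for a new frame.
    std::size_t freeCount() const {
        std::size_t n = 0;
        for (bool busy : inUse_) {
            if (!busy) { ++n; }
        }
        return n;
    }

private:
    std::vector<bool> inUse_;
};
```

With a pool of 4, a fifth acquire() returns -1 until the renderer hands one surface back. In that situation the decoder side would have to wait, which is the buffering constraint mentioned above.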
egur
17th July 2012, 13:57
What you're saying is already known to me (no surprises :) ).
Having 2 sets of surfaces (1 in QS and 1 in LAV) is an interesting option.
I thought that I should provide an option to output already copied D3D surfaces so subs could be placed.
The decoder allocator should be created by me. I need to make a small change to the existing D3D allocator to create DXVA compliant media samples. I have sample code for that in the Media SDK.
The output struct used today will be changed to add the DXVA media sample + flags that specify its properties (original/copy, D3D/system ram, etc).
Probably forgot a few things, I'll start working on it soon.
nevcairiel
17th July 2012, 14:04
Well, if you want to handle the allocator, that's fine by me. I just need a function I can call to get the allocator (my DXVA decoder exposes InitAllocator(IMemAllocator **ppAlloc)), and I also need to be able to detect without a doubt whether we're in "DXVA Native" mode or in CopyBack mode. You cannot switch between those at runtime; I need to tell the renderer this directly after connection.
While we're on the topic, it would be great to have a function that completely re-inits the decoder without it losing its D3D interfaces, so that the re-init can be done during exclusive mode without much trouble (for example in madVR exclusive mode, where all the interfaces are obtained before playback, exclusive mode is entered, and when we re-init now the interfaces won't be available anymore).
Last I tested, just calling InitDecoder(..) alone wasn't enough to, say, switch between input formats; I had to re-create the whole thing. I should probably try that again before making wild claims, though, it's been a while. ;)
egur
17th July 2012, 15:54
initDecoder can only be called once ATM. I can change this behavior but I need a good incentive.
The decoder gets a hard reset on every seek, so that's one option.
If the new stream has the same fourcc but different properties, my decoder will recover from this, though it might need to reallocate all the surfaces.
Switching to a new codec or a different fourcc (which impacts stream processing) is currently not supported.
I may not understand the use case you're referring to.
For full screen exclusive mode, I need the D3D device manager from the renderer. This functionality exists via the SetD3DDeviceManager method.
nevcairiel
17th July 2012, 17:01
Playback starts outside of exclusive mode with madVR and everything is fine. madVR goes into exclusive mode during playback, then the video format changes (maybe because the user changed the TV channel or something like that), and a re-init of the QS decoder is required.
Because madVR is in exclusive mode, and the QS decoder discards all D3D interfaces, you cannot re-init, and playback just fails (or falls back to software).
I guess I can always create my own D3D Device Manager (before entering exclusive mode) and supply that to the QS decoder, if nothing else helps.
egur
17th July 2012, 20:32
Playback starts outside of exclusive mode with madVR and everything is fine. madVR goes into exclusive mode during playback, then the video format changes (maybe because the user changed the TV channel or something like that), and a re-init of the QS decoder is required.
Because madVR is in exclusive mode, and the QS decoder discards all D3D interfaces, you cannot re-init, and playback just fails (or falls back to software).
I guess I can always create my own D3D Device Manager (before entering exclusive mode) and supply that to the QS decoder, if nothing else helps.
To be on the safe side, you should destroy QS and launch a new one; performance-wise it would be almost the same. You should query the renderer as soon as you connect to it and retrieve its IDirect3DDeviceManager9 interface.
Here's the code from the ffdshow proxy class:
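void TvideoCodecQuickSync::setOutputPin(IPin *pPin)
{
    if (!ok) { return; }

    // No output pin: clear the device manager and stop here,
    // otherwise pPin would be dereferenced below.
    if (NULL == pPin) {
        m_QuickSync->SetD3DDeviceManager(NULL);
        return;
    }

    IDirect3DDeviceManager9 *pDeviceManager = NULL;
    IMFGetService *pGetService = NULL;

    // Ask the renderer's pin for its service provider, then for the device manager.
    HRESULT hr = pPin->QueryInterface(__uuidof(IMFGetService), (void**)&pGetService);
    if (SUCCEEDED(hr)) {
        hr = pGetService->GetService(MR_VIDEO_ACCELERATION_SERVICE, IID_IDirect3DDeviceManager9, (void**)&pDeviceManager);
    }

    // Pass NULL if the renderer does not provide a device manager.
    m_QuickSync->SetD3DDeviceManager((SUCCEEDED(hr)) ? pDeviceManager : NULL);

    // Release the references obtained in this function.
    if (pDeviceManager) { pDeviceManager->Release(); }
    if (pGetService) { pGetService->Release(); }
}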
You should cache the IDirect3DDeviceManager9 for future use in case you need to reinit QS (or your DXVA decoder).
BTW, to support QS under FSE in Windows Media Center: after you connect to the renderer, get the IDirect3DDeviceManager9 interface and send it to the QS decoder, the decoder will not fail in the TestMediaType function because it can now create the HW acceleration devices.
Don't create your own IDirect3DDeviceManager9 - it's meaningless. I try to create one if not given one, but it will fail under FSE.
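One detail worth spelling out about caching the device manager as suggested above: the snippet releases its local reference at the end, so anything that keeps the pointer for a later re-init has to hold its own reference under the usual COM rules. The sketch below shows that pattern with a stand-in type so it compiles anywhere. FakeDeviceManager and DeviceManagerCache are hypothetical names, not part of ffdshow, LAV or the Media SDK, and with a real IDirect3DDeviceManager9 a smart pointer such as CComPtr would normally do this bookkeeping.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM interface such as IDirect3DDeviceManager9.
// Only reference counting is modelled; a real COM object would destroy
// itself when the count reaches zero.
struct FakeDeviceManager {
    unsigned long refs = 1;  // the creator starts with one reference
    unsigned long AddRef() { return ++refs; }
    unsigned long Release() { assert(refs > 0); return --refs; }
};

// Holds one extra reference for as long as the pointer is cached, so the
// cached pointer stays valid after the caller releases its own reference.
class DeviceManagerCache {
public:
    DeviceManagerCache() = default;
    DeviceManagerCache(const DeviceManagerCache&) = delete;
    DeviceManagerCache& operator=(const DeviceManagerCache&) = delete;
    ~DeviceManagerCache() { set(nullptr); }

    // Replace the cached pointer; pass nullptr to drop it.
    void set(FakeDeviceManager* mgr) {
        if (mgr) { mgr->AddRef(); }           // take our reference first
        if (cached_) { cached_->Release(); }  // then drop the old one
        cached_ = mgr;
    }

    // Borrowed pointer for a later re-init; the cache keeps ownership.
    FakeDeviceManager* get() const { return cached_; }

private:
    FakeDeviceManager* cached_ = nullptr;
};
```

Taking the new reference before dropping the old one keeps set() safe even when the same pointer is passed twice.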
andyvt
17th July 2012, 20:38
BTW, to support QS under FSE in Windows Media Center: after you connect to the renderer, get the IDirect3DDeviceManager9 interface and send it to the QS decoder, the decoder will not fail in the TestMediaType function because it can now create the HW acceleration devices.
Will this also provide support in other FSE players?
nevcairiel
17th July 2012, 20:45
Don't create your own IDirect3DDeviceManager9 - it's meaningless. I try to create one if not given one, but it will fail under FSE.
As I explained above, this was only for renderers that are not EVR and thus do not supply their IDirect3DDeviceManager9.
And of course it only works if you start outside of FSE, so you can create it and keep it for later.
Anyway, to support this flow I still need to rework some of the decoder's internals; right now I assume it has everything it needs when the input pin is connected. I'll do it at some point.
Will this also provide support in other FSE players?
If they use EVR, then yes.
egur
17th July 2012, 20:55
OK, I see your point. I only tested FSE with EVR as MadVR didn't cause any issues. Your use case is new to me.
I'll change initDecoder so it would work multiple times while retaining the IDirect3DDeviceManager9 interface from the last instance.
Can you recommend a way to test this under real world conditions?
nevcairiel
17th July 2012, 21:00
If it's a lot of effort, don't bother with it. Testing in the real world is also not easy; I don't think I have a setup off-hand that would need that.
I was testing something when I ran into the problem a while ago, where I modified LAV to re-init the decoder on live TV channel changes (within the same codec; a different codec causes a full graph rebuild), and it froze under madVR, but I have since reworked the code.
If it ever becomes a real problem for me, I can always provide it with a device manager.
egur
17th July 2012, 22:03
The effort may not be large, but being unable to properly test things usually leads to delays. When I get some free time I'll give it a shot.
egur
18th July 2012, 12:48
Will this also provide support in other FSE players?
Most FSE enabled players turn to FSE after the movie is loaded. So far the exception is Windows Media Center which is always in FSE.
If the renderer is EVR, it can provide the mentioned interface so HW acceleration devices can be created under FSE.
I guess when Microsoft designed FSE they didn't have DXVA in mind as the above inconvenient flow looks like an ugly patch.
andyvt
18th July 2012, 13:28
Most FSE enabled players turn to FSE after the movie is loaded. So far the exception is Windows Media Center which is always in FSE.
SageTV is another exception (or at least it can be).
nevcairiel
18th July 2012, 13:37
I guess when Microsoft designed FSE they didn't have DXVA in mind as the above inconvenient flow looks like an ugly patch.
The generic concept of DXVA is that you always ask the renderer for the interfaces to perform decoding; you never want to create your own.
It's only those renderer-independent DXVA things that cause such issues, the "copy back" solutions. ;)
Luckily NVIDIA was smarter and lets you access the HW decoder without a D3D device. Even works without a screen connected at all. Intel should totally do that ;)
STaRGaZeR
18th July 2012, 17:05
intel should totally do that ;)
;) ;)
egur
18th July 2012, 21:59
Yes, it's in my wish list - in the top 3.
wanezhiling
21st July 2012, 10:22
win7 x64
I3 2330M
HD3000
Driver 2761
http://pan.baidu.com/netdisk/singlepublic?fid=686481_1035451090
Corrupt image, tested with latest ffdshow QS x86, latest LAV QS x86, latest PotPlayer QS x86
PS: Native DXVA and DXVA CB are fine.
egur
21st July 2012, 15:13
win7 x64
I3 2330M
HD3000
Driver 2761
http://pan.baidu.com/netdisk/singlepublic?fid=686481_1035451090
Corrupt image, tested with latest ffdshow QS x86, latest LAV QS x86, latest PotPlayer QS x86
PS: Native DXVA and DXVA CB are fine.
I'll take a look later on. Thanks.
Edit
I didn't see any corruption but I'm using a newer driver, using LAV splitter and Haali. It's also a long clip so where (time) did you see the corruption?
Which splitter was used?
wanezhiling
22nd July 2012, 14:02
I didn't see any corruption but I'm using a newer driver, using LAV splitter and Haali. It's also a long clip so where (time) did you see the corruption?
Which splitter was used?
http://i.imgur.com/Gr6vl.jpg
ffdshow rev4475 icl12
LAV 0.51.3
http://i.imgur.com/IH3gY.jpg
http://i.imgur.com/k4M5H.jpg
egur
22nd July 2012, 17:20
http://i.imgur.com/Gr6vl.jpg
ffdshow rev4475 icl12
LAV 0.51.3
http://i.imgur.com/IH3gY.jpg
http://i.imgur.com/k4M5H.jpg
This sort of corruption is caused by a bad Media SDK DLL installation.
Install the driver from Intel's download center.
NikosD
22nd July 2012, 20:14
Latest LAV (0.51.3) with latest Intel drivers (2792) broke H.264 HW acceleration for QS, it falls back to software for every H.264 file I tried.
HW acceleration for MPEG2, VC-1, WMV3 works well, as always.
egur
22nd July 2012, 20:39
Latest LAV (0.51.3) with latest Intel drivers (2792) broke H.264 HW acceleration for QS, it falls back to software for every H.264 file I tried.
HW acceleration for MPEG2, VC-1, WMV3 works well, as always.
These are Windows 8 drivers; I don't think they work well in Win7.
The latest Win7 driver is 15.26.12.64.2761.
The 2792 driver belongs to the 15.28 line (still in beta!), which is targeted at Windows 8.
I'll give it a try on Windows 7 and see if there's a fix.
Update:
Everything works with the exception of several H264 clips in my test suite. I'll try to fix this within the next few days.
Update 2:
The Win8 2792 driver is not in good shape for everyday use. Please use 2761 or newer drivers from the 15.26.xx.xx family for both IvyBridge and SandyBridge.
I'll report the issues I've found and hopefully by Win8 launch time (Oct. 26), the driver will be in good shape.
For the sharp eyed testers out there, the 2792 driver seems to play transposed 720p (720x1280) clips on SandyBridge which no other driver managed to do. But 4K doesn't work.
wanezhiling
23rd July 2012, 10:15
This sort of corruption is caused by a bad Media SDK DLL installation.
Install the driver from Intel's download center.
Thanks, I knew where the problem was, now everything is fine.
NikosD
23rd July 2012, 10:30
For the sharp eyed testers out there, the 2792 driver seems to play transposed 720p (720x1280) clips on SandyBridge which no other driver managed to do. But 4K doesn't work.
Your last sentence gives me hope that there are Intel engineers trying to make QS on SandyBridge work with 4K resolutions :)
This is very good!
CharlieCL
25th July 2012, 03:41
The generic concept of DXVA is that you always ask the renderer for the interfaces to perform decoding; you never want to create your own.
It's only those renderer-independent DXVA things that cause such issues, the "copy back" solutions. ;)
Luckily NVIDIA was smarter and lets you access the HW decoder without a D3D device. Even works without a screen connected at all. Intel should totally do that ;)
MS's solution is to bind DXVA and EVR together. That is very limited. An ideal solution would let the HW decoder be used with any renderer. HW decoder is not related to GPU. It is just a HW function, or it is run by a special DSP. HW decoder is not a D3D device.
nevcairiel
25th July 2012, 07:07
HW decoder is not related to GPU.
Of course it is; it's the GPU doing the HW decoding.
The fact of the matter is that DXVA is the only generic interface available to us on Windows, so you can complain all you want, but it's all we have.
However, it's not as dark as you paint it. DXVA2 is not tied to EVR (in contrast to DXVA1); in fact, any renderer could implement it, and it's well documented on MSDN. Heck, you can even use it without any renderer at all.
The only problem you have to overcome is when the system is in D3D exclusive mode, because you can't open the required D3D interfaces. That's where Microsoft made the link to EVR to provide these interfaces - but again, it's nothing that is only available to EVR; another renderer could implement this too.
egur
25th July 2012, 07:08
MS's solution is to bind DXVA and EVR together. That is very limited. An ideal solution would let the HW decoder be used with any renderer. HW decoder is not related to GPU. It is just a HW function, or it is run by a special DSP. HW decoder is not a D3D device.
The interface that Microsoft defined for HW decoders is DXVA/D3D. So it is a D3D device.
There are other ways to communicate with the HW codec, but they differ between manufacturers and between models, and these interfaces may not be public.
QuickSync encoder is used this way as DXVA doesn't support encoding.
Adding to Nev's DXVA limitation:
Can't create a D3D device on a headless (disconnected) GPU. Very serious limitation.
CharlieCL
25th July 2012, 18:40
DXVA2 is not tied to EVR (in contrast to DXVA1); in fact, any renderer could implement it, and it's well documented on MSDN.
But for a non-DXVA2 renderer the HW acceleration is disabled in decoder. The software decoder will work. The decoder will notify the video renderer that the decoder is using DXVA decoding. If a renderer does not support DXVA2, this renderer will be replaced by a default renderer. That is the problem.
I can only find the DXVA2 for decoder sample code but no DXVA2 for renderer sample code. The only available sample is EVR-CP.
nevcairiel
25th July 2012, 19:00
Your setup must be weird. When I use LAV Video in DXVA2 mode and a renderer is not capable of handling DXVA2, LAV falls back to software mode; the renderer is not changed.
wanezhiling
25th July 2012, 19:07
If a renderer does not support DXVA2, this renderer will be replaced by a default renderer. That is the problem.
Update your LAV.
CharlieCL
25th July 2012, 20:52
Your setup must be weird. When I use LAV Video in DXVA2 mode and a renderer is not capable of handling DXVA2, LAV falls back to software mode; the renderer is not changed.
This happened when some media types were set up to be played by MS codecs. In LAV it worked in software mode.
However, what I expected was to run with hardware acceleration whether the renderer supports DXVA2 or not.
Why can't CreateD3DDeviceManager() be done in the decoder?
And why can't the buffer inside the decoder be accessed from any renderer?