Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing
egur
4th September 2011, 22:26
Updated June 22nd 2013
Hi,
My name is Eric Gur and I've taken upon myself a side project at my Intel position to make the Intel SandyBridge (or newer) hardware accelerated video decoding technology freely accessible to everyone.
The project name is Intel QuickSync Decoder.
To do so, I decided to embed the Intel QuickSync technology introduced in SandyBridge into the widely popular FFDShow video decoder filter.
Nowadays, the Intel QuickSync Decoder is officially integrated in FFDShow, LAV Video Decoder and PotPlayer.
Main features
* HW decode using Intel's high performance QuickSync engine.
* Decodes H264, MPEG2, VC-1, WMV9. DVD playback not supported.
* HW deinterlacing - auto or forced, with half or full (50/60p) output rate
* HW denoise and detail filters
* Soft 3:2 pulldown on marked streams.
* Supports variable frame rate streams.
* Supports headless iGPU (Intel GPU disconnected from display) on Windows 8 and newer.
If your system meets the requirements, I'd appreciate stability feedback across video content of assorted quality and sources.
To report a bug or request a feature, please post in this thread.
If something is broken, please read the known issues section below first, then provide me with a detailed report including:
1. Hardware (CPU, GPUs)
2. Software (OS, driver version, player, splitter, etc.)
3. Access to the offending content. Share via your favorite file share sites. Limit content to <100MB.
Requirements:
1. SandyBridge (2nd Generation Core i3/i5/i7/Celeron/Pentium) or newer. Older platforms will not work and there are no plans to support them.
2. Latest Intel graphics drivers. The Intel GPU must be the primary GPU, drive an extended display, or be used via Lucid Virtu.
3. Windows 7 (32/64) or newer OS. Should work in Vista but I can't test this.
Known Issues:
* Jumpy playback or heavy corruption on many clips is the result of drivers obtained from Windows Update. Download drivers from your OEM website or directly from Intel's download center (http://downloadcenter.intel.com/). Some versions of Lucid Virtu will cause video playback in a 64-bit player to display frames out of order.
* Wrong frame rate or incorrect aspect ratio: Haali Media Splitter is sending corrupt time stamps or aspect ratio. LAV Splitter is recommended.
* After a seek in a TS file, corruption is seen for a few frames. Known LAV Splitter issue.
* Resolutions greater than 1080p aren't supported on SandyBridge.
Installation:
1. An ffdshow installer is supplied.
2. Open the FFDShow configuration dialog and select 'Intel QuickSync' on the codec page for the desired formats (H264/VC1/MPEG2).
Version 0.45 is out with the following changes:
* Bugfix - frames were sometimes treated as interlaced.
* Bugfix - time stamps are passed 'as is' when TS manipulation is off.
* Bugfix - time stamp handling was causing A/V delay.
* Changed: AnnexB type packets (AVC in TS files) are not pre-processed and are sent to the HW decoder directly. May break a broken clip or two but saves many others.
* Sync with MSDK 2014 files.
* FFDShow: r4531
Downloads
* For the latest cutting-edge FFDShow builds, download my builds from the Intel QuickSync Decoder SourceForge home page (http://sourceforge.net/projects/qsdecoder/)
* FFDShow-tryout site (http://ffdshow-tryout.sourceforge.net/download.php)
* LAV Splitter builds (http://forum.doom9.org/showthread.php?t=156191)
Guest
4th September 2011, 22:32
Welcome to the forum, Eric! And thanks for your contribution. I haven't got a SandyBridge but I'm sure you will get a lot of testers here.
Eliminateur
5th September 2011, 00:45
egur, this is very good to know, i have some questions:
1) Does SB have specific problems with DXVA interfaces, such that it needs specific QuickSync support? It's known to crash MPC-HC and ffdshow DXVA as well.
2) What about the Pentium Gxxx series?, since they don't have quicksync...
egur
5th September 2011, 07:38
egur, this is very good to know, i have some questions:
1) Does SB have specific problems with DXVA interfaces, such that it needs specific QuickSync support? It's known to crash MPC-HC and ffdshow DXVA as well.
2) What about the Pentium Gxxx series?, since they don't have quicksync...
1) I'm not aware of any specific DXVA issues. The QuickSync implementation is done with DXVA and further abstracted by the Intel Media SDK, which I've used to create this FFDShow version. Using the DXVA interface directly isn't trivial and needs quite a few workarounds; the Media SDK takes care of some of them. I had trouble myself with FFDShow-DXVA using both Intel graphics and an AMD Radeon 6950. So far I've managed to play dozens of HD (and non-HD) movies well, but I don't think the SW is at production level. I haven't tested with MPC-HC yet but I will.
2) Regarding the Pentium brand, I don't know. If someone has it, please let me know.
kirakami
5th September 2011, 08:47
What is Sandy Bridge?
Will an Intel Pentium 4 CPU built in 2001 be supported?
& Geforce 4 mx
egur
5th September 2011, 09:00
What is Sandy Bridge?
Will an Intel Pentium 4 CPU built in 2001 be supported?
& Geforce 4 mx
SandyBridge is the codename for Intel's latest generation CPU. Also called "2nd Generation i3/i5/i7 Core Processor".
SandyBridge has 2-4 cores, an integrated GPU, an integrated memory controller and an integrated PCIe controller.
Pentium 4 doesn't have the HW needed and will definitely not work. My build of FFDshow might work on Core 2 Duo/Quad and i3/i5/i7 if and only if there's an Intel integrated GPU (can be found in many laptops and low end desktops). This wasn't tested though.
It will not work on AMD processors either as they do not have compatible HW.
My build should work on future processors with Intel graphics such as IvyBridge and Haswell.
namaiki
5th September 2011, 10:28
My build of FFDshow might work on Core 2 Duo/Quad and i3/i5/i7 if and only if there's an Intel integrated GPU (can be found in many laptops and low end desktops). This wasn't tested though.
Unfortunately doesn't seem to work on my i5 with Intel HD graphics (Arrandale).
Tested on Windows 7 (64-bit) in MPC-HC (32-bit).
CruNcher
5th September 2011, 11:59
Nice work Eric, though I guess it won't do any better than Intel's own decoder sample in IMSDK 3?
At least for MPEG-2 it seems questionable whether the hassle with different setups is worth it. By my measurements it saves somewhere around 1W on my Core i5-2400 compared to ffmpeg's decoder, and you will still have all the hassle with MPEG-2 Studio 4:2:2 switching, as the Intel decoder, like Nvidia's, is not capable of doing this with DXVA :)
Of course it looks totally different for H.264 (that is where the biggest saving is compared to the world's most performant software decoders, but again, once we get to the 10-bit 4:2:2 and 4:4:4 or lossless area everything falls apart again)
But for VC-1 I'm also not sure; at least WMV3 seems not to perform much better on QuickSync than libavcodec's decoder on the CPU :)
http://forum.doom9.org/showthread.php?p=1523685#post1523685
a follow up on that terminating overhead further
http://forum.doom9.org/showthread.php?p=1523692#post1523692
though it's cool that you (Intel) now also want to optimize based on samples like Nvidia did in the early days :)
first thing you should look @ this sample http://forum.doom9.org/showthread.php?p=1523293#post1523293
I tried a lot but I can't get it stable with EVR and Intel's decoder (it doesn't matter which splitter; the tree pan doesn't get smooth when hardware decoded, and it's a no-go with Microsoft's DTV decoder too; the only solution for this sample is the LAV-based framework on EVR, where it gets perfectly smooth and perfectly telecined)
and then there is my issue with my sample.ts (also telecined though H.264) on EVR custom but im not so sure if this is a Intel fault though Software decoding again works fine but Hardware fails with EVR Custom see a Video of this issue http://mirror05.x264.nl/CruNcher/mpc-hc/ (Btw made with Quicksync ;) ) <- Fixed with FFdshow for Quicksync :)
Intel Driver is = 8.15.10.2476 (Windows 7 64 bit)
Im trying your decoder now with all this
PS: You should mention that it's 32 Bit in your post ;)
Superb news my sample.ts (H.264) (EVR Custom) issue is history with this, perfect telecined 23.976 :)
Perfect awesome it doesn't allow Mpeg-2 Studio Profile connection and so fallbacks like it should be :)
This is the most awesome Decoder for Quicksync currently (except overhead being not DXVA2 Native is huuuge depending on stream see here after bugs http://forum.doom9.org/showthread.php?p=1523906#post1523906) :)
Though the correct telecine to 24.30 (evil_tree Mpeg-2 1080i 29.970 sample) is also problematic with it on EVR; it does 0.30 fps too much, it seems (interlace flags off) :(
http://img26.imageshack.us/img26/47/eviltreeffdshowquicksyn.png
Really tricky :D
this is what it should look like in the end (works only on EVR normal);)
http://img3.imageshack.us/img3/3503/smoothmotion.png
else you wont get the tree pan smooth ;)
Default Telecine works perfect even on EVR Custom :)
http://img18.imageshack.us/img18/8587/defaulttelecineperfect.png
It also likes to crash with several *.ts files in combination with Lav Splitter (those crashy ones work fine with the Internal MPC-HC ts splitter) http://forum.doom9.org/showthread.php?t=156191 Yep it crashes a lot with Lav Splitter :(
No Vsync no Exclusive mode nothing just Aero and Quicksync (again you can nicely see the jitter the Stats and Graph Rendering causes current EVR Custom OSD overhead) :D
http://img706.imageshack.us/img706/8000/novsyncjustaeroquicksyn.png
CruNcher
5th September 2011, 17:11
Major issues with VC-1 in *.ts either Sync problems or Rendering issues (different VC-1 Interlace encoding mixed modes) :(
Sync Issues:
http://img405.imageshack.us/img405/1606/vc1syncissuesffdshowqui.png
Rendering Issues: (This Problem Nvidia fixed ages ago ;) )
http://img851.imageshack.us/img851/7111/vc1interlaceproblemsffd.png
It also crashes for both with Lav Splitter had to switch to MPC-HC Internal Splitter ;)
Incorrect Telecine again :(
Lav-Splitter->Lav-Audio->FFdshow quicksync : (Incorrect)
http://img822.imageshack.us/img822/4110/mpeg2lavsplitterffdshow.png
MPC-HC Internal->Lav-Audio->FFdshow quicksync : (Correct)
http://img827.imageshack.us/img827/4647/mpeg2mpchcffdshowquicks.png
Though I'm starting to wonder whether this is DXVA2 hardware playback at all, because MPC-HC doesn't show any DXVA2 information (or is it more like Nvidia's NVcuvid API, i.e. Intel's own API? Even for that it would be heavy overhead just for playback), as I get much, much lower CPU utilization with Microsoft's DTV-Decoder (DXVA2) on H.264 streams. (Let's see, 4 girls is coming ;))
Yeah really heavy that overhead on this small HD2000 compared to Microsofts DXVA2 :)
ffdshow-quicksync overhead:
http://img835.imageshack.us/img835/4941/ffdshowquicksyncomgover.png
Native DXVA2 is still the way to go (imho we just need a better optimized playback framework for Quicksync and not only for it ;) ) :)
http://img834.imageshack.us/img834/7083/msdtvdxva2yoonyoon.png
Though will be really interesting to compare vs Nvcuvid overhead :)
Known Issues:
1. Higher CPU usage on low bitrate clips.
No comment :D
Blight
5th September 2011, 20:50
The major issue here is the overhead the driver adds for memory copies.
John Carmack (ID Software) wrote about it in this interview (http://www.pcper.com/reviews/Editorial/John-Carmack-Interview-GPU-Race-Intel-Graphics-Ray-Tracing-Voxels-and-more).
The topic of the GPU hardware race came up early in our talk and the response Carmack gave us was pretty interesting. Stating “I don’t worry about the GPU hardware at all, I worry about the drivers” seemed to be a reiterated point. This became very apparent to id Software while developing RAGE where even though the PC had truly an order of magnitude more horsepower than the consoles, it struggled to keep up with the “minimum latency”, get feedback here, update data there, etc and do it all to maintain a 60 Hertz frame rate. DirectX 11 and multi-threaded drivers might have helped things but he still claims that they are far from the solution he envisions: direct surfacing of the memory system. The process of updating a textures on the PC is on the order of “tens of thousands of times slower” than on the Xbox 360 and PS3. AMD did implement a “multi-texture” update specifically for id Tech 5 which should help, but from the interview you can tell that Carmack really does want more done on this topic.
One interesting side effect of this talk – Intel’s integrated graphics actually has impressed Carmack quite a bit and the shared memory address space could potentially fix much of this issue. AMD’s Fusion architecture, seen in the Llano APU and upcoming Trinity design, would also fit into the same mold here. He calls it “almost a forgone conclusion” that eventually this type of architecture is going to be the dominant force. You might remember our discussion of this topic with Josh’s analysis of AMD’s Fusion System Architecture – it would appear that AMD has a potential ally on its side if they are paying attention.
The same situation applies here too. Basically, the Intel GPU driver provides virtual GPU memory that in reality resides in the system ram.
But... you can't get direct access to that memory. The way the driver provides access to this memory is 1000's of percent slower than if the driver were able to point to the real memory address and let you just copy the image directly.
nevcairiel
5th September 2011, 20:57
The main problem here is actually copying stuff back from the GPU memory to the CPU/system memory, which only NVIDIA seems to have really managed to optimize properly for CUDA. It's not a task a game needs, which is why AMD never really cared to invest in it (and is therefore really slow at it). Intel doesn't seem to get that much performance on the GPU -> CPU copies either.
It's probably true that drivers are holding back the true potential of the current and next gen hardware.
egur
5th September 2011, 21:21
CruNcher:
First, thanks a lot for your analysis. That's the best way to get my little SW running properly...
I'd like to explain what I did in FFDShow.
I used the Intel Media SDK v3 beta 3 Direct Show filters sample code. Stripped most of it, fixed several bugs, cleaned it up, some refactoring, put some inline documentation and created a DLL that exports an interface.
My code doesn't use any secret APIs or secret driver GUIDs and doesn't contain any algorithms. It's quite simple and not very big.
Intel's Media SDK uses DXVA1/2 to communicate with the driver/HW (that's what I've heard anyway). What it does is somewhat abstract the horrible DXVA API making this task easier (but not easy!) and use less code.
The (relatively) high CPU usage is caused by one thing - memory copying from the GPU to system memory. I'll try to reduce this by trying to do VPP (DXVA/MSDK video post processing) to a system memory buffer. Hopefully the driver will do the copy faster than memcpy().
My idea with FFDShow is to have a 1 stop decoder that's low on power and high on quality. I want to abstract the HW acceleration and hopefully don't lose too much because of the above frame copying.
I used a profiler to check where the CPU spends its time and most of the time is copying the frame to system memory. A large chunk (25-50%) goes into the renderer's code somewhere. No clue as to why.
Just using DXVA to decode isn't trivial as different splitters behave differently and give different data and maybe the HW decoders aren't following the various specs to the letter. Microsoft's documentation isn't clear enough on how to write things properly. Theoretically they could have created a DXVA decoder themselves, but they didn't. Same goes to Intel/AMD/Nvidia.
My own CPU usage analysis shows that on low/medium bitrates, libavcodec uses less CPU than my implementation, but when bitrates are high (I have only one 26Mbps clip) the CPU usage stays about the same in my decoder and rises in libavcodec.
BTW, if someone knows how to copy a frame from the GPU quickly I'd like to know. Since there's no PCIe traffic going on, a solution is bound to be found.
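As a rough illustration of the frame copy discussed above (not the actual QuickSync decoder code), here is a minimal Python sketch of repacking a decoded frame row by row. It assumes an NV12 surface whose rows are padded to a `pitch` wider than the visible `width`. The NV12 layout, the function name `copy_nv12_rows` and its parameters are all assumptions made for illustration, since the thread does not specify the surface format.

```python
def copy_nv12_rows(src: bytes, pitch: int, width: int, height: int) -> bytes:
    """Repack a pitched NV12 frame into a tightly packed buffer.

    Assumed layout (hypothetical, for illustration only):
      - `height` rows of luma (Y), each `pitch` bytes apart
      - `height // 2` rows of interleaved chroma (UV), same pitch
    Only the first `width` bytes of every row carry picture data.
    """
    if pitch < width:
        raise ValueError("pitch must be >= width")

    rows = height + height // 2          # Y rows plus UV rows
    if len(src) < rows * pitch:
        raise ValueError("source buffer is too small for the given geometry")

    out = bytearray(rows * width)
    view = memoryview(src)               # avoids intermediate copies when slicing
    for row in range(rows):
        start = row * pitch
        # One small copy per row: this per-row loop is where the CPU time goes.
        out[row * width:(row + 1) * width] = view[start:start + width]
    return bytes(out)
```

When `pitch == width` the whole frame could be copied in one go; the per-row loop is only needed because of the padding, which is one reason a driver-side copy might be faster than a plain memcpy() loop.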
nevcairiel
5th September 2011, 21:27
Microsoft's documentation isn't clear enough on how to write things properly. Theoretically they could have created a DXVA decoder themselves, but they didn't.
Oh, but they did. For H264 it's called the Microsoft DTV-DVD Video Decoder, and it ships with Vista/7.
They also have one for VC-1, the WMVideo Decoder DMO, but for some reason this one only uses DXVA in WMP, it must be locked down somehow.
Of course their decoders are "pure" DXVA, which means they don't copy stuff back from the GPU, it remains in there until it is displayed - avoiding the memcpy problem.
egur
5th September 2011, 22:20
...
Its probably true that drivers are holding back the true potential of the current and next gen hardware.
If that was done on purpose then they (Intel/AMD/NVidia) could sell a premium part for more money that doesn't have this limitation and call it a feature. Most likely it's a low-priority issue that no one wants to spend resources on (HW or SW).
The reason for the slowness as far as I've heard (aside from the PCIe latency and BW) is that the GPU stores surfaces differently than the CPU. A GPU in many cases needs to work on blocks or tiles (e.g. 8x8, 16x16, etc.) and if those pixels are sequential in physical memory then they are read/written much faster and provide higher cache hit rates as well as efficient cache prefetching. So when a CPU tries to read several bytes each time (inner loop of memcpy) there are a lot of address translations and the memory controller needs to set up the DDR again and again for different pages.
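To make the layout argument concrete, here is a toy Python sketch comparing a linear surface with a tiled one. The 4x4 tile size and the row-major ordering of tiles are assumptions chosen for readability; real GPU tiling formats differ and are not described in this thread. The function names are hypothetical.

```python
def linear_offset(x: int, y: int, width: int) -> int:
    """Byte offset of pixel (x, y) in a plain row-major surface (1 byte/pixel)."""
    return y * width + x


def tiled_offset(x: int, y: int, width: int, tile: int = 4) -> int:
    """Byte offset of pixel (x, y) when the surface is stored tile by tile.

    Assumption: square tiles of `tile` x `tile` pixels, tiles laid out
    row-major, pixels inside a tile also row-major.
    """
    if width % tile:
        raise ValueError("width must be a multiple of the tile size")
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + within_tile


def row_offsets(y: int, width: int, tile: int = 4) -> list:
    """Offsets touched when a CPU walks one picture row of a tiled surface."""
    return [tiled_offset(x, y, width, tile) for x in range(width)]
```

Walking a picture row on the linear surface touches consecutive addresses, while on the tiled surface the address jumps by a whole tile every `tile` pixels. With real surfaces and larger tiles those jumps can land on different memory pages, which is the effect described above.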
CruNcher
6th September 2011, 00:00
The major issue here is the overhead the driver adds for memory copies.
John Carmack (ID Software) wrote about it in this interview.
Quote:
The topic of the GPU hardware race came up early in our talk and the response Carmack gave us was pretty interesting. Stating “I don’t worry about the GPU hardware at all, I worry about the drivers” seemed to be a reiterated point. This became very apparent to id Software while developing RAGE where even though the PC had truly an order of magnitude more horsepower than the consoles, it struggled to keep up with the “minimum latency”, get feedback here, update data there, etc and do it all to maintain a 60 Hertz frame rate. DirectX 11 and multi-threaded drivers might have helped things but he still claims that they are far from the solution he envisions: direct surfacing of the memory system. The process of updating a textures on the PC is on the order of “tens of thousands of times slower” than on the Xbox 360 and PS3. AMD did implement a “multi-texture” update specifically for id Tech 5 which should help, but from the interview you can tell that Carmack really does want more done on this topic.
One interesting side effect of this talk – Intel’s integrated graphics actually has impressed Carmack quite a bit and the shared memory address space could potentially fix much of this issue. AMD’s Fusion architecture, seen in the Llano APU and upcoming Trinity design, would also fit into the same mold here. He calls it “almost a forgone conclusion” that eventually this type of architecture is going to be the dominant force. You might remember our discussion of this topic with Josh’s analysis of AMD’s Fusion System Architecture – it would appear that AMD has a potential ally on its side if they are paying attention.
The same situation applies here too. Basically, the Intel GPU driver provides virtual GPU memory that in reality resides in the system ram.
But... you can't get direct access to that memory. The way the driver provides access to this memory is 1000's of percent slower than if the driver were able to point to the real memory address and let you just copy the image directly.
Also, when we talk about GPU/CPU efficiency we have to come to the OS itself and its current driver architecture. WDDM 1.1 is just the start of this process; the next Windows is going to bring the next step, until we some day reach WDDM 2.0 :)
We already had a similar discussion on Beyond3D, and nobody really wants to go back to coding the GPU directly, assembler style, anymore, so yeah, it's up to Microsoft and the vendors to improve this ;)
The (relatively) high CPU usage is caused by one thing - memory copying from the GPU to system memory. I'll try to reduce this by trying to do VPP (DXVA/MSDK video post processing) to a system memory buffer. Hopefully the driver will do the copy faster than memcpy().
:)
My idea with FFDShow is to have a 1 stop decoder that's low on power and high on quality. I want to abstract the HW acceleration and hopefully don't lose too much because of the above frame copying.
Nvidia was very successful with this :)
Just using DXVA to decode isn't trivial as different splitters behave differently and give different data and maybe the HW decoders aren't following the various specs to the letter. Microsoft's documentation isn't clear enough on how to write things properly. Theoretically they could have created a DXVA decoder themselves, but they didn't. Same goes to Intel/AMD/Nvidia.
Yeah, true. Many ISVs know that, and some do better than others in that regard. Having more open and better documented APIs like Nvcuvid, Open Video and the Intel Media SDK is great and will hopefully make this easier for devs. Lav Cuvid and ffdshow-quicksync are nice examples, though Nvidia is still in the lead here, and both AMD and Intel came late to the game. :)
It also makes it much easier to adapt to a new renderer that doesn't support DXVA and to use its full capabilities without being limited :)
egur
6th September 2011, 16:30
CruNcher:
Regarding the "evil trees" clip. I get very strange results from different splitters. The LAV splitter reports 59.94 fps while haali and the Gabest MPEG splitter report 29.97.
All splitters produce a cadence of P B T P B .... (progressive, bottom first, top first) and all of them start past the zero time stamp (something like 4 missing frames). I'll dig into this to make sure I behave properly on all of them.
I need a VC1 clip that crashes like you reported; currently I don't have crashing content. Also, what source filters are used for VC1 (.wmv)? The WM ASF Reader freezes too much (regardless of decoder).
I've fixed the seeking issue and now seeks are instantaneous without artifacts.
I also fixed MPEG2 sequence header initialization, which will fix seek corruption.
I'll release a new build in a day or two.
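As general background to the frame rates reported above (23.976, 29.97 and 59.94), here is a small Python sketch of the 3:2 pulldown arithmetic using exact fractions. It only shows how the nominal rates relate; it does not describe how any of the splitters or the decoder compute their values, and the function names are hypothetical.

```python
from fractions import Fraction

# Nominal NTSC-family rates as exact fractions.
FILM_RATE = Fraction(24000, 1001)         # ~23.976 fps


def pulldown_field_counts(n_frames: int) -> list:
    """Number of fields each film frame is held for under a 3,2,3,2,... cadence."""
    return [3 if i % 2 == 0 else 2 for i in range(n_frames)]


def telecined_frame_rate(film_rate: Fraction) -> Fraction:
    """Interlaced frame rate after 3:2 pulldown.

    Four film frames become 3 + 2 + 3 + 2 = 10 fields, i.e. 5 interlaced
    frames, so the frame rate is scaled by 5/4.
    """
    return film_rate * Fraction(5, 4)


def field_rate(frame_rate: Fraction) -> Fraction:
    """Two fields per interlaced frame."""
    return frame_rate * 2


if __name__ == "__main__":
    video = telecined_frame_rate(FILM_RATE)
    print(float(FILM_RATE), float(video), float(field_rate(video)))
```

A splitter that counts fields rather than frames would report the doubled rate, which is one possible (unconfirmed) reading of the 59.94 vs 29.97 difference.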
Superb
6th September 2011, 16:47
Not trying to get you down or anything, but why integrate it into ffdshow while LAV Video & LAV CUVID Decoder are the new rising stars around the neighborhood?
Nev (the developer) said he's planning to integrate the two one day (which makes sense; like CoreAVC), and I think it would be wonderful if he'll have a patch available adding SB acceleration as well. It will make it the best video decoder hands down.
What I'm trying to say: think ahead. forward. ffdshow is slowly fading w/ each step LAV Filters take.
I believe the day where codec packs use LAV Filters (instead of Haali & ffdshow) is not that far away.
Or maybe I'm the only one who has noticed it?
pandy
6th September 2011, 17:11
BTW, if someone knows how to copy a frame from the GPU quickly I'd like to know. Since there's no PCIe traffic going on, a solution is bound to be found.
AFAIR from old PCI times (it seems PCIe is only an extension to PCI), reading from a PCI device to memory was much slower than having the PCI device write to memory. If there is a chance to make the PCIe device the transaction initiator, then having the PCIe device write to system memory should IMHO be faster than reading from the device.
egur
6th September 2011, 21:26
Not trying to get you down or anything, but why integrate it into ffdshow while LAV Video & LAV CUVID Decoder are the new rising stars around the neighborhood?
Nev (the developer) said he's planning to integrate the two one day (which makes sense; like CoreAVC), and I think it would be wonderful if he'll have a patch available adding SB acceleration as well. It will make it the best video decoder hands down.
What I'm trying to say: think ahead. forward. ffdshow is slowly fading w/ each step LAV Filters take.
I believe the day where codec packs use LAV Filters (instead of Haali & ffdshow) is not that far away.
Or maybe I'm the only one who has noticed it?
My work touches very little of FFDShow's own code. I created a separate DLL that doesn't link with FFDShow or any of its components. FFDShow works very well (as far as I've noticed anyway) and it was a good starting point for me as it doesn't change all the time (actually it does change, but with very short merge times on my part). Porting to LAV should be easy, but one thing at a time.
Superb
6th September 2011, 21:38
That's great news. Btw, you might wanna look at VLC's git repository (http://git.videolan.org/?p=vlc.git;a=tree)... They use DXVA2 acceleration and copy the frames back too. (under modules\codec\avcodec\dxva2.c)
CruNcher
7th September 2011, 09:37
@Egur
samples coming, those that crash explicitly with Lav Splitter only (much more but i guess most of those crashes come from the same issue)
http://www.mediafire.com/?94f02bvzqhask37 <- Crash on Load with Lav Splitter
http://www.mediafire.com/?aocp4j26pj6i2qw <- Crash in the middle of playback
Here is something else (not so explosive but should be looked @ anyways):
http://www.mediafire.com/?7ob1wsdt1aon1ou <- Sync issue with Lav Audio
though please don't fix those if that could potentially mean problems for other splitter (or if you think it could) but talk with nevcairiel then first :)
Gser
7th September 2011, 12:28
Well it seems this isn't supported on core i7 860 as the gfx driver won't install.
nevcairiel
7th September 2011, 12:38
though please don't fix those if that could potentially mean problems for other splitter (or if you think it could) but talk with nevcairiel then first :)
Since basically all other codecs play fine, it's doubtful at best.
I do things the way I think they are meant to work, not how old stuff was done. It's the only way to break the cycle of old bugs being re-introduced in every new component, just because they took something old as a template.
By doing this, I can play a lot more files that just fail on other splitters. If that means ruffling some feathers on some codecs, so be it. :p I provide my own audio and video codecs anyway. :)
With CPUs getting ever so much faster and efficient, the time of hardware video decoders in PCs is nearly over, imho.
The only thing missing really is a good way to use the GPU for deinterlacing without relying on EVR. Thats basically the only reason i still use LAV CUVID myself, for the deinterlacing (and interlaced VC-1 decoding)
Blight
7th September 2011, 13:53
nev:
I agree that it's more important for videos to play correctly than to support buggy code. This is especially true when dealing with new decoders that the dev. is still active.
With regards to CPU vs. Hardware Accel, you're only right on the desktop. With laptops/tablets/cellphones, hardware acceleration allows for a longer battery life.
sneaker_ger
7th September 2011, 14:04
Well it seems this isn't supported on core i7 860 as the gfx driver won't install.
Core i7-860 has the old Lynnfield architecture, not Sandy Bridge, so it was to be expected.
nevcairiel
7th September 2011, 14:12
With laptops/tablets/cellphones, hardware acceleration allows for a longer battery life.
With tablets and cellphones that may be true, however for laptops i'm not 100% sure. Maybe in this generation thats still true, but for the future....
For CPUs, one key factor is also getting more efficient, while GPUs are apparently always going for power. On a desktop PC, the power usage difference today between DXVA2 and a software codec isn't all that big to begin with, so if the CPU gets faster and more efficient at the same time, there might be a point where the power argument is invalid (on PC/Laptop parts - SoC parts for tablets and phones are still far away from that).
Anyhow, i have quite some hope for Intels future CPUs, the Tri-Gate Transistors will be quite a nice boost both in performance and efficiency.
CruNcher
7th September 2011, 17:32
With tablets and cellphones that may be true, however for laptops i'm not 100% sure. Maybe in this generation thats still true, but for the future....
For CPUs, one key factor is also getting more efficient, while GPUs are apparently always going for power. On a desktop PC, the power usage difference today between DXVA2 and a software codec isn't all that big to begin with, so if the CPU gets faster and more efficient at the same time, there might be a point where the power argument is invalid (on PC/Laptop parts - SoC parts for tablets and phones are still far away from that).
Anyhow, i have quite some hope for Intels future CPUs, the Tri-Gate Transistors will be quite a nice boost both in performance and efficiency.
~5W (SB decoder) vs ~12-15W (the best software decoders) is a difference, also for normal Blu-ray playback, where you would have to add the whole player overhead (Java, decryption tasks (not the decryption itself)) too; it adds up :)
And yep Tri-Gate will push that further down nearer to Soc Decoder :)
egur
7th September 2011, 19:03
New version released!
Download version 0.11 alpha:
32 bit http://www.multiupload.com/FCBQAAARUI
64 bit http://www.multiupload.com/6O3BXXXPAC
Revision history:
v0.11:
* Fixed skipping issues. Seeks are now instant.
* Fixed handling of sequence header for all supported formats. Fixes image corruption in some clips.
* Created 64bit version. Very limited testing was done with this one.
egur
7th September 2011, 19:05
CruNcher, I'll look into the crashes tomorrow. Thanks a lot for your help.
egur
7th September 2011, 20:33
...
The only thing missing really is a good way to use the GPU for deinterlacing without relying on EVR. Thats basically the only reason i still use LAV CUVID myself, for the deinterlacing (and interlaced VC-1 decoding)
I've considered adding HW deinterlacing to the decoder, it's not too complicated. But the extra copying I'll have to do, renders this solution a bad one (ATM).
If you want, I can export the D3D surface w/o copying and apply some post processing on it including DI. Enabling video post processing is high on my list after root causing the current bugs.
CruNcher
9th September 2011, 16:23
@egur
the 64 bit version doesn't work it falls back to other decoder in the directshow chain if ffdshow-quicksync is selected for the format (libavcodec works)
Here is another stream that has problems with ffdshow-quicksync
http://www.mediafire.com/download.php?cla9ncy0m1tb89w <-stops @ start no playback possible
egur
9th September 2011, 16:27
@egur
the 64 bit version doesn't work it falls back to other decoder in the directshow chain if ffdshow-quicksync is selected for the format (libavcodec works)
I did very limited testing - only in GraphEdit, where it worked for a few clips. I'll try mpc-hc x64. Any other players to test? BTW, what was the setup (filters, content, etc.) so I can reproduce quickly?
BTW, I reproduced the crashes with LAV splitter on your samples but didn't have the time to debug yet.
CruNcher
9th September 2011, 16:40
MPC-HC 64 Bit 3704
MPC-HC splitter standalone 64 bit (not internal, though should be the same as MPC-HC 3704 internal)
Lav Splitter 0.35 64 bit
Lav Audio 0.35 64 Bit
Renderer: Stability testing EVR (default)
Renderer: Shader Processing tests: EVR-CP
MPC-HC 64/32 3704 + standalone filters (binaries) can be found here http://xhmikosr.1f0.de/index.php?folder=bXBjLWhj
ffdshow 64 bit i used to replace your quicksync components with was http://sourceforge.net/projects/ffdshow-tryout/files/SVN%20builds%20by%20clsid/64-bit%20builds/ffdshow_rev3978_20110825_clsid_x64.exe/download
Most important dshow players are based on these components anyways ;)
you could also test with ongoing AVsplitter http://avsplitter.avmedia.su/en it's like lav splitter based on libavformat (uni*), MPC-HC is native windows based code ;)
Nothing connects to the ffdshow-quicksync 64 bit decoder via MPC-HC 64 bit internal or standalone 64 bit filters (neither container nor format matters) :(
With 32 bit and the same framework there are no connection issues.
Btw, here is a result from Intel's MFT decoder (copy overhead):
Renderer: Enhanced Video Renderer (Media Foundation)
Decoder: Intel® Hardware H.264 Decoder MFT
Decoder Device: ModeH264_VLD_NoFGT
Processor Device: ProgressiveDevice
Time: 00:05.685
Average FPS: 177,130
Min/Max FPS: Min: 170 Max: 178
CPU Usage (%): Avg: 36 Min: 33 Max: 38
For comparison, DXVA2 with no copy overhead:
Renderer: Enhanced Video Renderer (Media Foundation)
Decoder: Microsoft H264 Video Decoder MFT
Decoder Device: ModeH264_VLD_NoFGT_ClearVideo
Processor Device: ProgressiveDevice
Time: 00:02.238
Average FPS: 367,161
Min/Max FPS: Min: 343 Max: 386
CPU Usage (%): Avg: 09 Min: 07 Max: 12
In direct comparison, CyberLink's DXVA2:
Renderer: Enhanced Video Renderer (DirectShow)
Decoder: CyberLink Video Decoder
Decoder Device: ModeH264_VLD_NoFGT_ClearVideo
Processor Device: ProgressiveDevice
Time: 00:02.655
Average FPS: 379,183
Min/Max FPS: Min: 368 Max: 383
CPU Usage (%): Avg: 03 Min: 02 Max: 04
Current ffdshow-quicksync (copy overhead):
Renderer: Enhanced Video Renderer (DirectShow)
Decoder: ffdshow Video Decoder
Decoder Device: -
Processor Device: ProgressiveDevice
Time: 00:22.403
Average FPS: 44,948
Min/Max FPS: Min: 44 Max: 45
CPU Usage (%): Avg: 24 Min: 23 Max: 25
egur, you should ask the guys who made the MFT decoder (I guess they are part of the driver team and/or the SDK team) how they optimized performance down to that slightly higher overhead, though I would say that is currently as far as you could get in performance optimization with ffdshow-quicksync.
For comparison, here is the libavcodec decoder's efficiency on the 4 cores :)
Renderer: Enhanced Video Renderer (DirectShow)
Decoder: LAV Video Decoder
Decoder Device: -
Processor Device: ProgressiveDevice
Time: 00:03.456
Average FPS: 291,299
Min/Max FPS: Min: 285 Max: 285
CPU Usage (%): Avg: 83 Min: 83 Max: 83
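To put the runs above side by side, here is a minimal Python sketch that reads the reported average FPS figures and expresses each one relative to the ffdshow-quicksync run. The numbers are copied from the benchmark output above, with the decimal comma read as a decimal point; the short labels and the helper names `parse_fps` and `speedups` are made up for illustration.

```python
# Average FPS values as printed by the benchmark runs quoted above.
# The tool uses a decimal comma, so "177,130" is read as 177.130 fps.
REPORTED_AVG_FPS = {
    "Intel H.264 Decoder MFT (copy overhead)": "177,130",
    "Microsoft H264 Video Decoder MFT (DXVA2)": "367,161",
    "CyberLink Video Decoder (DXVA2)": "379,183",
    "ffdshow-quicksync (copy overhead)": "44,948",
    "LAV Video Decoder (libavcodec)": "291,299",
}

BASELINE = "ffdshow-quicksync (copy overhead)"


def parse_fps(text: str) -> float:
    """Convert a decimal-comma number such as '44,948' to a float."""
    return float(text.replace(",", "."))


def speedups(baseline: str = BASELINE) -> dict:
    """Return each decoder's average FPS divided by the baseline's."""
    base = parse_fps(REPORTED_AVG_FPS[baseline])
    return {name: parse_fps(v) / base for name, v in REPORTED_AVG_FPS.items()}


if __name__ == "__main__":
    for name, ratio in sorted(speedups().items(), key=lambda kv: kv[1]):
        print(f"{ratio:5.2f}x  {name}")
```

By this measure the Intel MFT run is roughly 3.9x the ffdshow-quicksync run, and the CyberLink DXVA2 run roughly 8.4x. This only compares throughput; CPU usage has to be read from the reports separately.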
iwod
12th September 2011, 12:50
CruNcher is an SNSD fan :eek::eek::eek::eek::eek:
And what movie is test.ts?? :cool:
egur
12th September 2011, 21:54
Hi CruNcher,
I've fixed some of the problems and I'll release a new version tomorrow.
• 64bit version is working. The 64 bit version was built wrong - fixed and now it works in MPC-HC x64 (using latest version which is older than yours, BTW).
• Optimized CPU usage (faster copying from GPU to CPU). Changed memcpy to an SSE4.1 implementation - much faster, but I don't have numbers yet (now it's faster than libavcodec on an average 720p movie).
• More stable with LAV splitter. Previous version crashed on several MPEG2 transport streams with AVC1 (H264) video. AVC header parsing is more robust (Media SDK bug or LAV filter bug).
• Added time stamp stabilizing (transport stream issues).
• Added adaptive inverse telecine (29.97 --> 23.976) when the stream reports it, with a fallback to the original frame rate when the content is "normal" (no repeating fields). This works great on the sample you've sent.
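The frame-rate change behind the inverse telecine item can be checked with a few lines of Python. This is only a worked arithmetic sketch, assuming the nominal NTSC rates (30000/1001, about 29.97 fps, for video and 24000/1001, about 23.976 fps, for film) and the usual 3:2 cadence in which four film frames are shown as 3, 2, 3, 2 fields; the name `film_rate` is hypothetical and nothing here is taken from the decoder's code.

```python
from fractions import Fraction

# Nominal NTSC video frame rate (about 29.97 fps).
NTSC_VIDEO = Fraction(30000, 1001)

# 3:2 pulldown: four film frames are displayed for 3, 2, 3 and 2 fields.
PULLDOWN_FIELDS = (3, 2, 3, 2)


def film_rate(video_rate: Fraction, cadence=PULLDOWN_FIELDS) -> Fraction:
    """Frame rate after removing the repeated fields of one cadence cycle."""
    fields_per_cycle = sum(cadence)                          # 10 fields
    video_frames_per_cycle = Fraction(fields_per_cycle, 2)   # 5 video frames
    film_frames_per_cycle = len(cadence)                     # 4 film frames
    return video_rate * film_frames_per_cycle / video_frames_per_cycle


if __name__ == "__main__":
    rate = film_rate(NTSC_VIDEO)
    print(rate, float(rate))
```

Four film frames occupy the time of five video frames, so the output rate is 4/5 of the input rate.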
BTW, using Shader/GPGPU video processing is asking for trouble with the HD2000/3000. It's not comparable to the mainstream or high end cards. Even the simple Haali Video Renderer produces 7(!) fps (720p to a little higher resolution) on my laptop regardless of decoder.
If you can point me to some VC1 clips, I'd appreciate it.
egur
12th September 2011, 21:58
That's great news. Btw, you might wanna look at VLC's git repository (http://git.videolan.org/?p=vlc.git;a=tree)... They use DXVA2 acceleration and copy the frames back too. (under modules\codec\avcodec\dxva2.c)
Thanks!
I looked at the VLC code and found out they use an SSE4.1 instruction to copy from the GPU memory. I had to rewrite it using SSE4 intrinsics so 64 bit compilations would work. Results are nice; now I'm always faster than libavcodec on 720p (and north) videos.
ajp_anton
12th September 2011, 23:11
What exactly does this do?
With all this talk about copying frames from GPU to main memory, I get the impression that it's kind of like Nvidia's "CUDA" decoding, but it doesn't seem to be working properly.
CPU usage on a ~30Mbit 1080p video (i7-2600K):
"Quicksync": 10%
ffdshow (libavcodec): 7%
LAV: 6%
DXVA (MPC-HC): 0%
nevcairiel
13th September 2011, 06:37
Thanks!
I looked at the VLC code and found out they use an SSE4.1 instruction to copy from the GPU memory. I had to rewrite it using SSE4 intrinsics so 64 bit compilations would work. Results are nice; now I'm always faster than libavcodec on 720p (and north) videos.
Yeah that SSE 4.1 instruction is great for this task. Intel really knows what they're doing. :)
Blight
13th September 2011, 10:38
anton:
This is exactly what this is, the Intel sandybridge equivalent of nvidia's "CUDA" decoding.
It's an initial build, things will get better as more content is tested.
nevcairiel
13th September 2011, 14:18
I've been thinking about this thing today, and i've been wondering - what exactly does the Media SDK offer over a DXVA2 decoder (assuming you copy the frame back into system ram as well) ?
I'm only interested in actual user visible advantages, i realize coding might be simpler with the SDK, but then DXVA2 works with more GPUs. ;)
PS:
It's not the same as CUDA decoding; CUDA is handled quite differently. As I understand it, the MSDK is just a "wrapper" around DXVA2, hence my question.
egur
13th September 2011, 21:19
I've been thinking about this thing today, and i've been wondering - what exactly does the Media SDK offer over a DXVA2 decoder (assuming you copy the frame back into system ram as well) ?
I'm only interested in actual user visible advantages, i realize coding might be simpler with the SDK, but then DXVA2 works with more GPUs. ;)
PS:
It's not the same as CUDA decoding; CUDA is handled quite differently. As I understand it, the MSDK is just a "wrapper" around DXVA2, hence my question.
As far as I know it should be a more user friendly wrapper. Maybe cleanup stream errors, etc. You know that using DXVA naively doesn't work well.
DXVA is the implementation underneath. Maybe some day, DXVA will be replaced or Media SDK will be enabled on other platforms so using a wrapper speeds porting as well as writing an application. There's also the chance that DXVA will be too limited compared to the HW capabilities and media SDK will wrap another API (I'm guessing here).
There's very little programmable code that runs in the Intel GPU; most of the decoding/encoding/VPP is done by ASIC (fixed function HW). That's why it's so fast even when compared to a 250W GPU.
CruNcher
14th September 2011, 08:39
There's very little programmable code that runs in the Intel GPU; most of the decoding/encoding/VPP is done by ASIC (fixed function HW). That's why it's so fast even when compared to a 250W GPU.
Hmm, though according to another Intel engineer (Francois Piednoel, Senior Performance Analyst at Intel Corp, Santa Clara) it should have been possible to use these functions (execute on them) from your own encoder code, for example, without using the whole Intel QuickSync encoder at all :)
x264 could have benefited from that (direct acceleration on the ASIC) but sadly it never happened.
This "Intel guy" disappeared after Dark Shikari asked him about low level QuickSync API.
Since the original Intel failure, I have learned quite a bit more about the lower-level details, and I'd quite love to explain more, but unfortunately I am now deep into NDA territory. If this means people are going to blame x264 for QuickSync's failings, well, unfortunately there's not much I can legally do about it anymore.
does that mean you are now technically able to allow some parts of x264 encoding to be done by quicksync? If so is this support going to be added?
Maybe yes, probably not. There are some pretty devastating technical limitations.
This could have been a big hit for Intel. Now AMD seems to be more open and is taking the chance to give full support, which is not really surprising seeing that they have nothing like QuickSync yet (which at least can reach x264 superfast quality), or nothing confirmed, only their GPU encoder, which wouldn't be a match for it :D
Egur here are the VC-1 samples:
http://www.mediafire.com/download.php?4m9cb10oms48bv1 <--Frame Interlaced/Progressive VC-1
http://www.mediafire.com/download.php?1uc5b42u55ue280 <-- Field Interlaced VC-1
Both have sync problems (MPC-HC Splitter); the field interlaced one also shows decoding issues. (That was before the silent updates; rechecking with 0.12.)
Nice, evil trees.ts plays wonderfully smooth on EVR Custom even without manually correcting it, with auto interlaced flags sent from ffdshow (perfectly telecined), so other streams still get proper double framerate deinterlacing.
That's something no current DXVA decoder can do on EVR Custom ;) (not without losing double framerate deinterlacing)
With LAV Splitter it fails on EVR Custom though; it only works with MPC-HC Splitter for now :(
egur
14th September 2011, 10:17
New and improved version. Zip files contain documentation, please read.
Download version 0.12 alpha:
32 bit http://www.multiupload.com/5L5NL03997
64 bit http://www.multiupload.com/41UGJ3TQMI
Revision highlights:
v0.12:
* 64bit version is working.
* Optimized CPU usage (faster copying from GPU to CPU)
* More stable with LAV splitter. Previous version crashed on several MPEG2 transport with AVC1 (H264) video.
* Added time stamp stabilizing (transport stream issues).
* Added inverse telecine when stream has the right flags.
CruNcher
14th September 2011, 11:33
@egur
MC.ts is still problematic (decoded incorrectly).
CD.ts is fine, also sync-wise :)
http://www.mediafire.com/?37kyc94d6n22tkf <- stops @ start
Also, I have a sample (in very bad condition) where I don't understand why deinterlacing doesn't work on EVR Custom with ffdshow-quicksync but works fine on EVR (as if something adaptive that needs no flags works on EVR but isn't being used on EVR Custom).
Upload of that one is in progress. Playback, by the way, is fine considering how corrupted it is; only the failing EVR Custom deinterlacing makes me wonder. Several other samples show this behavior too: all are correctly flagged in the bitstream, yet EVR Custom deinterlacing sometimes works and sometimes fails. Comparing directly, it seems to always work for MPEG-2 but to fail for H.264, which is interesting.
Does that mean adaptive deinterlacing works only on EVR, and is it maybe possible to make it usable (from within Intel's drivers) for other renderers like EVR Custom as well?
Ahh, it seems the problem is in ffdshow-quicksync's MBAFF handling :)
Yep, normal interlaced (PAFF) streams get correctly deinterlaced on EVR Custom.
MBAFF streams fail and only get correctly deinterlaced on EVR.
(Hmm, not sure yet, but it seems the telecine fails on smooth cuts (fades); it doesn't feel right on EVR Custom at least.)
Yep, again the EVR results are much better.
ffdshow-quicksync telecine Mpeg-2 EVR:
http://img855.imageshack.us/img855/707/ffdshowquicksynctelecin.png
ffdshow-quicksync Telecine Mpeg-2 EVR Custom (fail):
http://img97.imageshack.us/img97/707/ffdshowquicksynctelecin.png
I guess that will be interesting to compare vs Nvidia :)
egur
14th September 2011, 12:54
CruNcher:
I have a small bug in the memcpy function on 32 bit. You'll see corruption on the right side (a stripe less than 128 pixels wide at the rightmost edge).
I also improved the speed a little bit so I'll release again soon.
I'll check the new clips.
I saw that one of them was heavily corrupted (MC.ts). One played fine (CD.ts).
Do you have any info on MBAFF that can help me?
Also, regarding inverse telecine, what do you mean by slow fades? I don't perform image analysis, I just look at the flags.
If a few frames pass and there's no "repeat field" flag, then I drop out of IVT back to the original frame rate.
I think I'm missing some code that deals with format change during playback. I'll add that too.
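The flag-only behaviour described in this post can be sketched as a tiny state machine. This is a minimal Python illustration under stated assumptions, not the decoder's actual code: the class name `FlagBasedIvt`, the `update` method and the `exit_after` threshold of 5 frames are all hypothetical (the post only says "a few frames"). The threshold has to be larger than 1, because in a normal 3:2 cadence only every other coded frame carries the repeat-field flag.

```python
class FlagBasedIvt:
    """Tracks whether inverse telecine (IVT) should be active.

    Only the per-frame "repeat field" flag is inspected; no image
    analysis is done. The exit threshold is an assumed value.
    """

    def __init__(self, exit_after: int = 5):
        self.exit_after = exit_after      # hypothetical "a few frames"
        self.active = False               # True while IVT is applied
        self.frames_without_flag = 0      # consecutive unflagged frames

    def update(self, repeat_field: bool) -> bool:
        """Feed one frame's flag; return True if IVT is active for it."""
        if repeat_field:
            # A flagged frame (re)enters IVT and resets the counter.
            self.active = True
            self.frames_without_flag = 0
        elif self.active:
            self.frames_without_flag += 1
            if self.frames_without_flag >= self.exit_after:
                # No flag for a while: fall back to the original rate.
                self.active = False
                self.frames_without_flag = 0
        return self.active


if __name__ == "__main__":
    ivt = FlagBasedIvt()
    flags = [True, False, True, False] + [False] * 6
    print([ivt.update(f) for f in flags])
```

Because the decision depends only on flags, content whose flags stop or change around a cut or fade switches modes after the threshold, regardless of what the picture looks like.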
tetsuo55
14th September 2011, 13:40
Hello Eric,
First of all I want to say I am happy to see a release of a more stable DXVA decoder for Sandy Bridge and newer GPUs.
Also i cannot wait for the source to be released so this code can be integrated in arguably superior codecs.
I am a little bit confused though. Historically Intel has been working with Casimir of MPC-HC for integration of its DXVA codecs, has anything changed in this regard?
nevcairiel
14th September 2011, 14:00
LAV Video will get support for decoding through Intels Media SDK sooner or later, be it with Erics help or without.
Like Eric said, he didn't use any magic, he just implemented a decoder based on the publicly available SDK.
egur
14th September 2011, 14:00
Hello Eric,
First of all I want to say I am happy to see a release of a more stable DXVA decoder for Sandy Bridge and newer GPUs.
Also i cannot wait for the source to be released so this code can be integrated in arguably superior codecs.
I am a little bit confused though. Historically Intel has been working with Casimir of MPC-HC for integration of its DXVA codecs, has anything changed in this regard?
This work is my own initiative and it's aligned with Intel's interests as well as the users'.
BTW, 100,000 people work at Intel and I don't know most of them (or Casimir)...
Several groups support companies and open source projects; I'm aware of only a handful of people. I myself work in OEM support for the CPU but this is irrelevant to my little project here.
Source code will be sent if requested. It's meant for everyone to see, modify and use for free in either open or closed source projects.
I think it's a little early to send the source code as it changes quite a bit due to feedback. But if you want it, I can send it to you.
tetsuo55
14th September 2011, 16:49
Thanks for the replies.
I'm not in a rush to see the source code, but as an open-source project manager I would like to see your work on a source management platform and licensed under the GPL as soon as possible.
You can choose any site that you like; github, google code, etc...
Personally i tend to follow the commitlog for projects more than forum chat, as it is more condensed and to the point, plus i get to review the actual code changes. This will also give you the possibility to have people open tickets and attach small samples, etc...
Thanks in advance and good luck with this project!
Do you think this code will at some later point in any way help older DXVA implementations of pre-sandy-bridge hardware?
squid_80
16th September 2011, 17:00
Since ffdshow is licensed under the GPL and you are distributing modified builds of it, I think you are required to make the source available regardless of anyone requesting it or not. At least that's the point of view a certain moderator on this forum took with my work in the past.