View Full Version : Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing
JanWillem32
9th October 2011, 09:50
A virtual monitor driver is new to me. I've seen functions to force enable an analog "tv" output, though. It installs a standard VGA monitor on an adapter and forces it to output.
You can try something else first. If the combination of GetAdapterCount and GetAdapterIdentifier doesn't return the adapter you are looking for, EnumAdapters1 probably will (although I've never used it before): http://msdn.microsoft.com/en-us/library/ff471336%28v=VS.85%29.aspx . I don't know if the DXVA helper function can work on a DXGI-derived device. I've never tried to derive one for functions like that.
CruNcher
9th October 2011, 14:34
Hmm, maybe these can be helpful: http://channel9.msdn.com/Events/BUILD/BUILD2011/SAC-217T http://channel9.msdn.com/Events/BUILD/BUILD2011/HW-220C ? These talks are more targeted at Windows 8, though, so it might be better to look at the PDC 2008 talks and the introduction of Windows 7. Only these are interesting multimedia-wise: http://channel9.msdn.com/Events/PDC/PDC08/PC04 http://channel9.msdn.com/Events/PDC/PDC08/PC05 http://channel9.msdn.com/Events/PDC/PDC08/PC07 . Only the 2011 talks go deep into WDDM and the virtual display driver, though.
egur
9th October 2011, 16:59
New and improved version. The zip files contain the installer and documentation, please read.
Download version 0.15 alpha:
32 bit http://www.multiupload.com/SW88AXIEAR
64 bit http://www.multiupload.com/3QH5R6N6CD
Source code http://www.multiupload.com/GQBEQ161DB
Revision highlights:
v0.15:
* Rewrote time stamp handling code. The decoder now calculates the frame rate if it is missing and corrects for splitters reporting double frame rate for interlaced content. Handles PTS and DTS time stamps. Broken streams that alternate frequently between telecined and interlaced frames are not handled perfectly (yet!).
* Handled unsupported H264 formats by silently reverting to libavcodec within ffdshow. HW acceleration is limited to H264 baseline, main and high profiles. Previous versions would crash on unsupported formats.
* Added support for WMV3 (part of the VC1 HW decoder).
* Various bug fixes and better decoder error handling, as reported by various users for the 0.14 release.
* Cleaned up minor memory leaks.
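The time stamp handling described in the first item can be illustrated with a rough sketch. This is not the decoder's actual code: the function names, the median-based estimate and the 5% tolerance are assumptions made for the example only.

```python
from statistics import median


def estimate_frame_duration(timestamps):
    """Return the median gap (in seconds) between consecutive time stamps."""
    ordered = sorted(timestamps)  # decode order may differ from display order
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    if not gaps:
        raise ValueError("need at least two distinct time stamps")
    return median(gaps)


def corrected_frame_rate(reported_fps, timestamps, interlaced):
    """Pick a frame rate: measure it if missing, halve it if it looks like a field rate."""
    measured_fps = 1.0 / estimate_frame_duration(timestamps)
    if not reported_fps:
        # The stream carries no frame rate, so fall back to the measurement.
        return measured_fps
    # Hypothetical heuristic: a splitter that reports the field rate for
    # interlaced content ends up at about twice the measured frame rate.
    if interlaced and abs(reported_fps - 2 * measured_fps) < 0.05 * reported_fps:
        return reported_fps / 2
    return reported_fps
```

A median is used here so that a few broken or missing time stamps do not skew the estimate; a real decoder would also need to cope with telecined content, which this sketch ignores.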
CruNcher
9th October 2011, 20:51
Egur nice :) also some progress on my QuickSync Decode/Transcode Capture Framework (this is really the nicest piece of Hardware i ever used, Z510 and US15W was already impressive to work with but this kills everything, just thinking about ivy bridge and haswell and more performance @ lower watt geez) :)
http://www.mediafire.com/?fitumy3c9qf3p31
egur
10th October 2011, 14:26
I succeeded in running the QS decoder on the Intel GPU and EVR on a discrete GPU (Radeon HD6950).
Steps to reproduce:
* Connect discrete card to monitor.
* Connect IGP to a second input on the same monitor, wait for the driver to recognize it (might need to manually switch the monitor input on the monitor itself). Win7 extends the desktop to the "new" monitor.
* Switch back to the main input.
* Play video.
* Test setup by messing with controls on the discrete GPU control panel (e.g. lower saturation to zero).
This survived a reboot so it's a one time setup.
Both control panels for the GPUs are now functional.
I'll test this with MadVR today and update my post.
v0.15 is not compatible with this setup (bug - couldn't test this :( ), but next version will support it.
If it's very important, I'll release it ASAP.
nevcairiel
10th October 2011, 15:35
There is one really annoying thing with that setup though. There is no longer a mouse boundary, as your desktop permanently expands to the second screen, even if it's not selected as an active input.
egur
10th October 2011, 15:49
There is one really annoying thing with that setup though. There is no longer a mouse boundary, as your desktop permanently expands to the second screen, even if it's not selected as an active input.
True - an ugly hack. Until a SW solution is found, this can be good for testing/evaluation.
ajp_anton
10th October 2011, 19:37
But what about falling back to another decoder (not just libavcodec within ffdshow) for unsupported streams?
ffdshow can't output some of them directly so it converts them to RGB. LAV video works better there, and is also faster.
egur
10th October 2011, 22:14
But what about falling back to another decoder (not just libavcodec within ffdshow) for unsupported streams?
ffdshow can't output some of them directly so it converts them to RGB. LAV video works better there, and is also faster.
My decoder only outputs NV12 and decodes several types of streams. If it fails to initialize for any reason ffdshow will choose the default internal decoder - usually libavcodec.
Output conversion (raw video) has nothing to do with my code, it's negotiated with the filter connected downstream. The downstream filter decides the raw format.
What do you mean by LAV is faster? What scenario?
nevcairiel
11th October 2011, 06:31
He wants to use LAV for 10bit H264 and other formats not compatible with hardware decoding, because ffdshow has some limitations decoding those
My short answer would be to wait until LAV supports Intel MSDK as well..... :)
CruNcher
11th October 2011, 08:05
egur like i said before make it switchable so the user can decide to either fallback to libav or the dshow chain directly (just ignore the connection if switch is set) ;)
ajp_anton
11th October 2011, 22:01
My decoder only outputs NV12 and decodes several types of streams. If it fails to initialize for any reason ffdshow will choose the default internal decoder - usually libavcodec.
Output conversion (raw video) has nothing to do with my code, it's negotiated with the filter connected downstream. The downstream filter decides the raw format.
What do you mean by LAV is faster? What scenario?
What I'm after:
Quicksync compatible? Use it.
Not compatible? Don't use ffdshow at all, use LAV instead.
Why?
With 10-bit video, LAV video is faster, and ffdshow can't output either 10-bit or 4:4:4 without going to 8-bit or RGB.
QS in LAV when? =)
egur
11th October 2011, 23:16
What I'm after:
Quicksync compatible? Use it.
Not compatible? Don't use ffdshow at all, use LAV instead.
Why?
With 10-bit video, LAV video is faster, and ffdshow can't output either 10-bit or 4:4:4 without going to 8-bit or RGB.
QS in LAV when? =)
Now it's clear. I'll add a checkbox option to ffdshow's codec page to decline a connection in such cases. Default behavior will be to fall back to libavcodec or another internal decoder.
Does this have any relevance to other formats (VC1, MPEG2)?
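The behavior being discussed (use QuickSync when possible, otherwise either fall back internally or decline the connection so another filter such as LAV can be picked) could be sketched as below. The names `Action` and `choose_decoder` and the `decline_unsupported` flag are hypothetical stand-ins for the proposed checkbox, not ffdshow's real API.

```python
from enum import Enum


class Action(Enum):
    USE_QUICKSYNC = "quicksync"
    USE_INTERNAL = "internal decoder (usually libavcodec)"
    DECLINE = "decline connection"


def choose_decoder(hw_supported: bool, decline_unsupported: bool = False) -> Action:
    """Decide what to do with a stream.

    hw_supported: True if the QuickSync decoder initialized for this stream.
    decline_unsupported: hypothetical user option; the default (False) keeps
    the existing fall back to the internal decoder.
    """
    if hw_supported:
        return Action.USE_QUICKSYNC
    if decline_unsupported:
        # Refusing the connection lets the graph builder try the next filter.
        return Action.DECLINE
    return Action.USE_INTERNAL
```

Keeping the fall back as the default means existing setups behave as before; only users who tick the option get the "don't use ffdshow at all" path.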
CruNcher
12th October 2011, 00:19
Now, it's clear. I'll add a checkbox option for ffdshow's codec page to decline a connection in such cases. Default behavior will be fall back to libavcodec or other internal decoder.
Does this have any relevance to other formats (VC1, MPEG2)?
For the Mpeg-2 Studio Profile fallback :)
kieranrk
12th October 2011, 07:26
For the Mpeg-2 Studio Profile fallback :)
Such a thing does not exist.
CruNcher
12th October 2011, 07:34
Such a thing does not exist.
in layman terms it does anyways 4:2:2 High Level ;)
egur
12th October 2011, 22:05
in layman terms it does anyways 4:2:2 High Level ;)
Can you supply a short MPEG2 4:2:2 clip for testing?
TPoise
14th October 2011, 03:48
I saw the earlier post about VC1 video corruption. Just wanted to add my two cents along with a sample file and pic.
Using the v0.15 alpha version.
Intel HD3000 (Core i7-2600QM)
Windows 7 SP1
Intel Display Driver v8.15.10.2342 (the latest according to Dell)
Link to a clip (http://www.megaupload.com/?d=RS22AQBF)
http://www.legacygeeks.com/images/florida_corrupt.png
nevcairiel
14th October 2011, 06:12
AVC1 is not VC1 .. MS really used a confusing name there. :)
TPoise
15th October 2011, 02:53
AVC1 is not VC1 .. MS really used a confusing name there. :)
Do I have a legitimate issue then? AVC1 is what shows up for any H.264 source I use. I am obviously a newbie.
nm
15th October 2011, 11:13
Do I have a legitimate issue then? AVC1 is what shows up for any H.264 source I use. I am obviously a newbie.
It may be a legitimate issue, but not the VC-1 problem that was described earlier. H.264 (sometimes signaled by fourcc AVC1) is not VC-1.
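To make the naming confusion concrete, here is a small lookup sketch. The table is illustrative and far from exhaustive, and `codec_family` is a made-up helper name; WMV3 is grouped with VC-1 only because the release notes above say it is handled by the VC1 HW decoder.

```python
# Illustrative, non-exhaustive FourCC tables.
H264_FOURCCS = {"AVC1", "H264", "X264"}
VC1_FOURCCS = {"WVC1", "WMV3"}


def codec_family(fourcc: str) -> str:
    """Map a FourCC tag to a codec family, ignoring case and padding."""
    tag = fourcc.strip().upper()
    if tag in H264_FOURCCS:
        return "H.264"
    if tag in VC1_FOURCCS:
        return "VC-1"
    return "unknown"


if __name__ == "__main__":
    for tag in ("AVC1", "WVC1"):
        print(tag, "->", codec_family(tag))
```

So a file reported as AVC1 goes down the H.264 path, and any corruption seen there is a separate issue from the VC-1 one.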
CruNcher
15th October 2011, 11:59
@Egur
im a little confused i got a I5-2400 shouldn't that be a GT1 ?
http://gpuz.techpowerup.com/11/10/15/2g2.png
A lot of this data seems to make no sense, and the clocks seem to be detected wrongly :(
Though it has a GPU usage display for Sandy Bridge now. I wonder whether it's the same sensor the OS (Vista/7) is using from DWM or one supplied by the Intel driver :)
Looks different (almost half more utilization, compared to the OS sensor) does the Sensor include the DSP Decoding ?
Nvidia for example is strictly differentiating here between GPU and VPU load i guess would be good if Intel does it as well:
So if the GPU load on DWM is GPU only then it would mean the DSP is loaded round about 30% or is that Sensor pure GPU EU load (Bicubic PS scaling is at work here) ?
http://img811.imageshack.us/img811/2912/gpuusage.png
Correct GPU Clock:
http://img511.imageshack.us/img511/5748/correctgpuclock.png
@Egur
is it possible to update only the GPU Bios part directly via the IME ?
egur
15th October 2011, 18:47
I saw the earlier post about VC1 video corruption. Just wanted to add my two cents along with a sample file and pic.
Using the v0.15 alpha version.
...
Intel Display Driver v8.15.10.2342 (the latest according to Dell)
I couldn't reproduce the corruption on v0.15 or my not yet released dev build. Tried 64 and 32 bit.
Maybe it's the old driver. I test on 2509 (latest) and 2372 (April) drivers.
You can try installing the latest generic Intel driver (2509 or newer) from the Intel website. I'm not sure what's the difference between the standard driver and the Dell driver, but it's usually quite safe to upgrade. If new driver is not working well, reinstall Dell's driver.
TPoise
15th October 2011, 21:07
I couldn't reproduce the corruption on v0.15 or my not yet released dev build. Tried 64 and 32 bit.
Maybe it's the old driver. I test on 2509 (latest) and 2372 (April) drivers.
You can try installing the latest generic Intel driver (2509 or newer) from the Intel website. I'm not sure what's the difference between the standard driver and the Dell driver, but it's usually quite safe to upgrade. If new driver is not working well, reinstall Dell's driver.
I did the upgrade to the generic 2509 driver. Still see the corruption. One more thing that I failed to say in my original post--this laptop has a discrete Nvidia GT525M, so it uses NVidia Optimus. I have the Global Settings set to use "Integrated Graphics" and I can confirm the GPU usage on the discrete Nvidia card is at 0%.
egur
15th October 2011, 21:24
I did the upgrade to the generic 2509 driver. Still see the corruption. One more thing that I failed to say in my original post--this laptop has a discrete Nvidia GT525M, so it uses NVidia Optimus. I have the Global Settings set to use "Integrated Graphics" and I can confirm the GPU usage on the discrete Nvidia card is at 0%.
I've frame stepped at the 3 times the Florida logo appears in the clip, and no corruption. I tried two different splitters as well...
I've noticed AVC1 corruption on ts files during seeks or at the start of the clip, but this clip doesn't show any of these artifacts.
Can anyone else confirm the corruption?
egur
15th October 2011, 22:38
im a little confused i got a I5-2400 shouldn't that be a GT1 ?
GT1 it is. The process is 32nm not 45 and it doesn't support DirectX 11 (GT supports DX10.1). Also it says it doesn't support OpenCL and I'm quite sure it does.
Looks different (almost half more utilization, compared to the OS sensor) does the Sensor include the DSP Decoding ?
Nvidia for example is strictly differentiating here between GPU and VPU load i guess would be good if Intel does it as well:
So if the GPU load on DWM is GPU only then it would mean the DSP is loaded round about 30% or is that Sensor pure GPU EU load (Bicubic PS scaling is at work here) ?
Not sure what you mean exactly, but HW decoding and most of the post processing are in fixed function and don't register as GPU load. Also, I'm not sure how accurate these measurements are.
@Egur
is it possible to update only the GPU Bios part directly via the IME ?
Don't know and probably not a good idea unless it's specifically supported. The BIOS image is made out of many parts that get validated as a whole. Do you have VBIOS issues that a new VBIOS version corrects?
CruNcher
16th October 2011, 08:44
I've frame stepped at the 3 times the Florida logo appears in the clip, and no corruption. I tried two different splitters as well...
I've noticed AVC1 corruption on ts files during seeks or at the start of the clip, but this clip doesn't show any of these artifacts.
Can anyone else confirm the corruption?
Nope no issues or corruption besides this stream is awful quality to begin with and totally over inlooped ;)
GT1 it is. The process is 32nm not 45 and it doesn't support DirectX 11 (GT supports DX10.1). Also it says it doesn't support OpenCL and I'm quite sure it does.
Yeah, lots of issues; others do it much better than Wizzard currently ;) I'll post a bug report
Not sure what you mean exactly, but HW decoding and most of the post processing are in fixed function and don't register as GPU load. Also, I'm not sure how accurate these measurements are.
Yes i know but the load should be measurable there also shouldn't it, especially i have doubts that no one @ intel need those load data for almost live playback measurements ;) ?
What would you suggest to measure the different GT1/2 states (GPU,DSP,Memory) ?
Im gonna ask Wizzard where he gets those measurement from but i guess it's a NDA thing (well see)
Don't know and probably not a good idea unless it's specifically supported. The BIOS image is made out of many parts that get validated as a whole. Do you have VBIOS issues that a new VBIOS version corrects?
Nope just asked though i also didn't find any Dump of a Intel GT1 bios yet :P
And indeed i would also see no reason to update i mean this Windows Intel system is more stable then anything i used before ever :)
I also did a small test to confirm the GPU-Z measurements
http://img7.imageshack.us/img7/7927/gpuzcorrect.png
looks more accurate indeed (the feeling here is that the PS is pushing the GT1 over the top and it crawls to its feet @ 1080p). I wonder, though, why the headroom for the OS DWM measurement is almost exactly 50% more :D Somehow I guess if I get that to 100% too, its latency is gonna explode and Aero is gonna error or turn off ;)
Overloading the GT1 has a effect on the overall Performance, but the OS DWM measuring doesn't seem to change anymore even with more load ? :)
No load = 6 seconds
http://img249.imageshack.us/img249/7703/noengine1t.png
ffdshow-quicksync + PS (EVR-CP PS Bicubic) + Pre-Resize Sharpen Complex 2 @ 1080p = 12 seconds
http://img851.imageshack.us/img851/8388/renderperformancedrop.png
:devil:
http://img822.imageshack.us/img822/1278/moreload.png
So could be Engine 1 the fixed function load ?
Yup Engine 1 seems to be the fixed function Decoder load, turning it off gives you the same measurement as GPU-Z for the EUs only load :)
Though it seems strange that turning off the PS shifts the load somewhat to Engine 1. Why should the decoder load get lower with PS on?
http://img690.imageshack.us/img690/4716/fixedloadchange.png
Can you supply a short MPEG2 4:2:2 clip for testing?
Sorry missed that http://www.megaupload.com/?d=V93PLAO2
Also try to make sure that when your ffdshow falls back to libmpeg2 for Mpeg-2 Studio, native YUY2 output gets priority, to keep CPU load as low as possible @ playback :D (ffdshow's colorspace conversions aren't optimal in performance, so if they can be avoided, why force something like YV12 or NV12 as is currently done?). Currently it does YUY2->YV12 by default, which seems crazy and only eats performance :(
You are optimizing for a Intel Framework here so if the Renderer does a better job why do a extra (slow) conversion (NV12,YV12) where it isn't needed ;)
If you do it like said it would also beat the current Lav Video (at least in Power Consumption) in native YUY2 :)
Of course there is also the Hard way to optimize the Colorspace Conversion ASM for Intel SB entirely (to gain really the last drop) :D
Mainconcept:
YV12 = 154 FPS (~9% @ 59.95 fps)
Lav Video:
YV12 = 118 FPS (~13% @ 59.95 fps)
YUY2 = 150 FPS (~12% @ 59.95 fps)
FFdshow-Quicksync (libmpeg2):
YV12 = 96 FPS (~13% @ 59.95 fps)
YUY2 = 110 FPS (~10% @ 59.95 fps)
FFdshow-Quicksync (libavcodec):
YV12 = 134 FPS (~15% @ 59.95 fps)
YUY2 = 145 FPS (~13% @ 59.95 fps)
Interesting: libmpeg2 seems more efficient in power consumption compared to libavcodec, which has higher performance though?
TPoise
16th October 2011, 16:39
I've frame stepped at the 3 times the Florida logo appears in the clip, and no corruption. I tried two different splitters as well...
I've noticed AVC1 corruption on ts files during seeks or at the start of the clip, but this clip doesn't show any of these artifacts.
Can anyone else confirm the corruption?
Does anybody have a laptop that can confirm (or deny) the corruption? I wonder if it is an NVidia Optimus issue. I'm not privy to the full details of how Optimus works, but if it intercepts direct calls to either the HD3000 GPU or the Nvidia GT525M GPU then could it possibly cause the corruption that I'm seeing?
FYI, when I switch to DXVA using the HD3000 GPU (not ffdshow), just the basic DXVA hardware rendering as part of Media Player Classic, I see no corruption. Not only do I see no corruption, but my CPU usage is very low (around 1-2%) and temps/voltages are low when measured using CoreTemp.
I also see no corruption when using libavcodec when using the ffdshow filters from egur.
I'm using MPC, 64-bit edition v1.5.2.3456
CruNcher
16th October 2011, 18:09
Optimus can also compress PCI-E transfers to enhance performance (it's proprietary stuff), though it only works on x1 connections and I'm not sure whether it might interfere here. http://forum.notebookreview.com/gaming-software-graphics-cards/418851-diy-egpu-experiences-123.html#post6542661
I hope we hear something soon from what happened to Synergy for Desktops http://vr-zone.com/articles/nvidia-to-launch-desktop-optimus--synergy-at-computex/11946.html :(
And yes, sure, DXVA is going to stress your system less than ffdshow-quicksync does, and when minimum power consumption is your goal it should be preferred for playback. If for any reason flexibility or something else is the goal, then ffdshow-quicksync might be a good solution. You shouldn't use it just because you want to; you should know your goals. Of course, if your goal is to help improve it, that's great :)
vivan
16th October 2011, 20:53
I have acer 3830TG (i5-2410M + nVidia 540M = Optimus).
I'm using 32-bit (since there is no sense in using x64 version) versions of MPC-HC and decoder from here - can't reproduce your problem. I even installed x64 versions - everything is still ok...
So, as for me, everything works perfectly. The only problem I'm experiencing is with video with variable framerate - audio/subs are out of sync :(
egur
16th October 2011, 21:58
I'm using MPC, 64-bit edition v1.5.2.3456
Any special reason to use 64 bit?
Can you try the 32 bit version? My decoder will be a little faster in 32 bit as I've optimized the copy function in ASM. The 64 bit build uses intrinsic functions but the compiler isn't 100% efficient using them.
Boltron
17th October 2011, 17:59
What performance monitor utility are you using that shows Summary, CPU, Memory GPU and also the GPU Engine History?
LoRd_MuldeR
17th October 2011, 20:35
ProcessExplorer? (http://technet.microsoft.com/en-us/sysinternals/bb896653)
And, if you want a more detailed analysis on how many CPU cycles have been spent in each function, you could have a look at Code Analyst:
http://developer.amd.com/tools/CodeAnalyst/pages/default.aspx
(Although it is an AMD tool, it works on Intel CPU's just as well. Just make sure you use it with a Debug build, if you want function names!)
Boltron
17th October 2011, 21:27
Wow, ProcessExplorer sure looks different from the last time I used it. This is so cool. Thx!
TPoise
18th October 2011, 01:22
Any special reason to use 64 bit?
Can you try the 32 bit version? My decoder will be a little faster in 32 bit as I've optimized the copy function in ASM. The 64 bit build uses intrinsic functions but the compiler isn't 100% efficient using them.
Used 32-bit and get the same video corruption.
egur
19th October 2011, 13:48
I managed to solve the multi GPU problem without cables. You'll need v0.16 or newer to make this work.
1) You need to set up another (fake) screen. Right click on desktop->screen resolution.
2) Click the Detect button. Unconnected screens will appear.
3) Extend desktop to a VGA connection on the Intel GPU (screen 2 in the image).
4) Drag the 2nd screen to the corner of the primary screen so the mouse boundaries of the primary screen will remain (almost) the same.
5) Click OK/Apply. A reboot is recommended.
http://img14.imageshack.us/img14/6519/displaysettings.png
6) Open your favorite player and select MadVR or another GPU-demanding renderer to test the setup. You can test further by selecting EVR as renderer, opening the control panel for your AMD/Nvidia GPU and overriding the color settings (e.g. kill the saturation).
Here's a working setup
http://img832.imageshack.us/img832/3551/zpffdshowquicksyncmadvr.png
CruNcher
19th October 2011, 14:08
awesome i just love NT6 now we can mix input output like crazy without needing any 3rd party solutions great work egur :)
i wonder though is DXVA also working or does the decoder need specifically to support this ?
And what happens if you open a DXVA session and where does it get rendered ?
nevcairiel
19th October 2011, 14:53
i wonder though is DXVA also working or does the decoder need specifically to support this ?
And what happens if you open a DXVA session and where does it get rendered ?
You can't easily transfer GPU textures between devices, so if you use DXVA, it needs to be rendered on the same device that decoded it.
Besides, if you already use DXVA, why not use the DXVA of your primary video card? :p
egur
19th October 2011, 16:24
New and improved version. The zip files contain the installer and documentation, please read.
Download version 0.16 alpha:
32 bit http://www.multiupload.com/Z4PX2UFGB4
64 bit http://www.multiupload.com/QH5ZZXINCQ
Source code http://www.multiupload.com/06IZWGH4T0
Revision highlights:
v0.16:
* Support multi GPU setups. Now the decoder can run on separate HW than the renderer, even without connecting the Intel GPU to a screen. See Multi GPU below for details.
* This version will be the first version on SourceForge.
* Updated to ffdshow build 3996.
* Some fixes to the timestamp code. Now supporting streams with no frame rate.
* Fixed several aspect ratio issues.
* Very initial support for DVD playback. Menus are not displayed right yet. WIP. Recommend not to use except for testing purposes.
* Changed mechanism for handling flush & seek event. Code is faster and more robust. A critical stage for playing DVDs.
* Added a new callback for FFDShow’s internal decoders – EndFlush. This is needed for DVD playback. Other decoders do not need to implement it.
* Enhanced FFDShow’s code with a faster memcpy function (SSE2 based). This replaces calling memcpy. The original source code would use ffmpeg to do it, but it crashes on NV12 images.
egur
19th October 2011, 16:45
awesome i just love NT6 now we can mix input output like crazy without needing any 3rd party solutions great work egur :)
i wonder though is DXVA also working or does the decoder need specifically to support this ?
And what happens if you open a DXVA session and where does it get rendered ?
DXVA connections will not cross HW boundaries. Maybe there's a tricky way to do it, but I doubt it's worth the trouble.
My decoder is mostly aimed at low power, but it was a nice problem to solve. I'm not aware of similar solutions.
Since I copy the frames from the GPU to the CPU very quickly, it makes sense in using it with your favorite SW setup. The pipeline is File->CPU->GPU1->CPU->GPU2->Screen.
This opens up a way for fast HW decoding with super strong programmable video processing on a discrete GPU.
I wish Windows 7 was easier to use in sense of utilizing the various HW resources.
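For a feel of how much data the File->CPU->GPU1->CPU->GPU2->Screen pipeline moves, here is a back-of-the-envelope sketch. It assumes tightly packed NV12 frames (12 bits per pixel) with no pitch padding, and counts two crossings of the decoded frame (GPU1->CPU and CPU->GPU2); the function names are made up for the example.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Size of one NV12 frame: a full-size Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def copy_bandwidth_mb_per_s(width: int, height: int, fps: float, copies: int = 1) -> float:
    """Bytes moved per second for `copies` transfers of each frame, in MB/s (1 MB = 1e6 bytes)."""
    return nv12_frame_bytes(width, height) * fps * copies / 1e6


if __name__ == "__main__":
    print(nv12_frame_bytes(1920, 1080))                      # bytes per 1080p frame
    print(copy_bandwidth_mb_per_s(1920, 1080, 60, copies=2))  # MB/s for both crossings
```

A 1080p frame is about 3.1 MB, so 60 fps with two crossings is roughly 373 MB/s. Real surfaces are usually padded, so actual numbers will be somewhat higher.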
Atak_Snajpera
19th October 2011, 17:24
Quicksync in official ffdshow r4000 would be epic :)
ajp_anton
19th October 2011, 17:45
Now, it's clear. I'll add a checkbox option for ffdshow's codec page to decline a connection in such cases. Default behavior will be fall back to libavcodec or other internal decoder.
Does this have any relevance to other formats (VC1, MPEG2)?
What happened?
egur
19th October 2011, 20:33
What happened?
This part wasn't ready for this release. Currently it falls back to libavcodec if the platform can't support QuickSync or for unsupported H264 formats.
If something else is unsupported, ffdshow will decline the connection.
I'll fix this for next release. Hopefully after I integrate into the main ffdshow trunk in sourceforge.
pulbitz
20th October 2011, 16:30
I'm sorry. I don't speak English very well.
audio/video unsync (with Gabest Splitter) sample files.
(2011.09.28) Hyun Young 조현영 _A_ @ Gachon University Festival Celebration Fancam(720p_H.264-AAC).mp4
http://o-o.preferred.fra02s05.v5.lscache1.c.youtube.com/videoplayback?sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Csource%2Cratebypass%2Ccp&fexp=904539%2C914032%2C903119%2C900221&itag=22&ip=121.0.0.0&signature=8C7C37527E0C230FA0A01CFAE0620BFA923D7B6E.8EA9A67F9D7C0A227858F78B0DFBF2ADE9A5CF1B&sver=3&ratebypass=yes&source=youtube&expire=1319148000&key=yt1&ipbits=8&cp=U0hQTlFPVl9FSkNOMF9JSVpBOk12b085N3JnWmY4&id=1089491d982d9386
(2011.10.06) Hyun Young 조현영 _Mach_ @ Gyeonggi University of S&T Festival Fancam(720p_H.264-AAC).mp4
http://o-o.preferred.fra02s05.v7.lscache8.c.youtube.com/videoplayback?sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Csource%2Cratebypass%2Ccp&fexp=904539%2C914032%2C903119%2C900221&itag=22&ip=121.0.0.0&signature=5F64ED712E76CE6F1A91E8CB333257FA14091F93.9D07A528297624A90D5979D530452CFEF44787AE&sver=3&ratebypass=yes&source=youtube&expire=1319148000&key=yt1&ipbits=8&cp=U0hQTlFPVl9FSkNOMF9JSVpBOk12b085N3JnWmY4&id=2130d6979f680c4c
QuickSync = 30.303fps
libavcodec = 29.97xfps
please improve your timestamp code more. :)
JanWillem32
20th October 2011, 21:11
DXVA connections will not cross HW boundaries. Maybe there's a tricky way to do it, but I doubt it's worth the trouble.
My decoder is mostly aimed at low power, but it was a nice problem to solve. I'm not aware of similar solutions.
Since I copy the frames from the GPU to the CPU very quickly, it makes sense in using it with your favorite SW setup. The pipeline is File->CPU->GPU1->CPU->GPU2->Screen.
This opens up a way for fast HW decoding with super strong programmable video processing on a discrete GPU.
I wish Windows 7 was easier to use in sense of utilizing the various HW resources.
DMA access to GPU memory has been around since forever. (http://en.wikipedia.org/wiki/Direct_memory_access for those that are interested.)
Allocating a buffer explicitly in the video memory has always been possible. Proper memory resource management is even a key feature to any graphics rendering engine.
Sharing resources through the DirectX API is relatively new: http://msdn.microsoft.com/en-us/library/windows/desktop/bb219800%28v=vs.85%29.aspx and http://msdn.microsoft.com/en-us/library/windows/desktop/ee913554%28v=vs.85%29.aspx . The usual DXVA helper device for EVR uses a shared handle system to give the main rendering device access to DXVA output surfaces. The extra device runs mostly asynchronously from the main device.
File->CPU->GPU1->GPU2->Screen is completely allowed, but I don't know what would be faster, a render target on GPU1's memory or on GPU2's memory. Making GPU1 render to system memory or doing an extra copy operation from video memory to system memory will most certainly slow things down.
It's actually not the copy operation itself that's an issue. It's usually the wait for the lock operation. Scheduled transfers without locking surfaces/textures in video memory are a lot more efficient.
CruNcher
20th October 2011, 21:54
I got Cyberlink HAM working on Intel. It's basically nothing else than a renderless DXVA (not bound to the renderer), which PotPlayer's DXVA decoder also makes use of.
The big question is: do we really need APIs from every vendor for NT6 if Microsoft integrated the possibility to use renderless DXVA from the beginning, and why integrate each vendor's own if one for all exists (in terms of interoperability)?
DXVA Renderless (supports everyone)
AMD OpenVIdeo (supports AMD)
Intel MediaSDK (supports Intel)
Nvidia Nvcuvid (supports Nvidia)
Is there really such a big performance difference that it would justify implementing each vendor's own for the specific hardware case (or is there even a performance loss from wrapping from a to b)?
egur
20th October 2011, 22:24
I'm sorry. I don't speak English very well.
audio/video unsync (with Gabest Splitter) sample files.
(2011.09.28) Hyun Young 조현영 _A_ @ Gachon University Festival Celebration Fancam(720p_H.264-AAC).mp4
http://o-o.preferred.fra02s05.v5.lscache1.c.youtube.com/videoplayback?sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Csource%2Cratebypass%2Ccp&fexp=904539%2C914032%2C903119%2C900221&itag=22&ip=121.0.0.0&signature=8C7C37527E0C230FA0A01CFAE0620BFA923D7B6E.8EA9A67F9D7C0A227858F78B0DFBF2ADE9A5CF1B&sver=3&ratebypass=yes&source=youtube&expire=1319148000&key=yt1&ipbits=8&cp=U0hQTlFPVl9FSkNOMF9JSVpBOk12b085N3JnWmY4&id=1089491d982d9386
(2011.10.06) Hyun Young 조현영 _Mach_ @ Gyeonggi University of S&T Festival Fancam(720p_H.264-AAC).mp4
http://o-o.preferred.fra02s05.v7.lscache8.c.youtube.com/videoplayback?sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Csource%2Cratebypass%2Ccp&fexp=904539%2C914032%2C903119%2C900221&itag=22&ip=121.0.0.0&signature=5F64ED712E76CE6F1A91E8CB333257FA14091F93.9D07A528297624A90D5979D530452CFEF44787AE&sver=3&ratebypass=yes&source=youtube&expire=1319148000&key=yt1&ipbits=8&cp=U0hQTlFPVl9FSkNOMF9JSVpBOk12b085N3JnWmY4&id=2130d6979f680c4c
QuickSync = 30.303fps
libavcodec = 29.97xfps
please improve your timestamp code more. :)
I know I need to improve the time stamps. Fixed a few things in the v0.16 but there's still more work to do...
The links you've posted are not working - "access denied" for both. You can share very quickly on http://www.multiupload.com
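One arithmetic observation on the reported numbers, offered as a guess rather than a diagnosis: 30.303 fps is exactly what you get from a frame duration of 33 ms, which is the 29.97 fps duration (about 33.37 ms) truncated to whole milliseconds. The sketch below only shows the arithmetic and the resulting drift; whether truncation is the actual cause in the decoder is an assumption.

```python
def fps_from_duration_ms(duration_ms: float) -> float:
    """Frame rate implied by a per-frame duration given in milliseconds."""
    return 1000.0 / duration_ms


def drift_seconds(true_fps: float, assumed_fps: float, video_seconds: float) -> float:
    """How far video runs ahead of audio after `video_seconds` of content
    when frames are paced at assumed_fps instead of true_fps."""
    frames = true_fps * video_seconds
    return video_seconds - frames / assumed_fps


if __name__ == "__main__":
    ntsc = 30000 / 1001                      # 29.97... fps
    truncated_ms = int(1000 / ntsc)          # duration cut to whole milliseconds
    wrong_fps = fps_from_duration_ms(truncated_ms)
    print(truncated_ms, round(wrong_fps, 3))
    print(drift_seconds(ntsc, wrong_fps, 60.0))
```

With these numbers the drift is about two thirds of a second per minute of video, which would be clearly audible as lost A/V sync.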
egur
20th October 2011, 22:42
DXVA Renderless (supports everyone)
AMD OpenVIdeo (supports AMD)
Intel MediaSDK (supports Intel)
Nvidia Nvcuvid (supports Nvidia)
Is there really such a big Performance difference that would justify implementing each vendors own, for the specific hardware case ?
There's a difference in features (mostly related to video processing) and DXVA is very complex and not high level enough.
Hopefully this chaos will converge to a single API at some point. A user friendly API that abstracts enough details while remaining high performing.
Performance is a very important issue in the mobile world - battery life. Every minute of video playback is worth a lot of R&D, validation and enabling resources.
The architecture "war" with ARM (starting with Windows 8) will probably help to push HW acceleration forward on many fronts so small devices can compete with ARM based SOCs.
Nvidia plays both sides of the fence in this war (GPU for x86 platforms as well as ARM CPU maker) so one can expect them to fork out a cross platform API for HW acceleration. This would probably be the best kind of API - abstract the HW completely - no need to be a DirectX expert to do complex stuff.
egur
20th October 2011, 22:49
DMA access to GPU memory has been around since forever. (http://en.wikipedia.org/wiki/Direct_memory_access for those that are interested.)
Allocating a buffer explicitly in the video memory has always been possible. Proper memory resource management is even a key feature to any graphics rendering engine.
Sharing resources through the DirectX API is relatively new: http://msdn.microsoft.com/en-us/library/windows/desktop/bb219800%28v=vs.85%29.aspx and http://msdn.microsoft.com/en-us/library/windows/desktop/ee913554%28v=vs.85%29.aspx . The usual DXVA helper device for EVR uses a shared handle system to give the main rendering device access to DXVA output surfaces. The extra device runs mostly asynchronously from the main device.
File->CPU->GPU1->GPU2->Screen is completely allowed, but I don't know what would be faster, a render target on GPU1's memory or on GPU2's memory. Making GPU1 render to system memory or doing an extra copy operation from video memory to system memory will most certainly slow things down.
It's actually not the copy operation itself that's an issue. It's usually the wait for the lock operation. Scheduled transfers without locking surfaces/textures in video memory are a lot more efficient.
I haven't seen anything like this - two DXVA devices from different GPUs passing surfaces from one to the other?
I can take your word for it but it's probably extremely complicated to accomplish.
In the Intel GPU, I don't think there's any DMA going on when copying surfaces back and forth to the CPU. It's the same memory sitting on the same memory controller. A special SSE4 instruction was introduced in Penryn to address the complex mapping to solve the speed issues.
CruNcher
21st October 2011, 01:02
I just recorded my first 3D Gaming with my Low Latency H.264 Quicksync Encoder Framework it runs rather smooth in the 3D Engine (ID tech 5) at least playable. Entirely on GT1 (Playing + Recording) :D