View Full Version : Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing
egur
8th January 2012, 10:42
So, the MT code involves IA cores, GPU cores, MFX engine or all of them ?
Could you give percentages for each component using MT code?
MT means running two or more threads on the IA cores. Running two or more code paths in parallel.
The QS decoder calls API functions of the Intel Media SDK to utilize the MFX engine. The Media SDK abstracts the communication with the HW much better than DXVA. The GPU cores (EUs) may be involved for some internal operations (I don't know too much about this), but the bulk of the work is done by the MFX engine.
GPU parallelism as well as EU usage is abstracted by the MSDK and may change from generation to generation (or even driver versions).
nevcairiel
8th January 2012, 10:48
1) Increase the throughput required by 60fps clips
Even the most difficult clips you could find decode at 120fps, and typical Blu-rays decode at 300+ fps. I think 60fps clips are fine. :p
You over-estimate what multi-threading means. During normal playback there will be nearly zero difference; you only see it when benchmarking, so it's really not all that great.
CruNcher
8th January 2012, 11:04
when you use quicksync decoding for encoding it can be :)
Eric do you know why the new driver has been removed http://webcache.googleusercontent.com/search?q=cache:FW2q-fo2vqYJ:downloadcenter.intel.com/Detail_Desc.aspx%3FDwnldID%3D20676
Had no issues with it
NikosD
8th January 2012, 11:18
@Egur
Thanks for the info.
The misunderstanding arose from the following.
You used the term "HW decode thread", which usually means decoding that does not happen on the CPU IA cores. CPU decoding is usually referred to as software decoding.
So by "HW decode thread" you actually mean the work that has to be done on the CPU to prepare the data for HW decoding in the MFX engine.
No actual decoding happens in CPU.
@Nevcairiel and Egur
I expect MT code to drop the CPU frequency during playback of 60fps clips from Turbo mode to a much lower frequency, by spreading the load across more cores.
Have you seen for yourselves that even in normal playback of 60fps clips the CPU goes into Turbo mode, increasing power consumption?
Not in pure DXVA mode and not in other clips <60 fps.
Only with FFDShow QS decoder and only in 60fps and above.
CruNcher
8th January 2012, 11:23
NikosD don't you understand the Frame Copy GPU->CPU is pressuring the CPU ?
NikosD
8th January 2012, 11:27
NikosD don't you understand the Frame Copy GPU->CPU is pressuring the CPU ?
Read my previous post again please.
CruNcher
8th January 2012, 11:39
Over here it's not always at full frequency (on the balanced power profile); it fluctuates, as expected, due to the frame copy. Playback is fine with "4 girls" and "5 birds", and absolutely smooth at 2K; the same goes for low-bitrate 4K. Problems get heavy with high-bitrate QFHD :) And yeah, MT might be able to distribute the load more evenly so that the frequency fluctuates less. Keep in mind, though, that frequency itself isn't really costing that much power; the voltage increase is the main factor here. You will not get DXVA-level consumption from ffdshow-quicksync; consumption will always be lower with DXVA, solely due to the frame copy.
egur
8th January 2012, 11:52
@Egur
Thanks for the info.
The misunderstanding arose from the following.
You used the term "HW decode thread", which usually means decoding that does not happen on the CPU IA cores. CPU decoding is usually referred to as software decoding.
So by "HW decode thread" you actually mean the work that has to be done on the CPU to prepare the data for HW decoding in the MFX engine.
No actual decoding happens in CPU.
@Nevcairiel and Egur
I expect MT code to drop the CPU frequency during playback of 60fps clips from Turbo mode to a much lower frequency, by spreading the load across more cores.
Have you seen for yourselves that even in normal playback of 60fps clips the CPU goes into Turbo mode, increasing power consumption?
Not in pure DXVA mode and not in other clips <60 fps.
Only with FFDShow QS decoder and only in 60fps and above.
As CruNcher said, frame copy takes CPU cycles. That's a fact. At 1080p@60 there's a lot to copy, hence the higher CPU usage. A pure DXVA solution will always be faster (unless it's buggy or poorly implemented).
Regarding Turbo: in my tests Turbo wasn't active for the entire duration of playback. If there's a lot of compute/CPU work to be done, the most efficient way to do it is in bursts, not by spreading the workload across time. Idle time after a burst allows power management to kick in. If you're worried that the GPU will lose its power budget to the CPU and thus run at a lower frequency, you may or may not be right. The algorithm for deciding this is not exposed to the public.
Anyway, this is only the start not the end of MT.
@CruNcher: 10x for correcting the typo in the web site link.
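To put the 1080p@60 frame copy mentioned above into numbers, here is a small back-of-the-envelope sketch. It assumes NV12 output (12 bits per pixel) and ignores any surface pitch or height padding the driver may add, so the real figures can be somewhat higher. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// NV12 stores a full-resolution luma plane plus a half-resolution
// interleaved chroma plane: 1.5 bytes per pixel (assumption: no padding).
constexpr std::uint64_t nv12_frame_bytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Bytes per second that the GPU -> CPU frame copy has to move.
constexpr std::uint64_t copy_bytes_per_second(std::uint64_t width,
                                              std::uint64_t height,
                                              std::uint64_t fps) {
    return nv12_frame_bytes(width, height) * fps;
}

// Compile-time check of the unpadded 1080p frame size.
static_assert(nv12_frame_bytes(1920, 1080) == 3110400, "1080p NV12 frame size");

// Prints the copy rate for 1080p at the given frame rate.
void print_copy_rate(unsigned fps) {
    assert(fps > 0);
    const double mb_per_s = copy_bytes_per_second(1920, 1080, fps) / 1e6;
    std::printf("1080p@%u: %.1f MB/s\n", fps, mb_per_s);
}
```

So one unpadded 1080p frame is about 3.1 MB, and 60 of them per second is roughly 187 MB/s that has to be read and written again by the CPU, twice the amount needed at 30fps.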
NikosD
8th January 2012, 12:07
@Egur
Is it possible for the Frame Copy process to be executed in parallel in more than one core? Or is it a strictly serial process ?
Using more cores for the Frame Copy alone could help us compare the behavior of the whole CPU package in different situations during playback with what we have now with a serial Frame Copy.
egur
8th January 2012, 12:51
@Egur
Is it possible for the Frame Copy process to be executed in parallel in more than one core? Or is it a strictly serial process ?
Using more cores for the Frame Copy alone could help us compare the behavior of the whole CPU package in different situations during playback with what we have now with a serial Frame Copy.
Excellent question!
In the Core 2 Duo days, I did a few benchmarks, as I wrote an application that did a lot of memory copying.
My results were:
* Writing memcpy using SSE2 was 2x faster than the standard library version (vs2005).
* Using 2 threads gave almost 2x performance boost. Using more than 2 didn't change anything.
The benchmarks were for large buffers (usually > 1M).
So my copy function was ~4x faster than the standard memcpy.
Today, using an SSE2 copy doesn't change all that much: either VS2010 has a better memcpy or the CPU uArch handles the simpler memcpy better. Regarding threads, I need to test this.
I assume that using 2 threads will help. This is next on my list. I'll make a programmable solution that allows scaling beyond 2 threads.
I'll post the results in this thread.
BTW, parallelizing memcpy is super trivial. Probably the easiest task to make parallel.
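As an illustration of why a parallel memcpy is considered trivial, here is a minimal sketch that splits a large buffer into one contiguous chunk per thread. It is not the decoder's actual copy function: it uses plain `std::memcpy` instead of hand-written SSE code, spawns threads per call instead of keeping a worker pool, and the name `parallel_memcpy` is invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Copies `size` bytes from src to dst using `num_threads` threads
// (the calling thread counts as one). The buffers must not overlap.
void parallel_memcpy(void* dst, const void* src, std::size_t size, unsigned num_threads) {
    assert(num_threads > 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    // One chunk per thread, rounded up to 64 bytes so that neighbouring
    // threads do not write to the same cache line (assumes 64-byte lines
    // and a 64-byte-aligned destination).
    std::size_t chunk = (size + num_threads - 1) / num_threads;
    chunk = (chunk + 63) & ~static_cast<std::size_t>(63);

    std::vector<std::thread> workers;
    for (unsigned i = 1; i < num_threads; ++i) {
        const std::size_t begin = std::min(size, static_cast<std::size_t>(i) * chunk);
        const std::size_t end = std::min(size, begin + chunk);
        if (begin < end) {
            workers.emplace_back([=] { std::memcpy(d + begin, s + begin, end - begin); });
        }
    }

    // The calling thread copies the first chunk itself.
    std::memcpy(d, s, std::min(size, chunk));

    for (std::thread& w : workers) {
        w.join();
    }
}
```

Calling `parallel_memcpy(dst, src, size, 2)` matches the two-thread case described above. Whether the extra thread pays off depends on how close a single core already gets to saturating memory bandwidth, so it needs to be measured on the target machine.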
NikosD
8th January 2012, 13:00
Very good.
So you are into it.
You could also try Intel C++ Compiler, which whenever I used it - last time it was 11.1 version - I remember that it was a lot faster than MS VC++.
egur
8th January 2012, 13:05
Very good.
So you are into it.
You could also try Intel C++ Compiler, which whenever I used it - last time it was 11.1 version - I remember that it was a lot faster than MS VC++.
I support ICL12.1 builds.
Assembly code (copy function) is not subject to optimization by the compiler (as far as I know) and the rest of my code is very light so I doubt any compiler can do noticeably better.
rica
8th January 2012, 14:20
Thats DXVA, not QuickSync
In any case, a log file would be useful.
You're right; it was my mistake.
Today I tried again with LAV 0.44 and ffdshow 4227.
Even though I can see QuickSync as an option and select it, neither LAV nor ffdshow can use it on my Clarkdale.
Here is the debug file:
LAVVideo.ax(tid f40) 28470 : CTransformInputPin::CTransformInputPin
LAVVideo.ax(tid f40) 28470 : CTransformOutputPin::CTransformOutputPin
LAVVideo.ax(tid f40) 28470 : SetMediaType -- in
LAVVideo.ax(tid f40) 28470 : ::CreateDecoder(): Creating new decoder...
LAVVideo.ax(tid f40) 28470 : -> Process is mpc-hc.exe, blacklist: 0
LAVVideo.ax(tid f40) 28470 : CDecQuickSync::Init(): Trying to open QuickSync decoder
LAVVideo.ax(tid f40) 28673 : -> Decoder reports abnormal status
LAVVideo.ax(tid f40) 28673 : -> Init Interfaces failed (hr: 0x80004005)
LAVVideo.ax(tid f40) 28673 : -> Hardware decoder failed to initialize, re-trying with software...
LAVVideo.ax(tid f40) 28673 : Shutting down ffmpeg...
LAVVideo.ax(tid f40) 28673 : Initializing ffmpeg for codec 28
LAVVideo.ax(tid f40) 28673 : -> Processing extradata of 51 bytes
LAVVideo.ax(tid f40) 28673 : -> File extension: .m2ts
LAVVideo.ax(tid f40) 28673 : ff_lockmgr: mutex: 042C7600, op: 1
LAVVideo.ax(tid f40) 28673 : ff_lockmgr: mutex: 042C7600, op: 2
LAVVideo.ax(tid f40) 28673 : -> ffmpeg codec opened successfully (ret: 0)
LAVVideo.ax(tid f40) 28673 : AVCodec init successfull. interlaced: 1
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 0
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 1
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 2
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 3
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 4
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 5
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 6
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 7
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 8
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 9
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 10
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 11
LAVVideo.ax(tid f40) 28673 : ::GetMediaType(): position: 12
LAVVideo.ax(tid f40) 29335 : Trying to connect Pins :
LAVVideo.ax(tid f40) 29335 : <XForm Out>
LAVVideo.ax(tid f40) 29335 : <EVR Input0>
LAVVideo.ax(tid f40) 29335 : ::GetMediaType(): position: 0
LAVVideo.ax(tid f40) 29335 : Trying media type:
LAVVideo.ax(tid f40) 29335 : major type: MEDIATYPE_Video
LAVVideo.ax(tid f40) 29335 : sub type : MEDIASUBTYPE_NV12
LAVVideo.ax(tid f40) 29336 : ::CheckTransform()
LAVVideo.ax(tid f40) 29336 : ::CheckTransform()
LAVVideo.ax(tid f40) 29336 : SetMediaType -- out
LAVVideo.ax(tid f40) 29338 : ::GetMediaType(): position: 0
LAVVideo.ax(tid f40) 29338 : ::GetMediaType(): position: 1
LAVVideo.ax(tid f40) 29338 : ::DecideBufferSize()
LAVVideo.ax(tid f40) 29338 : ::CheckTransform()
LAVVideo.ax(tid f40) 29344 : ::CheckTransform()
LAVVideo.ax(tid f40) 29344 : Connection succeeded
LAVVideo.ax(tid f40) 29374 : ff_lockmgr: mutex: 042C7600, op: 1
LAVVideo.ax(tid f40) 29374 : ff_lockmgr: mutex: 042C7600, op: 2
LAVVideo.ax(tid b88) 30502 : ::CheckTransform()
LAVVideo.ax(tid b88) 30508 : ::CheckTransform()
LAVVideo.ax(tid b88) 30508 : ::CheckTransform()
LAVVideo.ax(tid b88) 30552 : ::CheckTransform()
LAVVideo.ax(tid 144c) 30553 : ::NewSegment - 0 / 0
LAVVideo.ax(tid 918) 30568 : h264RandomAccess::parseForRecoveryPoint(): Found I frame
LAVVideo.ax(tid 918) 30568 : h264RandomAccess::parseForRecoveryPoint(): Found IDR slice
LAVVideo.ax(tid 918) 30685 : ::GetDeliveryBuffer(): Sample contains new media type from downstream filter..
LAVVideo.ax(tid 918) 30685 : -> Width changed from 1920 to 1920 (target: 1920)
LAVVideo.ax(tid 918) 30685 : ::CheckTransform()
LAVVideo.ax(tid 918) 30685 : SetMediaType -- out
LAVVideo.ax(tid 918) 64887 : EndOfStream, flushing decoder
LAVVideo.ax(tid 918) 65058 : EndOfStream finished, decoder flushed
LAVVideo.ax(tid b88) 69245 : ::BeginFlush
LAVVideo.ax(tid b88) 69250 : ::EndFlush
LAVVideo.ax(tid 144c) 69250 : ::NewSegment - 0 / 0
LAVVideo.ax(tid 918) 69255 : h264RandomAccess::parseForRecoveryPoint(): Found I frame
LAVVideo.ax(tid 918) 69255 : h264RandomAccess::parseForRecoveryPoint(): Found IDR slice
LAVVideo.ax(tid b88) 69351 : ::BeginFlush
LAVVideo.ax(tid b88) 69359 : ::EndFlush
LAVVideo.ax(tid f40) 75428 : ::BreakConnect
LAVVideo.ax(tid f40) 75428 : ff_lockmgr: mutex: 042C7600, op: 1
LAVVideo.ax(tid f40) 75428 : ff_lockmgr: mutex: 042C7600, op: 2
LAVVideo.ax(tid f40) 75428 : ::BreakConnect
LAVVideo.ax(tid f40) 75428 : Shutting down ffmpeg...
LAVVideo.ax(tid f40) 75428 : ff_lockmgr: mutex: 042C7600, op: 1
LAVVideo.ax(tid f40) 75438 : ff_lockmgr: mutex: 042C7600, op: 2
LAVVideo.ax(tid f40) 75438 : ff_lockmgr: mutex: 042C7600, op: 3
LAVVideo.ax(tid f40) 75438 : ff_lockmgr: mutex: 042C7648, op: 3
LAVVideo.ax(tid f40) 75438 : CTransformOutputPin::~CTransformOutputPin
H55 + i3-540 (Clarkdale) on Windows 7 32-bit.
Thanks!
nevcairiel
8th January 2012, 14:55
That message means that the "getOK" function returns failure.
Maybe Eric can shed some light on that; I sadly don't have any older hardware to test on.
rica
8th January 2012, 15:06
Thanks for checking out!
DragonQ
8th January 2012, 17:35
I also ran that debug version of LAV 0.44 on my i5-430M (Arrandale). Here's my log, looks exactly the same in the important bits:
LAVVideo.ax(tid 1da0) 2 : CTransformInputPin::CTransformInputPin
LAVVideo.ax(tid 1da0) 2 : CTransformOutputPin::CTransformOutputPin
LAVVideo.ax(tid 1da0) 2 : SetMediaType -- in
LAVVideo.ax(tid 1da0) 2 : ::CreateDecoder(): Creating new decoder...
LAVVideo.ax(tid 1da0) 2 : -> Process is mpc-hc.exe, blacklist: 0
LAVVideo.ax(tid 1da0) 2 : CDecQuickSync::Init(): Trying to open QuickSync decoder
LAVVideo.ax(tid 1da0) 61 : -> Decoder reports abnormal status
LAVVideo.ax(tid 1da0) 61 : -> Init Interfaces failed (hr: 0x80004005)
LAVVideo.ax(tid 1da0) 61 : -> Hardware decoder failed to initialize, re-trying with software...
LAVVideo.ax(tid 1da0) 62 : Shutting down ffmpeg...
LAVVideo.ax(tid 1da0) 62 : Initializing ffmpeg for codec 28
LAVVideo.ax(tid 1da0) 63 : -> Processing extradata of 163 bytes
LAVVideo.ax(tid 1da0) 63 : -> File extension: .mkv
LAVVideo.ax(tid 1da0) 63 : ff_lockmgr: mutex: 05A43518, op: 1
LAVVideo.ax(tid 1da0) 71 : ff_lockmgr: mutex: 05A43518, op: 2
LAVVideo.ax(tid 1da0) 71 : -> ffmpeg codec opened successfully (ret: 0)
LAVVideo.ax(tid 1da0) 71 : AVCodec init successfull. interlaced: 0
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 0
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 1
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 2
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 3
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 4
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 5
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 6
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 7
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 8
LAVVideo.ax(tid 1da0) 71 : ::GetMediaType(): position: 9
LAVVideo.ax(tid 1da0) 72 : ::GetMediaType(): position: 10
LAVVideo.ax(tid 1da0) 72 : ::GetMediaType(): position: 11
LAVVideo.ax(tid 1da0) 72 : ::GetMediaType(): position: 12
LAVVideo.ax(tid 1da0) 428 : Trying to connect Pins :
LAVVideo.ax(tid 1da0) 428 : <XForm Out>
LAVVideo.ax(tid 1da0) 428 : <EVR Input0>
LAVVideo.ax(tid 1da0) 428 : ::GetMediaType(): position: 0
LAVVideo.ax(tid 1da0) 428 : Trying media type:
LAVVideo.ax(tid 1da0) 429 : major type: MEDIATYPE_Video
LAVVideo.ax(tid 1da0) 429 : sub type : MEDIASUBTYPE_NV12
LAVVideo.ax(tid 1da0) 429 : ::CheckTransform()
LAVVideo.ax(tid 1da0) 429 : ::CheckTransform()
LAVVideo.ax(tid 1da0) 429 : SetMediaType -- out
LAVVideo.ax(tid 1da0) 430 : ::GetMediaType(): position: 0
LAVVideo.ax(tid 1da0) 430 : ::GetMediaType(): position: 1
LAVVideo.ax(tid 1da0) 430 : ::DecideBufferSize()
LAVVideo.ax(tid 1da0) 430 : ::CheckTransform()
LAVVideo.ax(tid 1da0) 438 : ::CheckTransform()
LAVVideo.ax(tid 1da0) 438 : Connection succeeded
LAVVideo.ax(tid c04) 1534 : ::CheckTransform()
LAVVideo.ax(tid c04) 1540 : ::CheckTransform()
LAVVideo.ax(tid c04) 1541 : ::CheckTransform()
LAVVideo.ax(tid c04) 1576 : ::CheckTransform()
LAVVideo.ax(tid 1d70) 1580 : ::NewSegment - 0 / 0
LAVVideo.ax(tid 81c) 1581 : h264RandomAccess::parseForRecoveryPoint(): Found IDR slice
LAVVideo.ax(tid 81c) 1613 : ::GetDeliveryBuffer(): Sample contains new media type from downstream filter..
LAVVideo.ax(tid 81c) 1613 : -> Width changed from 1920 to 1920 (target: 1920)
LAVVideo.ax(tid 81c) 1613 : ::CheckTransform()
LAVVideo.ax(tid 81c) 1613 : SetMediaType -- out
LAVVideo.ax(tid 1da0) 9578 : ::BeginFlush
LAVVideo.ax(tid 1da0) 9656 : ::EndFlush
LAVVideo.ax(tid 1da0) 9679 : ::BreakConnect
LAVVideo.ax(tid 1da0) 9679 : ::BreakConnect
LAVVideo.ax(tid 1da0) 9679 : Shutting down ffmpeg...
LAVVideo.ax(tid 1da0) 9679 : ff_lockmgr: mutex: 05A43518, op: 1
LAVVideo.ax(tid 1da0) 9691 : ff_lockmgr: mutex: 05A43518, op: 2
LAVVideo.ax(tid 1da0) 9691 : ff_lockmgr: mutex: 05A43518, op: 3
LAVVideo.ax(tid 1da0) 9691 : ff_lockmgr: mutex: 05A43560, op: 3
LAVVideo.ax(tid 1da0) 9691 : CTransformOutputPin::~CTransformOutputPin
nevcairiel
8th January 2012, 19:16
Eric,
seeking in this file (VC-1) causes the decoder to stop functioning, it just doesn't output any image anymore.
http://www.multiupload.com/2403BDXAFT
I didn't do any real checks yet, but it doesn't output any debug messages.
Esperado
8th January 2012, 21:10
In the first version I tried (with DVBViewer), MPEG-2 was out of sync, with the image ahead of the sound.
In the last two versions, H.264 freezes after half a second and crashes the program. MPEG-2 seems to work OK, but with fluctuations in the frame rate.
(ffdshow 32-bit on Windows 7 64-bit)
egur
8th January 2012, 21:12
Eric,
seeking in this file (VC-1) causes the decoder to stop functioning, it just doesn't output any image anymore.
http://www.multiupload.com/2403BDXAFT
I didn't do any real checks yet, but it doesn't output any debug messages.
I'll check it out.
Update
I've root caused the problem. It will take me a few days to fix.
I've also implemented a MT frame copy but unfortunately the performance boost is next to zero.
CruNcher
9th January 2012, 00:23
@Eric
Updated to the newest ArcSoft beta, and the decoder now seems to use DXVA. No more issues with LAV Splitter, Intel, and any .WMV; Build Keynote doesn't crash anymore, and the 15 fps clip now plays at 30 fps as it should :)
2.28.474.133 <- Problems
2.28.480.134 <- Fixed upcoming TMT5.2
Also, MC.ts works without issues, DXVA-accelerated, with their decoder. This is a little surprising, seeing that ffdshow-quicksync and LAV Video QuickSync fail at decoding it. The stream with the decoding error (H.264) also plays without that frame corruption.
http://img16.imageshack.us/img16/4839/arcsoftvc1dxva.png
No corruptions same for Cyberlinks Decoder
Lav Video refuses the connection for this file (hehe nev) ;)
ffdshow-quicksync
http://img834.imageshack.us/img834/3481/ffdshowquicksyncvc1tsde.png
So maybe this actually has something to do with the TS parsing, as it seems the QuickSync decoder (hardware) is OK ;)
PS: ArcSoft's DXVA also handles too much ref frames now, like the CoreAVC DXVA implementation :)
nevcairiel
9th January 2012, 07:52
PS: ArcSoft's DXVA also handles too much ref frames now, like the CoreAVC DXVA implementation :)
There is no such thing as "too much" ref frames, 16 is the maximum, and the hardware can decode it just fine.
egur
9th January 2012, 08:30
@Eric
Also, MC.ts works without issues, DXVA-accelerated, with their decoder. This is a little surprising, seeing that ffdshow-quicksync and LAV Video QuickSync fail at decoding it. The stream with the decoding error (H.264) also plays without that frame corruption.
I think both DXVA decoders are doing some kind of stream processing before sending the samples to the HW.
CruNcher
9th January 2012, 15:33
There is no such thing as "too much" ref frames, 16 is the maximum, and the hardware can decode it just fine.
Sorry, I meant out-of-spec ref frames in DXVA playback. Under normal conditions this results in decoding errors (blocking), though Mirillis, ArcSoft and CoreCodec found a way to avoid this in their implementations ;)
Also, Eric: testing WVC1 playback with LAV Video as well as ffdshow-quicksync showed lower overall utilization than with ArcSoft's DXVA decoder. That really surprised me the first time I saw it: even with the memory copy it's beating a DXVA decoder, and by a pretty wide margin.
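On the ref-frame point: one way to read "out of spec" here (an assumption, since the post does not spell it out) is a stream that uses more reference frames than its H.264 level allows. The 16-frame figure is the absolute cap; the per-level limit comes from the MaxDpbMbs value in Annex A of the H.264 spec. A minimal sketch of that calculation, with an invented function name:

```cpp
#include <cassert>

// Maximum number of frames the decoded picture buffer may hold for a given
// level limit (MaxDpbMbs, from H.264 Annex A) and picture size, capped at 16.
unsigned max_dpb_frames(unsigned max_dpb_mbs, unsigned width, unsigned height) {
    assert(width > 0 && height > 0);
    const unsigned mbs_wide = (width + 15) / 16;   // macroblocks are 16x16 pixels
    const unsigned mbs_high = (height + 15) / 16;
    const unsigned frames = max_dpb_mbs / (mbs_wide * mbs_high);
    return frames < 16 ? frames : 16;
}
```

With the Level 4.1 limit (MaxDpbMbs = 32768), `max_dpb_frames(32768, 1920, 1080)` gives 4 and `max_dpb_frames(32768, 1280, 720)` gives 9. Only at much higher levels, such as Level 5.1 (MaxDpbMbs = 184320), does 1080p reach the full 16. A 1080p stream tagged Level 4.1 that uses, say, 8 reference frames is therefore within the 16-frame cap but outside its level.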
egur
9th January 2012, 16:57
...
Also, Eric: testing WVC1 playback with LAV Video as well as ffdshow-quicksync showed lower overall utilization than with ArcSoft's DXVA decoder. That really surprised me the first time I saw it: even with the memory copy it's beating a DXVA decoder, and by a pretty wide margin.
This is odd, what are the numbers?
I also have a VC1 related question.
Does the VC1 spec allow sending the exact same image (e.g. same buffer) multiple times with different time stamps?
CruNcher
9th January 2012, 19:25
This is odd, what are the numbers?
I also have a VC1 related question.
Does the VC1 spec allow sending the exact same image (e.g. same buffer) multiple times with different time stamps?
The difference is 10% more CPU utilization for AP@L4 with ArcSoft's DXVA decoder. I think I also see why: it outputs YUY2 by default instead of NV12.
Arcsoft = Video: YUY2 1920x1080 50.00fps 14100kbps
Lav Video = Video: NV12 1920x1080 50.00fps
Though changing LAV Video to YUY2 doesn't cause such a big difference at all. Strange, strange.
So in the end, on MPC-HC, ArcSoft's DXVA renders with a utilization of 22% vs 12%.
nevcairiel
9th January 2012, 20:16
This is odd, what are the numbers?
I also have a VC1 related question.
Does the VC1 spec allow sending the exact same image (e.g. same buffer) multiple times with different time stamps?
Yes, VC-1 has a feature called skipped P-frames, which means it's supposed to output the previous P-frame again. I can imagine that's what's happening here.
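A rough sketch of what handling a skipped P-frame can look like on the output side: the stage remembers the last delivered picture and, for a skipped frame, hands out the same buffer again with the new time stamp. All types and names here (`Picture`, `OutputSample`, `OutputStage`) are hypothetical and not taken from the QuickSync decoder or LAV.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical decoded picture; a real decoder would hold a surface here.
struct Picture {
    std::vector<std::uint8_t> pixels;
};

// What gets sent downstream: a picture plus its presentation time.
struct OutputSample {
    std::shared_ptr<const Picture> picture;
    std::int64_t timestamp;
};

class OutputStage {
public:
    // Normal case: a newly decoded picture becomes the "last" picture.
    OutputSample on_decoded(std::shared_ptr<const Picture> picture, std::int64_t timestamp) {
        last_ = std::move(picture);
        return OutputSample{last_, timestamp};
    }

    // Skipped frame: repeat the previous picture with the new time stamp.
    // Downstream code must therefore accept the same buffer more than once.
    OutputSample on_skipped(std::int64_t timestamp) const {
        return OutputSample{last_, timestamp};
    }

private:
    std::shared_ptr<const Picture> last_;
};
```

The point of the `shared_ptr` is that a queue keyed on buffer identity would otherwise treat the repeated picture as a duplicate; sharing ownership lets the same buffer sit in the queue twice with different time stamps.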
egur
9th January 2012, 21:33
Yes, VC-1 has a feature called skipped P-frames, which means it's supposed to output the previous P-frame again. I can imagine that's what's happening here.
Good to know. I need to handle this situation.
This is not relevant to LAV since you implicitly disable my internal queuing (by disabling the time stamp correction).
After I fix this, I'll release another ffdshow build.
@CruNcher
10x for the info. I thought you meant it was faster than DXVA QuickSync (e.g. MS DTV-DVD decoder).
My decoder is usually much faster than Nvidia or AMD in all supported codecs.
The main goal of the QS decoder is power saving + low cpu utilization while retaining a simple (relatively) and open SW architecture.
betaking
10th January 2012, 06:03
Good to know. I need to handle this situation.
This is not relevant to LAV since you implicitly disable my internal queuing (by disabling the time stamp correction).
After I fix this, I'll release another ffdshow build.
@CruNcher
10x for the info. I thought you meant it was faster than DXVA QuickSync (e.g. MS DTV-DVD decoder).
My decoder is usually much faster than Nvidia or AMD in all supported codecs.
The main goal of the QS decoder is power saving + low cpu utilization while retaining a simple (relatively) and open SW architecture.
Compiling the standalone IntelQuickSyncDecoder.dll failed:
Running C/C++ code analysis...
1>QuickSyncUtils.cpp(179): error C2220: warning treated as error - no 'object' file generated
1>c:\qsdecoder\intelquicksyncdecoder\quicksyncutils.cpp(176): warning C6312: Possible infinite loop: use of the constant EXCEPTION_CONTINUE_EXECUTION in the exception-filter expression of a try-except. Execution restarts in the protected block
1>c:\qsdecoder\intelquicksyncdecoder\quicksyncutils.cpp(177): warning C6322: Empty _except block
Generating code...
1>Done building project "C:\qsdecoder\IntelQuickSyncDecoder\IntelQuickSyncDecoder.vcxproj" (build target(s)) -- FAILED.
But compiling with the latest ffdshow works without problems!
egur
10th January 2012, 10:20
Static analysis failure fixed at rev21.
betaking
10th January 2012, 10:33
Static analysis failure fixed at rev21.
thank you for quick fix!
NikosD
10th January 2012, 16:58
@Egur
Have you seen this ? (http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/)
A little old, though.
nevcairiel
10th January 2012, 17:13
@Egur
Have you seen this ? (http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/)
A little old, though.
He already uses that instruction.
egur
10th January 2012, 20:13
@Egur
Have you seen this ? (http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/)
A little old, though.
It may show some improvement (using a 4K cache), I'll test it. This is similar to what VLC does.
This article refers to Penryn - the first Intel cpu with the movntdqa instruction (SSE4.1). My version has an optimization that doesn't appear in the article so maybe an even faster version is possible.
What was optimal for Penryn might not be optimal for SandyBridge and vice versa.
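For readers who have not seen the technique, here is a minimal sketch of the general idea discussed above: read the source with streaming loads (movntdqa, `_mm_stream_load_si128`) into a small cache-resident buffer of 4K, then write that buffer out to the destination. This is not the decoder's copy function and it does not include the extra optimization mentioned above; the function name, the choice of streaming stores for the second stage, and the alignment requirements are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <smmintrin.h>  // SSE4.1: _mm_stream_load_si128 (movntdqa)

// Let GCC/Clang compile just this function with SSE4.1 enabled.
#if defined(__GNUC__) || defined(__clang__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

// Copies `size` bytes via a 4K bounce buffer. Assumes size is a multiple
// of 16 and both pointers are 16-byte aligned.
USE_SSE41
void copy_via_cache_buffer(void* dst, const void* src, std::size_t size) {
    constexpr std::size_t kCacheBytes = 4096;
    alignas(64) unsigned char cache[kCacheBytes];

    assert(size % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);

    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);

    for (std::size_t offset = 0; offset < size; offset += kCacheBytes) {
        const std::size_t n = (size - offset < kCacheBytes) ? size - offset : kCacheBytes;

        // Stage 1: streaming loads from the source into the cached buffer.
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i* from = const_cast<__m128i*>(
                reinterpret_cast<const __m128i*>(s + offset + i));
            _mm_store_si128(reinterpret_cast<__m128i*>(cache + i),
                            _mm_stream_load_si128(from));
        }

        // Stage 2: write the cached block to the destination with
        // non-temporal stores so it does not evict other data.
        for (std::size_t i = 0; i < n; i += 16) {
            _mm_stream_si128(reinterpret_cast<__m128i*>(d + offset + i),
                             _mm_load_si128(reinterpret_cast<const __m128i*>(cache + i)));
        }
    }
    _mm_sfence();  // make the non-temporal stores visible before returning
}
```

On ordinary write-back memory the streaming load behaves like a normal load, so the function is still correct there; any speedup only shows up when the source is write-combining memory such as a locked GPU surface.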
NikosD
10th January 2012, 22:26
Eric,
Intel released a new version of Media SDK 2012 (http://software.intel.com/en-us/articles/vcsource-tools-media-sdk/)
Do we have to wait for new Intel drivers to integrate it, or is it possible to use it with current drivers ?
Can we install it over the current Media SDK installed by latest Intel's driver ?
Does it have any speed or other improvement regarding your QS decoder ?
I think I read that it has better handling of CPU <-> GPU memory communication.
I don't know if you have tested it already with your QS decoder.
UPDATE: Is there an easy way - without registration - to download Intel Media Checker ?
nevcairiel
10th January 2012, 22:54
That's just the final version of the 3.0 SDK, which was in beta for like a year or so. I don't expect many SDK related changes.
CharlieCL
11th January 2012, 03:24
It may show some improvement (using a 4K cache), ...
It may be faster by using AVX instructions which are 256 bits.
nevcairiel
11th January 2012, 08:18
It may be faster by using AVX instructions which are 256 bits.
It's a common misconception that a new instruction set will magically make everything better.
Each SIMD instruction set has a very specific set of functions for very specific tasks; it's not a generic set of instructions like x86 itself is.
AVX deals mostly with floating point operations, which are useless in this case. AVX2, which will come with the Haswell micro-architecture, will expand this to Integer/Memory operations, and might then potentially be useful.
egur
11th January 2012, 08:48
The new Intel Media SDK 2012, which exposes a new API (v1.3 instead of 1.1), will come into play when a future driver is released.
Current driver (v2559) supports API version 1.1.
From the technical side of things, linking with the new MSDK will change nothing. The HW implementation DLL is shipped with the driver. The library that the QS decoder links to is a dispatch library. It just selects which implementation to use (HW: the graphics driver DLL, or SW: a DLL that comes with the MSDK).
If linking with the SW implementation, you can use the new features. This is important for developers to get a head start on development before production drivers and new HW (e.g. IvyBridge) arrive.
I'll update the MSDK headers and lib with the new MSDK in SVN, but again, it changes nothing for the time being.
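The dispatch behaviour described above can be pictured with a small self-contained model. This is not Media SDK code: the types and the `dispatch` function are invented for illustration, and the rule "same major version, at least the requested minor version" is an assumption about how such a check typically works.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an API version such as 1.1 or 1.3.
struct ApiVersion {
    unsigned api_major;
    unsigned api_minor;
};

enum class ImplKind { Hardware, Software };

// One installed implementation DLL and the API version it supports.
struct Implementation {
    ImplKind kind;
    ApiVersion supported;
};

// Assumed compatibility rule: same major version, minor at least as new.
bool satisfies(const ApiVersion& have, const ApiVersion& want) {
    return have.api_major == want.api_major && have.api_minor >= want.api_minor;
}

// Prefers the hardware implementation and falls back to software.
// Returns nullptr if nothing installed supports the requested version.
const Implementation* dispatch(const std::vector<Implementation>& installed,
                               const ApiVersion& requested) {
    for (ImplKind preferred : {ImplKind::Hardware, ImplKind::Software}) {
        for (const Implementation& impl : installed) {
            if (impl.kind == preferred && satisfies(impl.supported, requested)) {
                return &impl;
            }
        }
    }
    return nullptr;
}
```

With a driver DLL at API 1.1 and the SDK's software DLL at 1.3, a request for 1.1 lands on hardware, while a request for 1.3 can only be served by software. That is why relinking against the newer SDK alone changes nothing for a decoder that wants the hardware path.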
Performance update
I tried using the implementation from the old Intel article.
I used 2 clips, both 1080p: one was relatively low bitrate (an mp4 file I downloaded from the net) and the other high bitrate (Samsung underwater clip, ~40mbps).
In the Samsung clip nothing changed; the HW decoder was slower than the frame copying.
The low bitrate clip ran at ~700 fps using the article's method and ~770 fps using the existing method.
Nehalem/Westmere and Sandybridge are very different when it comes to memory operations (SandyBridge is much better of course :) ).
I'll forward this topic to architecture as well as the author of this article. Maybe there's a 3rd way to do things.
BTW, AVX will not help in this case. We can try AVX2 (not sure it's relevant either) when it's available to the public.
NikosD
11th January 2012, 09:14
Thanks for the detailed info about Media SDK.
In my opinion your QS decoder has only one obvious "weak" point.
Handling of 60fps and more clips, no matter the bitrate, if it's high or low.
So I think you could try the old implementation or any other future implementation mainly in those > 30 fps clips, like 50fps or 60fps.
There are plenty of such files in the link I gave you before here (http://xhmikosr.1f0.de/index.php?folder=c2FtcGxlcy8yMTYwcA==) and with versions at 1080p compatible with QS.
Is it possible to upload Intel Media Checker (http://software.intel.com/partner/app/software-assessment?locale=en-US&cid=ISPP:106BL103ENG1721&utm_campaign=sat-awareness&utm_content=flash-video&utm_source=flash&utm_medium=flash-video) somewhere easy to download ?
Thanks!
!llus!on
11th January 2012, 13:06
Is it possible to upload Intel Media Checker (http://software.intel.com/partner/app/software-assessment?locale=en-US&cid=ISPP:106BL103ENG1721&utm_campaign=sat-awareness&utm_content=flash-video&utm_source=flash&utm_medium=flash-video) somewhere easy to download ?
Thanks!
Here you go: Intel Media Checker v2.0 Multiupload (http://www.multiupload.com/ZQH1W3YGDF)
egur
11th January 2012, 13:30
Thanks for the detailed info about Media SDK.
In my opinion your QS decoder has only one obvious "weak" point.
Handling of 60fps and more clips, no matter the bitrate, if it's high or low.
So I think you could try the old implementation or any other future implementation mainly in those > 30 fps clips, like 50fps or 60fps.
There are plenty of such files in the link I gave you before here (http://xhmikosr.1f0.de/index.php?folder=c2FtcGxlcy8yMTYwcA==) and with versions at 1080p compatible with QS.
...
I'll use a single copy function, whichever is fastest.
One of the next steps is to output D3D9 surfaces (without any copying) and the renderer will be in charge of subtitles, etc. This flow is limited to a single GPU setup and I'm not sure how subtitles will be rendered.
I don't see a problem with 60fps. All HD clips play well above 60fps.
At low bitrates, using my i7-2600 and 1333MHz DDR3, I can pull ~800fps (NULL renderer). At very high bitrates, it drops to 100+ because the HW decoder is slowing the process. When >60fps movies (which need a matching screen & fast cable) become more common, the HW should be fast enough.
Personally, I don't see the benefit of >60fps movies.
My fear is that there's some part in my code that limits performance by blocking the operation of the HW decoder. Maybe this type of flow would benefit from faster RAM, I'll buy 1833MHz RAM for my own HTPC.
NikosD
11th January 2012, 13:47
Here you go: Intel Media Checker v2.0 Multiupload (http://www.multiupload.com/ZQH1W3YGDF)
Thanks a lot!
I don't see a problem with 60fps. All HD clips play well above 60fps.
Yes they do. But they push CPU in Turbo mode even in normal playback.
Personally, I don't see the benefit of >60fps movies.
It's an open issue with a lot of different views.
60fps clips are by far smoother and "polished" than 30fps.
It's clear even to the naked eye and to amateurs. You don't have to be a professional to see this.
I know the theory about the inability of the human eye to see more than 30fps.
For me the difference between 30fps and 60fps is obvious, but I will not go on about this subject.
My fear is that there's some part in my code that limits performance by blocking the operation of the HW decoder. Maybe this type of flow would benefit from faster RAM, I'll buy 1833MHz RAM for my own HTPC.
If you check Media Performance with the help of GPA, you will see that during playback & benchmarking, the QS decoder is not utilizing the MFX engine as much as DXVA does.
So, there is definitely a bottleneck somewhere in the code between your DLL and the HW.
Are you sure it's your code and not Intel Media SDK ?
NikosD
11th January 2012, 13:57
I hope Intel is going to fix this (http://semiaccurate.com/2012/01/09/intel-fakes-ivy-bridge-graphics-on-stage-at-ces/) before April.
Ivy means a lot to the whole CPU industry.
egur
11th January 2012, 14:06
It's an open issue with a lot of different views.
60fps clips are by far smoother and "polished" than 30fps.
>60 means greater than 60. I agree about 60fps being better than 30 fps.
If you check Media Performance with the help of GPA, you will see that during playback & benchmarking, the QS decoder is not utilizing the MFX engine as much as DXVA does.
So, there is definitely a bottleneck somewhere in the code between your DLL and the HW.
Are you sure it's your code and not Intel Media SDK ?
One explanation is that the renderer must do different work (copy the frame to the GPU in my case).
The other option is that the DXVA decoders run the HW decode on another thread. This is something on my TODO list. I wanted to thread the copy first and get that stable (I think it's now 100% stable) and hopefully optimized.
The decoder is still in beta so I'm not done yet :D
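The idea of running the HW decode on its own thread, separate from the frame copy, can be sketched as a two-stage pipeline joined by a bounded queue. This is a minimal illustration only, not the QS decoder's actual code: `run_pipeline`, `decode` and `copy` are hypothetical stand-ins, and nothing here calls the Media SDK.

```python
import queue
import threading


def run_pipeline(frames, decode, copy, depth=4):
    """Decode on one thread and copy on another, joined by a bounded queue.

    `decode` and `copy` are placeholder callables supplied by the caller.
    `depth` caps how many decoded frames may wait for the copy stage.
    """
    handoff = queue.Queue(maxsize=depth)  # bounded: decode cannot run far ahead
    done = object()                       # sentinel marking end of stream
    out = []

    def decode_worker():
        try:
            for frame in frames:
                handoff.put(decode(frame))  # blocks while the queue is full
        finally:
            handoff.put(done)               # always release the copy thread

    def copy_worker():
        while True:
            item = handoff.get()
            if item is done:
                break
            out.append(copy(item))

    workers = [threading.Thread(target=decode_worker),
               threading.Thread(target=copy_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out


if __name__ == "__main__":
    # Toy stages: "decode" doubles the value, "copy" adds one.
    print(run_pipeline(range(5), lambda f: f * 2, lambda s: s + 1))
```

With one producer and one consumer the queue keeps frames in order, and the bounded depth means a slow copy stage throttles decoding instead of letting decoded surfaces pile up.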
The MSDK may or may not block performance (probably it doesn't). It's hard to tell and there's nothing to do about it...
One way to check is to look at the utilization of the various MSDK sample decoders which are DXVA compliant (e.g. via the DXVA checker). If you have time, you can try. Note that these decoders are sample code and don't have the fancy features of real products (e.g. multi threading), so the results may not help much. If the samples have high MFX engine utilization, then my code is definitely to blame. The other case is less obvious since the samples don't have multi threading (for a good reason: it would complicate the samples too much).
egur
11th January 2012, 14:20
I hope Intel is going to fix this (http://semiaccurate.com/2012/01/09/intel-fakes-ivy-bridge-graphics-on-stage-at-ces/) before April.
Ivy means a lot to the whole CPU industry.
Just a demo of things to come, take this in the right proportions :)
I'm not worried about IvyBridge at all.
NikosD
11th January 2012, 16:06
After reading some more technical details about Sandy's GPU, I think that a good optimization for Frame Copy could be the use of LLC (Last Level Cache - L3) and ring interconnect to rapidly pass data from the GPU back to the CPU.
Almost any GPU data can be held in the LLC.
A flush command is needed to force data to be written back to the LLC prior to the CPU reading it.
The driver can also allocate a portion of the LLC as a non-coherent cache for display data and other uses.
Also, AMD has introduced an OpenCL extension for a zero copy mechanism on Windows systems and, from what I read, Intel will presumably follow once they have OpenCL and DirectCompute capable hardware, which means IvyBridge, because SandyBridge has no support for OpenCL and DirectCompute.
CharlieCL
11th January 2012, 17:23
It's a common misconception that a new instruction set will magically make everything better.
Each SIMD instruction set has a very specific set of functions for very specific tasks, it's not a generic set of instructions like x86 itself is.
AVX deals mostly with floating point operations, which are useless in this case. AVX2, which will come with the Haswell micro-architecture, will expand this to Integer/Memory operations, and might then potentially be useful.
Sorry, I supposed that AVX was a complete instruction set. But this is Intel's mistake. An instruction set should stay usable for more than 10 years, so when they add one instruction they should think about 10 years of usage. Intel's SIMD instruction set is a failure. At first there was MMX, then SSE, SSE2, SSE3, SSE4, now AVX, future AVX2. This is similar to graphic cards from DX8, DX9, DX10, DX11. The GPU instruction set (shader model) was changed every year.
I hope that AVX will not repeat the mistake of SSE to generate AVX2, AVX3, AVX4 ... in the future. In this way there are few benefits for developers to use them.
NikosD
11th January 2012, 17:38
But they have to sell "new" processors/ graphics cards with "new" capabilities not found in previous generations.
nevcairiel
11th January 2012, 17:53
This is similar to graphic cards from DX8, DX9, DX10, DX11. The GPU instruction set (shader model) was changed every year.
Instruction sets are different, they very rarely replace each other, they add new instructions, you don't just use SSE4, you use SSE2+SSSE3+SSE4 in conjunction with each other.
The only cases of replacing were SSE2, which basically was MMX in 128-bit (instead of 64), and soon AVX2, which is SSE2 in 256-bit.
The comparison to GPUs is flawed in every way.
Also, what would you propose, just not invent new instruction sets? :p
If you had ever worked with them, you would know that they all do a very good job at doing *exactly* what's required for multimedia applications.
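To put numbers on the widening mentioned above (MMX at 64-bit, SSE2 at 128-bit, AVX2 at 256-bit for integer work), here is a small arithmetic sketch. It counts only how many 8-bit pixel values fit in one register and how many register-wide operations one 1920-pixel row would need; it says nothing about real-world speed, and the helper names are invented for the example.

```python
# Integer SIMD register widths in bits.
REGISTER_BITS = {"MMX": 64, "SSE2": 128, "AVX2": 256}


def lanes(isa, element_bits=8):
    """Number of elements of `element_bits` that fit in one register."""
    return REGISTER_BITS[isa] // element_bits


def ops_per_row(isa, row_pixels=1920, element_bits=8):
    """Register-wide operations needed to touch every pixel of a row once."""
    n = lanes(isa, element_bits)
    return -(-row_pixels // n)  # ceiling division


if __name__ == "__main__":
    for isa in REGISTER_BITS:
        print(f"{isa}: {lanes(isa)} lanes, {ops_per_row(isa)} ops per 1920-pixel row")
```

Each doubling of register width halves the operation count for the same row, which is why the wider integer sets read as "the same thing, but wider" rather than as a new kind of instruction.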
NikosD
11th January 2012, 18:26
You clearly misunderstood the meaning of his post.
The key point is:
Build it right and complete from the beginning like Apple, IBM and Motorola did with AltiVec and the PowerPC line.
As for GPUs they aren't directly comparable to CPUs but they also add things little by little just to have more selling points in their products.