Old 8th January 2012, 01:37   #441  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,342
Quote:
Originally Posted by rica View Post
EDIT: Here is the test file for you. (It is working with ffdshow 4225, btw.):

http://www.mediafire.com/?x2jk7irhns6506d


_ _ _ _
Not sure what that file is meant to show..? Decodes just fine.

Anyhow, you're saying that QuickSync in ffdshow works on your Clarkdale, but with LAV it doesn't?
Did you check the CPU usage to confirm that it's really using hardware decoding?
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Old 8th January 2012, 01:41   #442  |  Link
rica
Registered User
 
Join Date: Mar 2008
Posts: 2,021
Quote:
Originally Posted by nevcairiel View Post
Not sure what that file is meant to show..? Decodes just fine.

Anyhow, you're saying that QuickSync in ffdshow works on your Clarkdale, but with LAV it doesn't?
Did you check the CPU usage to confirm that it's really using hardware decoding?
Sure I did. I will add the screen caps if you have enough time to wait.

EDIT: Here they are:

[screen captures]

_ _ _ _

Last edited by rica; 8th January 2012 at 02:06.
Old 8th January 2012, 01:52   #443  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,342
I can probably patch up a debug version that shows why QS fails, maybe it sheds some light on things.

Edit:
http://files.1f0.de/lavf/LAVVideo-0.44-debug.zip

Throw that on top of 0.44, and a log file should appear on your desktop. Maybe there is something interesting in there...
Paste the log on pastebin or something; I don't want to wait for attachment approval.

Last edited by nevcairiel; 8th January 2012 at 02:02.
Old 8th January 2012, 02:11   #444  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,342
Quote:
Originally Posted by rica View Post
EDIT: Here, they are:



That's DXVA, not QuickSync.
In any case, a log file would be useful.
Old 8th January 2012, 02:21   #445  |  Link
rica
Registered User
 
Join Date: Mar 2008
Posts: 2,021
OK, tomorrow/or today.
Thanks!
Old 8th January 2012, 04:12   #446  |  Link
CruNcher
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
Quote:
Originally Posted by egur View Post
EVR is special in the sense that its output is different between Intel, Nvidia and AMD. It is also different between GPU families and generations.
Many people think EVR uses simple bilinear interpolation for scaling. For old cards, they are right. For the new cards, that is very wrong.
Sandy Bridge uses a very advanced scaler, so it's best to test for yourself. The power/performance of the GPU is best utilized by the EVR. You should check the quality by applying a large scale factor to a clip (e.g. 4x) as well as a very small scale factor (e.g. 1/4x).

The biggest issue is still subtitling, though, which at least in MPC-HC still depends on EVR Custom.
So you always have one issue: either no subtitling or all the deinterlacing pain. Only a few renderers can do everything (DXVA + deinterlacing + subtitling), and all of them are custom DirectX renderers. madVR could be another one once it supports DXVA + custom shader code. I wouldn't agree with "The power/performance of the GPU is best utilized by the EVR", though: these days the GPU can only be fully utilized by a custom renderer that uses the same backend as a game engine does.
__________________
all my compares are riddles so please try to decipher them yourselves :)

It is about Time

Join the Revolution NOW before it is to Late !

http://forum.doom9.org/showthread.php?t=168004

Last edited by CruNcher; 8th January 2012 at 06:32.
Old 8th January 2012, 08:47   #447  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by egur View Post
Version 0.22 beta is out with the following changes:
* Much better multi-threading code (many fixes from v0.21).
* Fixed dynamic aspect ratio change during playback.
* FFDShow rev4227
Quote:
Originally Posted by nevcairiel View Post
LAV Filters 0.44 now also "officially" features the 0.22 decoder.
Multi-threading is still off for the time being, until I can do proper testing.
I have done no tests with FFDShow 0.22 or LAV Filters 0.44, but I think that using multi-threaded code for the work that has to be done by the IA cores may provide a solution for everything.

I mean that multi-threaded code should:

1) Increase the throughput required by 60fps clips

2) Feed the QS decoding engine better, and

3) Push QS to maximum speed (frequency)

After all these, the decoding performance of 60fps clips should definitely increase.

About power consumption: the CPU frequency will come down from the Turbo Mode used by single-threaded code during playback of 60fps clips, and power consumption will probably go down too.

During benchmarking, or during playback of future difficult clips at 120fps, the power consumption will increase again, I think.

Looking forward to testing your next optimized multi-threaded versions in real tests.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1)
HEVC decoding benchmarks
H.264 DXVA Benchmarks for all
Old 8th January 2012, 09:02   #448  |  Link
egur
QuickSync Decoder author
 
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
Quote:
Originally Posted by NikosD View Post
I have done no tests with FFDShow 0.22 or LAV Filters 0.44, but I think that using multi-threaded code for the work that has to be done by the IA cores may provide a solution for everything.

I mean that multi-threaded code should:

1) Increase the throughput required by 60fps clips

2) Feed the QS decoding engine better, and

3) Push QS to maximum speed (frequency)

After all these, the decoding performance of 60fps clips should definitely increase.

About power consumption: the CPU frequency will come down from the Turbo Mode used by single-threaded code during playback of 60fps clips, and power consumption will probably go down too.

During benchmarking, or during playback of future difficult clips at 120fps, the power consumption will increase again, I think.

Looking forward to testing your next optimized multi-threaded versions in real tests.
MT's purposes are the following:
1) Reduce decode thread latency - the decode thread will do just the HW decode and delivery of decoded images down the pipeline. A worker thread will do the rest - the most time-consuming tasks are the frame copy and locking the D3D9 surfaces. This allows more CPU work (video processing) to be performed after decode.
2) Increase performance by adding parallelism - since the HW decode and the frame copying work in parallel, the HW decoder is better utilized, allowing more FPS.

The MT work is not done. I believe I can achieve better performance than v0.22. v0.22 is much more stable than 0.21.
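
For readers who want to see the shape of that split, here is a minimal sketch of a two-thread hand-off. It is only an illustration in Python, not the decoder's actual code: `run_pipeline`, `queue_depth` and the stand-in "decode" and "copy" steps are all hypothetical names, and delivery further down the pipeline is left out.

```python
import queue
import threading


def run_pipeline(frames, queue_depth=4):
    """Hand frames from a decode thread to a worker thread that copies them."""
    # Bounded queue: the decode thread blocks when the worker falls behind,
    # roughly like running out of free surfaces.
    handoff = queue.Queue(maxsize=queue_depth)
    copied_frames = []
    done = object()  # sentinel marking end of stream

    def decode_thread():
        for frame in frames:
            # Stand-in for the hardware decode call: the thread only passes
            # the frame on and is immediately free for the next one.
            handoff.put(frame)
        handoff.put(done)

    def worker_thread():
        while True:
            frame = handoff.get()
            if frame is done:
                break
            # Stand-in for locking the surface and copying it out.
            copied_frames.append(bytes(frame))

    threads = [threading.Thread(target=decode_thread),
               threading.Thread(target=worker_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return copied_frames
```

With one producer and one consumer on a FIFO queue, frames come out in the order they went in, so the copy can overlap with the next decode without reordering anything.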
__________________
Eric Gur,
Processor Application Engineer for Overclocking and CPU technologies
Intel QuickSync Decoder author
Intel Corp.
Old 8th January 2012, 09:24   #449  |  Link
CruNcher
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
@ Eric
You always post that sf.net URL with a " at the end.

BTW: Nev does the QFHD fallback to LAV (CPU). It's better not to do that for ffdshow-quicksync, though, also in terms of having a comparison point, as Nev has no option in LAV Video to disable this restriction.

Last edited by CruNcher; 8th January 2012 at 09:55.
Old 8th January 2012, 09:35   #450  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by egur View Post
MT's purposes are the following:
1) Reduce decode thread latency - the decode thread will do just the HW decode and delivery of decoded images down the pipeline. A worker thread will do the rest - the most time-consuming tasks are the frame copy and locking the D3D9 surfaces. This allows more CPU work (video processing) to be performed after decode.
2) Increase performance by adding parallelism - since the HW decode and the frame copying work in parallel, the HW decoder is better utilized, allowing more FPS.

The MT work is not done. I believe I can achieve better performance than v0.22. v0.22 is much more stable than 0.21.
So, does the MT code involve the IA cores, the GPU cores, the MFX engine, or all of them?
Could you give percentages for each component using MT code?
Old 8th January 2012, 10:42   #451  |  Link
egur
QuickSync Decoder author
 
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
Quote:
Originally Posted by NikosD View Post
So, does the MT code involve the IA cores, the GPU cores, the MFX engine, or all of them?
Could you give percentages for each component using MT code?
MT means running two or more threads on the IA cores, i.e. running two or more code paths in parallel.

The QS decoder calls API functions of the Intel Media SDK to utilize the MFX engine. The Media SDK abstracts the communication with the HW much better than DXVA does. The GPU cores (EUs) may be involved in some internal operations (I don't know too much about this), but the bulk of the work is done by the MFX engine.
GPU parallelism as well as EU usage is abstracted by the MSDK and may change from generation to generation (or even between driver versions).
Old 8th January 2012, 10:48   #452  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,342
Quote:
Originally Posted by NikosD View Post
1) Increase the throughput required by 60fps clips
Even the most difficult clips you could find decode at 120fps, and typical Blu-rays decode at 300+ fps, so I think 60fps clips are fine.
You overestimate what multi-threading means: during normal playback there will be nearly zero difference, and you only see it when benchmarking, so it's really not all that great.
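
To put rough numbers on that headroom, here is a small back-of-the-envelope calculation. It is only a sketch: `decode_headroom` is a hypothetical helper, and it assumes decode time per frame is simply the inverse of the benchmark frame rate.

```python
def decode_headroom(decode_fps, playback_fps):
    """Return (per-frame budget in ms, decode time in ms, fraction of budget used)."""
    budget_ms = 1000.0 / playback_fps  # time available per displayed frame
    decode_ms = 1000.0 / decode_fps    # time the decoder needs per frame
    return budget_ms, decode_ms, decode_ms / budget_ms


budget, cost, used = decode_headroom(decode_fps=120, playback_fps=60)
print(f"budget {budget:.2f} ms, decode {cost:.2f} ms, {used:.0%} of budget used")
```

Under that assumption, a clip that benchmarks at 120fps uses about half of the 16.67 ms frame budget at 60fps playback, and one that benchmarks at 300fps uses about a fifth.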
Old 8th January 2012, 11:04   #453  |  Link
CruNcher
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
When you use QuickSync decoding for encoding, it can be.

Eric, do you know why the new driver has been removed? http://webcache.googleusercontent.co...wnldID%3D20676

I had no issues with it.

Last edited by CruNcher; 8th January 2012 at 11:14.
Old 8th January 2012, 11:18   #454  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
@Egur

Thanks for the info.
The misunderstanding arose from the following.
You used the term "HW decode thread", which usually means decoding that is not done on the CPU IA cores. CPU decoding is usually referred to as software decoding.
So by "HW decode thread" you actually mean the work that has to be done on the CPU to prepare the data for HW decoding in the MFX engine.
No actual decoding happens on the CPU.

@Nevcairiel and Egur

I expect MT code to drop the CPU frequency during playback of 60fps clips from Turbo Mode to a much lower frequency, by spreading the load across more cores.

Have you seen for yourselves that even in normal playback of 60fps clips the CPU goes into Turbo Mode, increasing power consumption?

Not in pure DXVA mode and not with other clips below 60fps.

Only with the FFDShow QS decoder and only at 60fps and above.

Last edited by NikosD; 8th January 2012 at 11:24.
Old 8th January 2012, 11:23   #455  |  Link
CruNcher
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
NikosD, don't you understand that the frame copy GPU->CPU is putting pressure on the CPU?

Last edited by CruNcher; 8th January 2012 at 11:26.
Old 8th January 2012, 11:27   #456  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by CruNcher View Post
NikosD, don't you understand that the frame copy GPU->CPU is putting pressure on the CPU?
Read my previous post again please.
Old 8th January 2012, 11:39   #457  |  Link
CruNcher
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
Over here it's not always at full frequency (on the balanced power profile); it fluctuates, as expected, due to the frame copy. Playback is fine with 4 girls and 5 birds, though, and 2k is absolutely smooth as well, the same for low-bitrate 4k. The problems get heavy with high-bitrate QFHD. And yeah, MT might be able to distribute the load more evenly so that the frequency fluctuates less. Also keep in mind that frequency isn't really costing that much power at all; the voltage increase is the main factor here. You will not get DXVA consumption from ffdshow-quicksync; it will always be lower with DXVA, solely due to the frame copy.

Last edited by CruNcher; 8th January 2012 at 12:01.
Old 8th January 2012, 11:52   #458  |  Link
egur
QuickSync Decoder author
 
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
Quote:
Originally Posted by NikosD View Post
@Egur

Thanks for the info.
The misunderstanding arose from the following.
You used the term "HW decode thread", which usually means decoding that is not done on the CPU IA cores. CPU decoding is usually referred to as software decoding.
So by "HW decode thread" you actually mean the work that has to be done on the CPU to prepare the data for HW decoding in the MFX engine.
No actual decoding happens on the CPU.

@Nevcairiel and Egur

I expect MT code to drop the CPU frequency during playback of 60fps clips from Turbo Mode to a much lower frequency, by spreading the load across more cores.

Have you seen for yourselves that even in normal playback of 60fps clips the CPU goes into Turbo Mode, increasing power consumption?

Not in pure DXVA mode and not with other clips below 60fps.

Only with the FFDShow QS decoder and only at 60fps and above.
As CruNcher said, frame copy takes CPU cycles. That's a fact. In 1080p@60 there's a lot to copy, hence the higher CPU usage. A pure DXVA solution will always be faster (unless it's buggy or poorly implemented).
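
As a rough illustration of "a lot to copy", the arithmetic can be sketched as below. This assumes 8-bit 4:2:0 frames at 12 bits (1.5 bytes) per pixel, which is my assumption and not something stated in this thread, and it ignores surface pitch/padding; `copy_bandwidth_mb_per_s` is a hypothetical helper name.

```python
def copy_bandwidth_mb_per_s(width, height, fps, bytes_per_pixel=1.5):
    """Bytes copied per second for uncompressed frames, in MB/s (10**6 bytes)."""
    frame_bytes = width * height * bytes_per_pixel  # one decoded frame
    return frame_bytes * fps / 1e6


# 1080p at 60 fps versus 1080p at 24 fps, same assumed pixel format.
print(copy_bandwidth_mb_per_s(1920, 1080, 60))  # roughly 187 MB/s
print(copy_bandwidth_mb_per_s(1920, 1080, 24))  # roughly 75 MB/s
```

Each frame is about 3.1 MB under these assumptions, so 60fps material needs 2.5x the copy work per second of 24fps material at the same resolution.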

Regarding Turbo, in my tests Turbo wasn't active for the entire duration of playback. If there's a lot of compute/CPU work to be done, the most efficient way to do it is in bursts and not by spreading the workload across time. Idle time after a burst allows power management to kick in. If you're worried that the GPU will lose its power budget to the CPU and thus work at a lower frequency, you may or may not be right. The algorithm for deciding this is not exposed to the public.
Anyway, this is only the start, not the end, of MT.

@CruNcher: thanks for correcting the typo in the web site link.
Old 8th January 2012, 12:07   #459  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
@Egur

Is it possible for the frame copy to be executed in parallel on more than one core? Or is it a strictly serial process?

Using more cores for the frame copy alone could help us compare the behavior of the whole CPU package in different situations during playback with what we have now with the serial frame copy.
Old 8th January 2012, 12:51   #460  |  Link
egur
QuickSync Decoder author
 
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
Quote:
Originally Posted by NikosD View Post
@Egur

Is it possible for the frame copy to be executed in parallel on more than one core? Or is it a strictly serial process?

Using more cores for the frame copy alone could help us compare the behavior of the whole CPU package in different situations during playback with what we have now with the serial frame copy.
Excellent question!
In the Core 2 Duo days, I did a few benchmarks, as I wrote an application that did a lot of memory copying.
My results were:
* Writing memcpy using SSE2 was 2x faster than the standard library version (VS2005).
* Using 2 threads gave almost a 2x performance boost. Using more than 2 didn't change anything.

The benchmarks were for large buffers (usually > 1M).
So my copy function was ~4x faster than the standard memcpy.

Today, using an SSE2 copy doesn't change all that much; either VS2010 has a better memcpy or the CPU uArch implements the simpler memcpy better. Regarding threads, I need to test this.
I assume that using 2 threads will help. This is next on my list. I'll make a programmable solution that allows scaling beyond 2 threads.
I'll post the results in this thread.

BTW, parallelizing memcpy is super trivial. Probably the easiest task to make parallel.
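
To show why it is trivial, here is a minimal sketch of the partitioning: each thread gets its own contiguous slice, so no locking is needed. It is an illustration only, with a hypothetical `parallel_copy` helper; in CPython the interpreter lock will most likely hide any speedup, so treat it as a picture of the split rather than a benchmark. A native implementation would do the same split around memcpy.

```python
import threading


def parallel_copy(src, dst, num_threads=2):
    """Copy src into dst, giving each thread one contiguous, disjoint slice."""
    if len(src) != len(dst):
        raise ValueError("buffers must have the same length")
    src_view, dst_view = memoryview(src), memoryview(dst)
    # Ceiling division so the last slice picks up any remainder.
    chunk = max(1, -(-len(src) // num_threads))

    def copy_range(start, stop):
        # Slices never overlap, so threads never touch the same bytes.
        dst_view[start:stop] = src_view[start:stop]

    threads = []
    for start in range(0, len(src), chunk):
        stop = min(start + chunk, len(src))
        t = threading.Thread(target=copy_range, args=(start, stop))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```

Usage would look like `parallel_copy(frame, out, num_threads=2)` with `out = bytearray(len(frame))`; raising `num_threads` just makes the slices smaller, which is the "programmable" scaling mentioned above.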

Tags
ffdshow, h264, intel, mpeg2, quicksync, vc1, zoom player

All times are GMT +1.