3rd August 2010, 11:26   #21
LoRd_MuldeR
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,896
Quote:
Originally Posted by Reimar
Happen to have a reference at hand? I've only used CUDA, but my impression was that you rarely really need completely general pointers, since you have to copy the data to the GPU anyway and transforming it so far that e.g. everything ends up in a single buffer is not really a problem.
See here:
http://developer.download.nvidia.com...tart_Guide.pdf

Search for "Pointer Traversal" and you'll find:
...pointers must be converted to be relative to the buffer base pointer and only refer to data within the buffer itself (no pointers between OpenCL buffers are allowed)
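
To make the quoted restriction concrete, here is a minimal Python sketch (not OpenCL code) of what "relative to the buffer base pointer" means in practice: a linked list whose nodes would normally hold pointers is packed into one flat buffer, and every "next" link becomes a byte offset from the start of that buffer. The names `flatten_list` and `walk`, the node layout, and the `-1` end marker are all made up for illustration.

```python
import struct

# Each node is packed as two signed 32-bit ints: (value, next_offset).
# next_offset is a byte offset from the start of the buffer, or -1 for "null".
NODE = struct.Struct("<ii")
NULL = -1

def flatten_list(values):
    """Pack a singly linked list into one contiguous bytes buffer."""
    buf = bytearray(NODE.size * len(values))
    for i, value in enumerate(values):
        is_last = (i == len(values) - 1)
        next_offset = NULL if is_last else (i + 1) * NODE.size
        NODE.pack_into(buf, i * NODE.size, value, next_offset)
    return bytes(buf)

def walk(buf):
    """Traverse the list using only offsets relative to the buffer base."""
    out = []
    offset = 0 if buf else NULL
    while offset != NULL:
        value, offset = NODE.unpack_from(buf, offset)
        out.append(value)
    return out

flat = flatten_list([10, 20, 30])
print(walk(flat))
```

Because the links are offsets rather than addresses, the buffer stays valid wherever it is copied, which is the property a device-side buffer needs.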
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.


3rd August 2010, 23:24   #22
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Am I the only one who believes that all "minor" x264 development should be postponed and all efforts focused on finding a way to offload at least the motion estimation (ME) to the GPU?

One can dream, can't he?
__________________
AviDemux Windows Builds
4th August 2010, 01:15   #23
royia
Drazick
 
Join Date: May 2003
Location: Israel
Posts: 137
Quote:
Originally Posted by TheImperial2004
Am I the only one who believes that all "minor" x264 development should be postponed and all efforts focused on finding a way to offload at least the motion estimation (ME) to the GPU?

One can dream, can't he?
As a user, I totally agree with you.

I wish someone would deliver a significant improvement in encoding performance.
4th August 2010, 03:08   #24
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
Quote:
Originally Posted by TheImperial2004
Am I the only one who believes that all "minor" x264 development should be postponed and all efforts focused on finding a way to offload at least the motion estimation (ME) to the GPU?

One can dream, can't he?
Do you have any data or analysis to show that doing that would in fact result in a significant improvement in performance, or is this just something you hope is true?
4th August 2010, 05:28   #25
Audionut
Registered User
 
Join Date: Nov 2003
Posts: 1,261
Quote:
Originally Posted by TheImperial2004
all "minor" x264 development should be postponed
More speed is always nice. But personally, x264 is 'fast enough' for my needs.

I would rather gain 20% more quality for the same bitrate than 20% more speed for the same options.
__________________
http://www.7-zip.org/
4th August 2010, 07:24   #26
Reimar
Registered User
 
Join Date: Jun 2005
Posts: 278
Quote:
Originally Posted by LoRd_MuldeR
See here:
http://developer.download.nvidia.com...tart_Guide.pdf

Search for "Pointer Traversal" and you'll find:
...pointers must be converted to be relative to the buffer base pointer and only refer to data within the buffer itself (no pointers between OpenCL buffers are allowed)
Sorry, I meant a pointer to the discussion of why this is expected to be a problem here.
I can only see it mattering if you have a lot of different-sized data sets that you have to copy around continuously. But unless NVIDIA has significantly improved the memory fragmentation issue, even there it might be better to allocate a single "max-sized" buffer and do your own memory management, which eliminates the issue, albeit in a really ugly way.
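
A rough Python sketch of the single "max-sized" buffer idea: reserve one pool up front and hand out aligned offsets into it, instead of creating and freeing many separate device buffers. The `BumpPool` class and its 256-byte alignment are assumptions for illustration and are not part of CUDA or OpenCL; a real version would pass these offsets to kernels together with the one underlying buffer.

```python
class BumpPool:
    """Hands out offsets into one preallocated buffer (hypothetical helper)."""

    def __init__(self, capacity, alignment=256):
        self.capacity = capacity
        self.alignment = alignment
        self.top = 0  # first free byte

    def alloc(self, size):
        # Round the current top up to the next multiple of the alignment.
        start = -(-self.top // self.alignment) * self.alignment
        if start + size > self.capacity:
            raise MemoryError("pool exhausted")
        self.top = start + size
        return start  # an offset into the pool, not a pointer

    def reset(self):
        # Release everything at once, e.g. between frames.
        self.top = 0

pool = BumpPool(capacity=1 << 20)
a = pool.alloc(1000)
b = pool.alloc(5000)
print(a, b)
```

The "ugly" part is visible here: individual allocations cannot be freed, only the whole pool, and the capacity has to be guessed in advance.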
4th August 2010, 11:05   #27
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Do you have any data or analysis to show that doing that would in fact result in a significant improvement in performance, or is this just something you hope is true?
I read somewhere (from Dark Shikari, I believe) that the most intensive phase of encoding is motion estimation. Correct me if I am wrong.

Quote:
I would rather gain 20% more quality for the same bitrate than 20% more speed for the same options.
More quality? I can't even imagine what our encodes would look like... I mean, come on! At a CRF of 18, "ALL" my encodes are transparent to their sources (at least to my eyes).

So I don't think we are in a rush for more quality at this point in development.
__________________
AviDemux Windows Builds

Last edited by TheImperial2004; 4th August 2010 at 11:09.
4th August 2010, 11:08   #28
Dark Shikari
x264 developer
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by TheImperial2004
I read somewhere (from Dark Shikari, I believe) that the most intensive phase of encoding is motion estimation. Correct me if I am wrong.
That doesn't magically mean that it can be made faster on a GPU.
4th August 2010, 11:11   #29
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
That doesn't magically mean that it can be made faster on a GPU.
At least it would be "faster" than it is now, wouldn't it?

EDIT: Wait a sec!!!

Does that mean all CUDA encoders right now are faster just because they produce much lower quality than x264?!

Come to think of it, how can a GPU be 20x faster than a CPU? They advertise that their CUDA encoders are 20x faster than regular CPU ones...!!!

Now I see your point, DS...
__________________
AviDemux Windows Builds

Last edited by TheImperial2004; 4th August 2010 at 11:18.
4th August 2010, 11:21   #30
IppE
Registered User
 
Join Date: Jul 2010
Posts: 11
If x264 used floating-point calculations it would be a lot faster, but I'm pretty sure that's not the case.
4th August 2010, 11:37   #31
Dark Shikari
x264 developer
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by TheImperial2004
At least it would be "faster" than it is now, wouldn't it?

EDIT: Wait a sec!!!

Does that mean all CUDA encoders right now are faster just because they produce much lower quality than x264?!

Come to think of it, how can a GPU be 20x faster than a CPU? They advertise that their CUDA encoders are 20x faster than regular CPU ones...!!!
Bolded the key words for you there.

A Honda Civic is 10 times faster than a go-kart. It's easy to be "20 times faster" when you're comparing yourself to the worst encoders on the market.

I have yet to see a CUDA encoder that's faster than x264. All of them are "fast" because they use incredibly crappy encoding settings -- and if you set x264 to use comparable settings, x264 is just as fast (or even faster).
4th August 2010, 12:30   #32
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Bolded the key words for you there.

A Honda Civic is 10 times faster than a go-kart. It's easy to be "20 times faster" when you're comparing yourself to the worst encoders on the market.

I have yet to see a CUDA encoder that's faster than x264. All of them are "fast" because they use incredibly crappy encoding settings -- and if you set x264 to use comparable settings, x264 is just as fast (or even faster).
Yup! I won't wait for this anymore. Quality comes first, for sure.

Thanks, DS. I'm now released from GPGPU torment.
__________________
AviDemux Windows Builds
4th August 2010, 13:17   #33
aegisofrime
Registered User
 
Join Date: Apr 2009
Posts: 452
Quote:
Originally Posted by Audionut
More speed is always nice. But personally, x264 is 'fast enough' for my needs.

I would rather gain 20% more quality for the same bitrate than 20% more speed for the same options.
Well, like you said, for your needs.

I would say that speed is like money: you can never have enough of it. Nothing is ever fast enough. I would definitely say that x264 is not fast enough for HD content even if you have an i7-980X.

That said, would the AVX instructions in Bulldozer and Sandy Bridge make a huge difference in speed?

Last edited by aegisofrime; 4th August 2010 at 13:21.
4th August 2010, 15:08   #34
Firebird
Registered User
 
Join Date: Mar 2008
Posts: 61
Quote:
Originally Posted by aegisofrime
Well, like you said, for your needs.
I would definitely say that x264 is not fast enough for HD content even if you have an i7-980X.
Wrong. As far as I know, x264 can encode 1080p faster than real time on the ultrafast preset. You know, slow presets are supposed to be slow.
4th August 2010, 15:08   #35
ForceX
Registered User
 
Join Date: Oct 2006
Posts: 150
Quote:
Originally Posted by TheImperial2004
Come to think of it, how can a GPU be 20x faster than a CPU?
GPUs are VERY highly threaded processors which are fantastic for highly parallelized workloads. However, they have traditionally been made for processing only graphics data. The recent changes in GPU programming allow you to do more general-purpose work on them (hence GP-GPU). CPUs are general-purpose processors, and although they can perform a wide variety of work, they are basically a jack of all trades but master of none. For certain tasks that suit them, GPUs can outperform CPUs by several orders of magnitude. Note the "certain tasks", however.

Theoretically, a Radeon 5970 has 9 times the double-precision floating-point throughput of the best Core i7 980X. Some matrix operations can be processed more than 9 times faster on such GPUs. However, if the code isn't suited to the job, you can actually end up losing speed compared to a CPU.

As far as I know, x264 doesn't use FP calculations, and the GPGPU programming landscape is a huge mess right now, so whether a port of x264 to the GPU would actually bring any speed advantage is highly debatable. Then there is the question of optimizing it.
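
As a small illustration of why floating-point throughput may not be the relevant figure, a block-matching motion search can be written entirely with integer arithmetic. The sketch below runs an exhaustive search using the sum of absolute differences (SAD) on tiny made-up frames. It is not x264's actual search algorithm, and the names `sad` and `full_search` are invented for this example.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between two n x n blocks (integers only)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total

def full_search(cur, ref, cx, cy, n, radius):
    """Try every displacement within +/- radius; return (best_sad, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# An 8x8 reference frame with a distinct value at every pixel, and a current
# frame that is the same picture shifted one pixel to the right.
ref = [[(x * 7 + y * 13) % 256 for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
print(full_search(cur, ref, cx=2, cy=2, n=4, radius=2))
```

Every candidate displacement is evaluated independently of the others, which is what makes the search look attractive for a GPU; whether that pays off once transfers and the rest of the encoder are included is what the thread is debating.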
4th August 2010, 17:49   #36
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
GPUs are VERY highly threaded processors which are fantastic for highly parallelized workloads. However, they have traditionally been made for processing only graphics data. The recent changes in GPU programming allow you to do more general-purpose work on them (hence GP-GPU). CPUs are general-purpose processors, and although they can perform a wide variety of work, they are basically a jack of all trades but master of none. For certain tasks that suit them, GPUs can outperform CPUs by several orders of magnitude. Note the "certain tasks", however.

Theoretically, a Radeon 5970 has 9 times the double-precision floating-point throughput of the best Core i7 980X. Some matrix operations can be processed more than 9 times faster on such GPUs. However, if the code isn't suited to the job, you can actually end up losing speed compared to a CPU.

As far as I know, x264 doesn't use FP calculations, and the GPGPU programming landscape is a huge mess right now, so whether a port of x264 to the GPU would actually bring any speed advantage is highly debatable. Then there is the question of optimizing it.
Totally agree. But!

I believe the major issue here is synchronizing the data between two different entities. What if the GPU is just too fast for the CPU to keep up with? Of course we will need the CPU to do some calculations. If the CPU is 9x slower than the GPU, then what's the point? In that case, the GPU will have to wait for the CPU to respond and complete its part of the job, and *only* then will the GPU continue doing its part. Lag is the major issue here. Feel free to correct me, though.
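
The worry above can be put into numbers with Amdahl's law: if only a fraction p of the encode time is offloaded and that part becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). The fractions below are made-up placeholders, not measurements of x264, and the model assumes the CPU and GPU parts run one after the other with no transfer cost.

```python
def overall_speedup(p, s):
    """Amdahl's law: p = fraction of run time accelerated, s = speedup of that part."""
    return 1.0 / ((1.0 - p) + p / s)

# Hypothetical fractions of encode time spent in the offloaded stage.
for p in (0.3, 0.5, 0.7):
    print(f"p={p:.1f}: 9x faster stage -> {overall_speedup(p, 9):.2f}x overall, "
          f"infinitely fast stage -> {1 / (1 - p):.2f}x overall")
```

Even an infinitely fast offloaded stage cannot push the total beyond 1 / (1 - p), so the part left on the CPU sets the ceiling.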
__________________
AviDemux Windows Builds
4th August 2010, 17:56   #37
royia
Drazick
 
Join Date: May 2003
Location: Israel
Posts: 137
A guy comes along and takes on a big challenge, yet you take all the wind out of his sails.
Try to support this.

Worst case, we'll be left with the CPU :-).

Let him explore.
There are so many smart guys here, they might come up with a solution.
4th August 2010, 18:35   #38
mariush
Registered User
 
Join Date: Dec 2008
Posts: 563
Quote:
Originally Posted by TheImperial2004
Totally agree. But!

I believe the major issue here is synchronizing the data between two different entities. What if the GPU is just too fast for the CPU to keep up with? Of course we will need the CPU to do some calculations. If the CPU is 9x slower than the GPU, then what's the point? In that case, the GPU will have to wait for the CPU to respond and complete its part of the job, and *only* then will the GPU continue doing its part. Lag is the major issue here. Feel free to correct me, though.
GPUs are indeed much faster than CPUs now, but the trend is for CPUs to gain more cores, and there is a tendency to integrate the GPU and CPU on a single die, which gives much faster communication paths between them.

As for what you're talking about, I'm not sure that it can be improved a lot if the encoder keeps "thinking" that it is supposed to receive a series of frames and must process them as they come.

Sure, that's needed for real-time encoding and streaming, but in lots of cases the whole content to be encoded is already physically there.

I'm imagining, for example, that if you have a 10 GB video to re-encode and an 8-core processor, you could quickly parse the first 512 MB of the content, split it into 8 smaller chunks, upload them to the video card, and let it do its calculations while the 8-12 CPU threads work on each of those 8 chunks. If the CPU lags behind, just store the computations performed on the GPU somewhere and upload the next 512 MB to the card. When it's all done, do some computations to glue these chunks together.

With almost all cards nowadays having at least 512 MB of memory, and a lot of them 768 MB to 1 GB, that's (I think) plenty of space to fill with data to be crunched through while the CPU glues everything up.

I'm not sure how much disk space or main memory these GPU results would take, or whether dumping them to disk is fast enough for this to beat doing it all on the CPU. Otherwise I don't really see a problem with using a lot of disk space to encode something: the mbtree file already uses about 200 MB to encode 3 GB of content.
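
A rough sketch of the batching arithmetic in the proposal above, using the figures from the post (10 GB of input, 512 MB batches, 8 chunks per batch). The function name `plan_batches` is made up, and the sketch only computes byte ranges; a real splitter would presumably have to cut at frame or GOP boundaries rather than at arbitrary byte positions.

```python
MB = 1024 * 1024
GB = 1024 * MB

def plan_batches(total_bytes, batch_bytes, chunks_per_batch):
    """Yield (batch_index, [(start, end), ...]) byte ranges; end is exclusive."""
    batch_index = 0
    for batch_start in range(0, total_bytes, batch_bytes):
        batch_end = min(batch_start + batch_bytes, total_bytes)
        size = batch_end - batch_start
        # Ceiling division so the chunks cover the whole batch.
        chunk = -(-size // chunks_per_batch)
        ranges = [(s, min(s + chunk, batch_end))
                  for s in range(batch_start, batch_end, chunk)]
        yield batch_index, ranges
        batch_index += 1

batches = list(plan_batches(10 * GB, 512 * MB, 8))
first_chunk = batches[0][1][0]
print(len(batches), "batches;", len(batches[0][1]), "chunks of",
      (first_chunk[1] - first_chunk[0]) // MB, "MB each")
```

With these figures the plan comes to 20 batches of 8 chunks, 64 MB per chunk, so each upload would fit comfortably in the 512 MB of card memory mentioned in the post.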

Last edited by mariush; 4th August 2010 at 18:38.
4th August 2010, 20:02   #39
ForceX
Registered User
 
Join Date: Oct 2006
Posts: 150
Quote:
Originally Posted by TheImperial2004
Totally agree. But!

I believe the major issue here is synchronizing the data between two different entities. What if the GPU is just too fast for the CPU to keep up with? Of course we will need the CPU to do some calculations. If the CPU is 9x slower than the GPU, then what's the point? In that case, the GPU will have to wait for the CPU to respond and complete its part of the job, and *only* then will the GPU continue doing its part. Lag is the major issue here. Feel free to correct me, though.
Of course, if you pair up the latest and greatest GPU with an obsolete, slow CPU, you're going to hit a bottleneck. One assumes you are not going to buy the best of one component and pair it with the lowest-end versions of the others.

One way to avoid hitting a bottleneck would be to program the encoder to perform most of the (decoding and) encoding on the GPU while the CPU is used only to handle I/O and task scheduling. That, I don't see happening anytime soon due to technical constraints. "Running out of work" due to thread waits is already a big problem in the multicore CPU world, and the people involved are investing huge amounts of time to improve caching, branch prediction, etc. It's just that they never tried as hard to coordinate with the GPU. But as CPUs and GPUs are getting "fused" (lol math co-processor redux), using GPUs to accelerate compression is only a matter of time.
4th August 2010, 21:59   #40
TheImperial2004
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
GPUs are indeed much faster than CPUs now, but the trend is for CPUs to gain more cores, and there is a tendency to integrate the GPU and CPU on a single die, which gives much faster communication paths between them.

As for what you're talking about, I'm not sure that it can be improved a lot if the encoder keeps "thinking" that it is supposed to receive a series of frames and must process them as they come.

Sure, that's needed for real-time encoding and streaming, but in lots of cases the whole content to be encoded is already physically there.

I'm imagining, for example, that if you have a 10 GB video to re-encode and an 8-core processor, you could quickly parse the first 512 MB of the content, split it into 8 smaller chunks, upload them to the video card, and let it do its calculations while the 8-12 CPU threads work on each of those 8 chunks. If the CPU lags behind, just store the computations performed on the GPU somewhere and upload the next 512 MB to the card. When it's all done, do some computations to glue these chunks together.

With almost all cards nowadays having at least 512 MB of memory, and a lot of them 768 MB to 1 GB, that's (I think) plenty of space to fill with data to be crunched through while the CPU glues everything up.

I'm not sure how much disk space or main memory these GPU results would take, or whether dumping them to disk is fast enough for this to beat doing it all on the CPU. Otherwise I don't really see a problem with using a lot of disk space to encode something: the mbtree file already uses about 200 MB to encode 3 GB of content.
That seems like a good idea. But!

"just store the computations performed on GPU somewhere"

I don't think there will be any place to store them other than the HDD. And we all know what that might mean: yes, lag. For storing, let's say, a 512 MB segment every 10-30 seconds, I believe the HDD will be the bottleneck here.

Your idea is great, but I can't see it improving encoding speed "magically" in the near future, especially when the HDD is involved.
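
For scale, the average write rate implied by the figures in the post above (512 MB every 10 to 30 seconds) is easy to work out. Whether a particular disk can sustain it has to be checked against that disk's own sequential write speed, which this sketch does not assume.

```python
def required_mb_per_s(segment_mb, interval_s):
    """Average write rate needed to store one segment per interval."""
    return segment_mb / interval_s

for interval in (10, 30):
    print(f"512 MB every {interval} s -> {required_mb_per_s(512, interval):.1f} MB/s")
```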

Quote:
One way to avoid hitting a bottleneck would be to program the encoder to perform most of the (decoding and) encoding on the GPU while the CPU is used only to handle I/O and task scheduling. That, I don't see happening anytime soon due to technical constraints. "Running out of work" due to thread waits is already a big problem in the multicore CPU world, and the people involved are investing huge amounts of time to improve caching, branch prediction, etc. It's just that they never tried as hard to coordinate with the GPU. But as CPUs and GPUs are getting "fused" (lol math co-processor redux), using GPUs to accelerate compression is only a matter of time.
As you said, "running out of work" is already an issue in the multi-core world. So adding a GPU to the same die will still suffer from the same lag. Think of it this way: if there is already lag in CPU <--> CPU operations, what do you expect from CPU <--> CPU <--> GPU?

I'm just wondering: if we are to offload "everything" to the GPU, how can a 600-700 MHz GPU be faster than a 3.0+ GHz CPU? Isn't clock speed all we are looking for?

Corrections are welcome.
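
On the clock-speed question, clock rate is only one factor. A crude upper bound on throughput is clock rate times the number of execution units times the operations each unit completes per cycle, so a chip with a lower clock but far more units can come out ahead on work that parallelizes well. The unit counts below are placeholders picked for illustration, not the specifications of any real CPU or GPU.

```python
def peak_ops_per_second(clock_hz, units, ops_per_unit_per_cycle=1):
    """Crude peak throughput; real code rarely gets close to this bound."""
    return clock_hz * units * ops_per_unit_per_cycle

# Placeholder figures, NOT real hardware specifications.
cpu = peak_ops_per_second(3.0e9, units=4, ops_per_unit_per_cycle=4)
gpu = peak_ops_per_second(0.7e9, units=800)
print(f"CPU peak: {cpu / 1e9:.0f} Gops/s, GPU peak: {gpu / 1e9:.0f} Gops/s, "
      f"ratio: {gpu / cpu:.1f}x")
```

The bound only holds when there is enough independent work to keep every unit busy; serial or branch-heavy code sees little of it, which is the "certain tasks" caveat raised earlier in the thread.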
__________________
AviDemux Windows Builds