View Full Version : x264 + avx support?!?


deadrats
28th January 2011, 23:45
just a quick question for DS and this is not meant to start anything or violate any of this forum's rules, but what gives?

i'm looking through the change log of the latest x264 builds and it says you implemented some avx support? didn't you go on 3 or 4 forums and repeatedly say that avx "was float only, thus a useless pile of tripe"? didn't you also challenge me to code a SAD function using avx? why the change of heart?

while i'm at it, what kind of performance gains are you seeing by using avx, i would guess that in functions that used to rely on 32 bit sse int's converted to 32 bit avx floats there should be a doubling of throughput within that function (though i understand that wouldn't translate to an over all doubling of performance).

lastly, if you don't mind i have a couple of questions about certain decisions you and the other developers have made that i really would like understand.

1) why is it that you don't use sse4's SAD capabilities? wouldn't "mpsadbw" be extremely useful in speeding up x264's ME?

2) looking through the intel developer forums, the claim was made that you guys were approached with the prospect of modifying x264 to make use of quick sync but that the sticking point was your (plural) demand/expectation that you have full control over the encoding process (basically you guys wanted access to the low level api). why is that?

a SAD function is a SAD function is a SAD function, there's only a couple of ways to implement one, why weren't you guys comfortable with a function call to an already implemented, in hardware, function? i'm assuming you guys thought that somehow it would impair the encoding quality but why would you think that?
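
For what it's worth, the "couple of ways to implement one" part is easy to make concrete: the scalar reference SAD that SIMD versions get checked against is only a few lines. A generic sketch (not x264's actual code):

```python
def sad_8x8(cur, ref, stride):
    """Sum of absolute differences between two 8x8 pixel blocks.

    cur, ref: flat sequences of 8-bit pixel values; stride: row pitch
    in pixels.  This is the scalar reference; SIMD versions (e.g. via
    PSADBW) compute the same sum 8-16 pixels at a time.
    """
    total = 0
    for y in range(8):
        row = y * stride
        for x in range(8):
            total += abs(cur[row + x] - ref[row + x])
    return total
```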

lastly, i've been lurking through the cuda developer forums, as well as trying to teach myself cuda, and i'm struck by something i'm hoping you can explain: i found some old posts you made in the nvidia forums where you asked for help coding a gpu powered SAD function. you did in fact get it up and running, but you complained about the performance relative to what a software implemented function could achieve, and though with the help of a couple of guys you did manage to speed up the gpu SAD by a factor of 4x, you still couldn't achieve the throughput that you were getting with a software based SAD.

my question to you is this: why were you benchmarking a single instance of a gpu SAD against a single instance of a cpu SAD? after looking through tons of code, including an open source h263 encoder, and reading through the cuda documentation on running the same function multiple times simultaneously, i have to ask: why didn't you run multiple SAD calculations simultaneously, i.e. call the gpu powered SAD function once for each frame but run hundreds of instances simultaneously, and assign the results to an array that the main thread could read from when it caught up to the point where it needed them?

the memory "issue" isn't really much of an issue, is it? i don't recall off the top of my head but let's assume that gpu ram has to be allocated in 64 kb chunks (just pulling a number out of my head), modern graphics cards routinely have 512 mb to 1024 mb of buffer, even if you allocated 512 kb of ram per calculation you would still be able to, theoretically, run at least 4 instances of the SAD function for every core a card had.

as for the supposed penalty of having to "upload" the data to the video card, again we're talking about a PCI-E bus, you're not going to saturate it with a SAD function (maybe with a happy function).
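
The bandwidth half of that claim is easy to sanity-check with back-of-envelope arithmetic (illustrative figures, not measurements of any particular card):

```python
# Back-of-envelope check on the "upload penalty" claim.
frame_bytes = 1920 * 1080          # one 1080p luma plane, 8-bit
pcie2_x16_bw = 8e9                 # ~8 GB/s usable, PCIe 2.0 x16
frames_per_sec = pcie2_x16_bw / frame_bytes
print(round(frames_per_sec))       # thousands of frames/s: raw
                                   # bandwidth is not the bottleneck
# The cost that arguments like this overlook is per-transfer latency:
# thousands of small round trips per frame would dominate, which is
# why real GPU encoders batch work and keep reference frames resident
# on the card rather than re-uploading per SAD call.
```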

just wondering.

Dark Shikari
28th January 2011, 23:50
just a quick question for DS and this is not meant to start anything or violate any of this forum's rules, but what gives?

i'm looking through the change log of the latest x264 builds and it says you implemented some avx support? didn't you go on 3 or 4 forums and repeatedly say that avx "was float only, thus a useless pile of tripe"? didn't you also challenge me to code a SAD function using avx? why the change of heart?

We're using it for 3-operand SSE support, as I originally said we were going to, as I posted patches for. Absolutely nothing has changed.

while i'm at it, what kind of performance gains are you seeing by using avx, i would guess that in functions that used to rely on 32 bit sse int's converted to 32 bit avx floats there should be a doubling of throughput within that function (though i understand that wouldn't translate to an over all doubling of performance).

We're not using YMM registers or floating-point operations.

1) why is it that you don't use sse4's SAD capabilities? wouldn't "mpsadbw" be extremely useful in speeding up x264's ME?

mpsadbw is useless. It can only be used to calculate a set of adjacent SADs, something that's only useful in an exhaustive search. x264's exhaustive search already uses a successive-elimination algorithm that is much faster than brute-forcing every SAD with mpsadbw.
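
And outside exhaustive mode, x264's default searches (diamond, hex, UMH) evaluate scattered candidate vectors around predictors, which the fixed batch of horizontally adjacent SADs from mpsadbw doesn't map onto at all. A toy small-diamond sketch, purely illustrative and not x264's actual motion estimation:

```python
def diamond_search(cost, start, max_range=16):
    """Tiny small-diamond motion search sketch (not x264's ME).

    cost(mv) returns the block-matching cost at motion vector mv.
    Candidates are the four diamond neighbours of the current best,
    i.e. scattered points, not the adjacent offsets mpsadbw produces.
    """
    best = start
    best_cost = cost(best)
    while True:
        improved = False
        x, y = best
        for cand in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if max(abs(cand[0]), abs(cand[1])) > max_range:
                continue  # stay inside the search window
            c = cost(cand)
            if c < best_cost:
                best, best_cost, improved = cand, c, True
        if not improved:
            return best, best_cost
```

With a convex toy cost like `lambda mv: (mv[0] - 3) ** 2 + (mv[1] + 2) ** 2`, the search walks from (0, 0) to the minimum at (3, -2) in a handful of scattered probes.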

2) looking through the intel developer forums, the claim was made that you guys were approached with the prospect of modifying x264 to make use of quick sync but that the sticking point was your (plural) demand/expectation that you have full control over the encoding process (basically you guys wanted access to the low level api). why is that?

x264 is an encoder. Its job is to encode video. Its job is not to call another, unrelated encoder. If I wrote a version of x264 that called Mainconcept's encoder, it wouldn't be x264, it would be Mainconcept. Putting a Type-R sticker on a cheap Focus does not make it a better car.

Intel wanted to abuse the x264 name (the Type-R sticker) to promote their crappy encoder (the Focus).

a SAD function is a SAD function is a SAD function, there's only a couple of ways to implement one, why weren't you guys comfortable with a function call to an already implemented, in hardware, function? i'm assuming you guys thought that somehow it would impair the encoding quality but why would you think that?

I'd love to have the ability to call one in hardware! Too bad there is no chip on the market -- not even a DSP -- with that ability. The fanciest function I've seen is probably in the vein of a 4x4 matrix multiply or such. No SADs.

(Well, there are some programmable ASICs with that ability. I think the OMAP4 has one. But no real CPU.)

lastly, i've been lurking through the cuda developer forums, as well as trying to teach myself cuda, and i'm struck by something i'm hoping you can explain: i found some old posts you made in the nvidia forums where you asked for help coding a gpu powered SAD function. you did in fact get it up and running, but you complained about the performance relative to what a software implemented function could achieve, and though with the help of a couple of guys you did manage to speed up the gpu SAD by a factor of 4x, you still couldn't achieve the throughput that you were getting with a software based SAD.

my question to you is this: why were you benchmarking a single instance of a gpu SAD against a single instance of a cpu SAD? after looking through tons of code, including an open source h263 encoder, and reading through the cuda documentation on running the same function multiple times simultaneously, i have to ask: why didn't you run multiple SAD calculations simultaneously, i.e. call the gpu powered SAD function once for each frame but run hundreds of instances simultaneously, and assign the results to an array that the main thread could read from when it caught up to the point where it needed them?

Because if you know how fast one SAD is, you can make calculations regarding thousands of SADs.

Of course, as it turned out, that wasn't quite true because of the problem of load coalescing, which I wasn't aware of at the time.
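
For readers unfamiliar with the coalescing problem mentioned here: a warp's loads are served in fixed-size memory segments, so scattered addresses multiply the traffic. A deliberately simplified cost model (the 128-byte segment size is a typical figure for CUDA hardware of that era, not NVIDIA's exact rules):

```python
def memory_transactions(addresses, segment=128):
    """Count memory segments touched by one warp's loads.

    Simplified model of CUDA load coalescing: each distinct
    segment-sized region costs one memory transaction.
    """
    return len({a // segment for a in addresses})

warp = 32
coalesced = [i * 4 for i in range(warp)]    # consecutive 4-byte loads
strided = [i * 256 for i in range(warp)]    # one load per 256-byte stride

print(memory_transactions(coalesced))  # 1 segment: full bandwidth
print(memory_transactions(strided))    # 32 segments: ~32x the traffic
```

This is why a naive "hundreds of SAD instances in parallel" plan can still run slower than expected: if each instance reads pixels at scattered addresses, effective bandwidth collapses.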

outlaw.78
29th January 2011, 01:24
We're using it for 3-operand SSE support, as I originally said we were going to, as I posted patches for. Absolutely nothing has changed.


how about bulldozer's 4-operand FMA4?
will you support this?

Sharktooth
29th January 2011, 02:13
the problem is that FMA operates on float operands, and it's still unknown how fast it is.
so, if it's worth the trouble it will be supported for sure.

Dark Shikari
29th January 2011, 03:45
the problem is that FMA operates on float operands, and it's still unknown how fast it is.
so, if it's worth the trouble it will be supported for sure.

There is a set of XOP integer MAC functions though which might be useful. Bulldozer isn't out yet, so I can't tell.

deadrats
29th January 2011, 03:47
x264 is an encoder. Its job is to encode video. Its job is not to call another, unrelated encoder. If I wrote a version of x264 that called Mainconcept's encoder, it wouldn't be x264, it would be Mainconcept.

I'd love to have the ability to call one in hardware! Too bad there is no chip on the market -- not even a DSP -- with that ability. The fanciest function I've seen is probably in the vein of a 4x4 matrix multiply or such. No SADs.

first of all thanks for taking the time to answer my questions.

now if you don't mind, just some clarification: i wasn't talking about calling MFXVideoEncode_EncodeFrameAsync, obviously then it wouldn't be x264 anymore, it would be the intel encoder, what i was talking about was calling/using some of the following:

MFX_COSTTYPE_SAD
MFX_COSTTYPE_SSD
MFX_COSTTYPE_HADAMARD

MFX_SEARCHTYPE_FULL
MFX_SEARCHTYPE_UMH
MFX_SEARCHTYPE_LOG
MFX_SEARCHTYPE_SQUARE
MFX_SEARCHTYPE_DIAMOND

the mediasdk manual sure makes it sound like it's not an "all or nothing" proposition, my reading of the pdf is that it's like a buffet where you can pick and choose what you want.

and the documentation sure makes it sound like all of the above is hardware accelerated, have you guys looked into perhaps using a function call to one of the above rather than using your own custom coded functions? i would think that if you could free the cpu from performing any of the more intense calculations that would be a good thing.

the documentation says there are function calls for doing quarter pixel, half pixel and full pixel motion vector precision calculations, about the only thing i can't find in the sdk is a function for entropy coding (cabac/cavlc).

Dark Shikari
29th January 2011, 03:51
first of all thanks for taking the time to answer my questions.

now if you don't mind, just some clarification: i wasn't talking about calling MFXVideoEncode_EncodeFrameAsync, obviously then it wouldn't be x264 anymore, it would be the intel encoder, what i was talking about was calling/using some of the following:

MFX_COSTTYPE_SAD
MFX_COSTTYPE_SSD
MFX_COSTTYPE_HADAMARD

MFX_SEARCHTYPE_FULL
MFX_SEARCHTYPE_UMH
MFX_SEARCHTYPE_LOG
MFX_SEARCHTYPE_SQUARE
MFX_SEARCHTYPE_DIAMOND

the mediasdk manual sure makes it sound like it's not an "all or nothing" proposition, my reading of the pdf is that it's like a buffet where you can pick and choose what you want.

and the documentation sure makes it sound like all of the above is hardware accelerated, have you guys looked into perhaps using a function call to one of the above rather than using your own custom coded functions? i would think that if you could free the cpu from performing any of the more intense calculations that would be a good thing.

the documentation says there are function calls for doing quarter pixel, half pixel and full pixel motion vector precision calculations, about the only thing i can't find in the sdk is a function for entropy coding (cabac/cavlc).

Those are encoder parameters, like --me in x264 or -fpelcmp in ffmpeg. They're not functions you can call.

Mainconcept's encoder has many of the same parameters. Does that mean I can call Mainconcept from x264 and magically get better results than calling Mainconcept directly? Obviously not.

deadrats
30th January 2011, 00:37
Those are encoder parameters, like --me in x264 or -fpelcmp in ffmpeg. They're not functions you can call.

Mainconcept's encoder has many of the same parameters. Does that mean I can call Mainconcept from x264 and magically get better results than calling Mainconcept directly? Obviously not.

because i wanted to make sure i was right and not just talking b.s., i downloaded another couple of pdf's related to the media sdk and hardware acceleration, and it does seem to me that you can call whatever function you want without calling the intel encoder, but i decided to be 100% fair to you so i went into the intel developer forums and found this FAQ:

http://software.intel.com/en-us/forums/showthread.php?t=68039&o=a&s=lr

How can I use my own SW library for encoding or decoding instead of using the Intel supplied library?

Replace libmfxsw32.dll or libmfxsw64.dll with your own SW library DLL. Use the same naming, and place into your unique install directory.

but in the interest of 100% certainty, i also posed the question to intel developers/engineers in this thread:

http://software.intel.com/en-us/forums/showthread.php?t=80339

you'll notice that i explicitly ask if it's possible to simply call quick sync's hardware accelerated SAD and/or ME functions from within a custom coded encoder or are they simply parameters that can only be passed to intel's encoder.

i have no idea what they will say, they haven't answered yet (as i just posted the question) but you're free to check on the answer yourself at a later time, either they will tell me to stick with killing rats or they will tell me that it is possible.

needless to say, for many reasons, i am hoping they tell me that it is possible.

nm
30th January 2011, 00:53
i decided to be 100% fair to you so i went into the intel developer forums and found this FAQ:

http://software.intel.com/en-us/forums/showthread.php?t=68039&o=a&s=lr

That's about replacing Intel's decoder/encoder with your own software (or whatever) implementation so that it can be used through Media SDK. Nothing there about getting access to private functions. Also note that the FAQ is from 2009 so it doesn't hold QuickSync-related information anyway.

CruNcher
30th January 2011, 02:47
@deadrats
Please finally understand that none of this is for low-level access; it's for ISVs implementing Quick Sync's encoder into their products, like CyberLink and others are currently doing or have already done. Low-level access was the thing Pidnoel fought for, and he most probably lost that fight with the Intel officials, along with his plan to let x264 utilize it; we have to accept Intel's decision here.
It's most probably a very business-driven decision, as the ISVs' software solutions would be hurt by allowing x264 to utilize the same hardware at a low level. Quick Sync's quality (search algorithms, intra/inter prediction, psy, CABAC, High Profile, dynamic GOP) has not yet been properly evaluated; the tests done so far are too consumer-centric, so we have to wait for better evaluations :)
Though if they just ported their software encoder research to the hardware level, we already have that evaluation done over @ MSU:
http://compression.ru/video/codec_comparison/mpeg-4_avc_h264_2007_en.html
Though it doesn't have to match Quick Sync 100%, as the SB hardware implementation was optimized for speed, and it does really well at that according to the few available tests (though we absolutely don't know what was used from the API in those tests).
Having a UMH search type though is cool; Nvidia and Mainconcept so far have only diamond search implemented on the GPU, and Hadamard is something Mainconcept as well as Nvidia still lack :)
Though the hardest part to beat in x264 is still Psy-RD and the excellent AQ ;)

nm
3rd February 2011, 13:39
but in the interest of 100% certainty, i also posed the question to intel developers/engineers in this thread:

http://software.intel.com/en-us/forums/showthread.php?t=80339

you'll notice that i explicitly ask if it's possible to simply call quick sync's hardware accelerated SAD and/or ME functions from within a custom coded encoder or are they simply parameters that can only be passed to intel's encoder.

i have no idea what they will say, they haven't answered yet (as i just posted the question) but you're free to check on the answer yourself at a later time, either they will tell me to stick with killing rats or they will tell me that it is possible.

Killing rats it is:

Those are parameters of the mfxExtCodingOption structure. An application attaches this external buffer to the mfxVideoParam structure to configure additional options for the encoder's initialization. They are not functions that are exposed outside the MediaSDK, and cannot be used with a custom encoder.

deadrats
3rd February 2011, 15:04
@deadrats
Please finally understand that none of this is for low-level access; it's for ISVs implementing Quick Sync's encoder into their products, like CyberLink and others are currently doing or have already done. Low-level access was the thing Pidnoel fought for, and he most probably lost that fight with the Intel officials, along with his plan to let x264 utilize it; we have to accept Intel's decision here.
It's most probably a very business-driven decision, as the ISVs' software solutions would be hurt by allowing x264 to utilize the same hardware at a low level. Quick Sync's quality (search algorithms, intra/inter prediction, psy, CABAC, High Profile, dynamic GOP) has not yet been properly evaluated; the tests done so far are too consumer-centric, so we have to wait for better evaluations :)
Though if they just ported their software encoder research to the hardware level, we already have that evaluation done over @ MSU:
http://compression.ru/video/codec_comparison/mpeg-4_avc_h264_2007_en.html
Though it doesn't have to match Quick Sync 100%, as the SB hardware implementation was optimized for speed, and it does really well at that according to the few available tests (though we absolutely don't know what was used from the API in those tests).
Having a UMH search type though is cool; Nvidia and Mainconcept so far have only diamond search implemented on the GPU, and Hadamard is something Mainconcept as well as Nvidia still lack :)
Though the hardest part to beat in x264 is still Psy-RD and the excellent AQ ;)

well, one of the guys over in the intel developer forums did confirm for me that there are functions that can only be used by the sdk encoder.

my own experience with the software sdk encoder is that it's quite good, quality wise, and if run on intel hardware (so that the simd optimizations are enabled) it runs like stink on a monkey.

what should really make quick sync stand out, once programmers have had a chance to code some apps that fully exploit all the features, is that we should be able to turn up the quality to max and not suffer a performance penalty, since everything is hardware accelerated.

deadrats
3rd February 2011, 15:07
Killing rats it is:

considering that i just got laid off, i won't be doing that either.

CruNcher
3rd February 2011, 16:00
well, one of the guys over in the intel developer forums did confirm for me that there are functions that can only be used by the sdk encoder.

my own experience with the software sdk encoder is that it's quite good, quality wise, and if run on intel hardware (so that the simd optimizations are enabled) it runs like stink on a monkey.

what should really make quick sync stand out, once programmers have had a chance to code some apps that fully exploit all the features, is that we should be able to turn up the quality to max and not suffer a performance penalty, since everything is hardware accelerated.

I will evaluate that shortly; if my board is on its way today, maybe over the weekend already :)
Though I wouldn't expect the same visual quality as x264 yet, especially with AQ and Psy-RD; though if it's balanced out, it could be really interesting compared to Nvidia's and ATI's consumer GPU encoders, and also Mainconcept's and Elemental's pro encoders.
The most interesting part is the multitasking aspect that you can see in action here: http://www.youtube.com/watch?v=vHpz04qPX-U. Having such capabilities and still being able to use the PC in a normal state is really where it seems to shine :)
If it can deliver that at sane power consumption and balanced-out quality, it would be really nice :)
Over the long run it's clear that such hardware encoders will overtake; it's just a matter of time until they become mature enough, just like the decoders have almost fully taken over already (true, decoding is simpler than encoding), considering the research time compared to x264 and how much room for improvement is left before the move to H.265.
On the other side, hardware encoders in the consumer space are something I don't really like to see, because they're just a workaround for a much bigger problem, and that's interoperability in the video codec space. H.264 couldn't end the transcoding dilemma, and some codec generations will still pass until we reach full interoperability between devices without ever needing to transcode again; that time will come, just not these days, so consumers need solutions like Quick Sync now :)
Though even mobile devices at the lowest end are already powerful enough these days to play 720p flawlessly, and this year we finally reach 1080p (performance has yet to be evaluated), so in that sense the need for transcoding is already very slowly vanishing :)
Tegra 2 though was a big disappointment in that area, failing with x264's weighted prediction :(
Sony's NGP will again set the bar for mobile H.264 decoding, like the PSP did for full-SD decoding; the OMAP4 IVA-HD also has the chance to do so, so it will be an interesting year :)

deadrats
3rd February 2011, 18:17
I will evaluate that shortly; if my board is on its way today, maybe over the weekend already

haven't you heard? all 6 series boards, P67/H67 have been recalled and pulled from the market due to faulty 6gbs sata controllers, you're not getting squat until about april.

as for power consumption, it's a 95 watt cpu, that's the most it can consume.

as for quality falling short of x264, i think you may be overestimating the importance of AQ and Psy-RD as the bit rate is increased. the real benefit is that you should be able to crank up the quality settings to max and not see much of a performance slowdown; my experience with main concept's cuda encoder is that the performance difference between "fastest" and "best" is negligible (with a gts 250), and the cuda encoder slows down a lot less when you increase the bit rate.

software encoders on the other hand, such as x264 and main concept's, see a huge performance drop when going from "fastest" to "best" or when cranking up the bit rate.

kypec
3rd February 2011, 18:53
haven't you heard? all 6 series boards, P67/H67 have been recalled and pulled from the market due to faulty 6gbs sata controllers, you're not getting squat until about april.
Wrong -> Both 6Gbs SATA ports are fine in those chipsets. Only 3Gbs SATA ports (4 in total) are affected by the silicon bug.

nm
3rd February 2011, 19:04
as for quality falling short of x264, i think you may be overestimating the importance of AQ and Psy-RD as the bit rate is increased.

I think you are underestimating their importance. I have yet to see a high-bitrate CUDA encode with acceptable quality compared to the source.

software encoders on the other hand, such as x264 and main concept's, see a huge performance drop when going from "fastest" to "best" or when cranking up the bit rate.

That's because they have a much wider scale of algorithms and tunings implemented.

LoRd_MuldeR
3rd February 2011, 19:04
Wrong -> Both 6Gbs SATA ports are fine in those chipsets. Only 3Gbs SATA ports (4 in total) are affected by the silicon bug.

Still, Intel has recalled all the series 6 chipsets that have been delivered so far, even though "only" the 3 GBit/s SATA ports are defective.

As far as I know, most (if not all) shops have stopped selling the affected boards immediately and it will take about 6 weeks for the "fixed" boards to become available.

If you already have one of the affected boards, it will be up to you whether you care about the problem and return the board to the manufacturer or not...

deadrats
3rd February 2011, 21:48
I think you are underestimating their importance. I have yet to see a high-bitrate CUDA encode with acceptable quality compared to the source.

that really depends on what your definition of "high bit rate" is; if you're of the school of thought that 10 mb/s is more than enough for 1080p, your source is an already compressed blu-ray that was 15-20 mb/s at 1080p, AND you're using an app that uses the reference cuda encoder (such as media coder), then yes, you will notice deficiencies in the cuda encode versus the source.

if however you use a good quality cuda encoder, like the one elemental developed for adobe, or the one main concept developed that's found in magix's and roxio's products, and you use a sane bit rate, perhaps 8-12 mb/s for 720p encodes, and you use a good high quality clear source, then you will be more than satisfied with the results.

That's because they have a much wider scale of algorithms and tunings implemented.

it's more than that, the algorithm argument falls flat because in other types of scenarios, such as seti or folding, data encryption/decryption or dna sequencing, where the algorithms are identical, gpu's still shine, are in some cases many orders of magnitude faster than software based solutions, and don't suffer the same slow downs as the workload is increased.

same holds true for 3d rendering, ray tracing, web browsing.

want to know the biggest thing holding back gpu powered encoders? it's that programmers don't know how to write gpgpu code properly...yet.

when a new student enters a comp sci program at any college, they are taught C/C++, object oriented programming, computer architecture and assembler, compiler design and so on, but it all focuses, whether the student realizes it or not, on the x86 instruction set (some schools used to have a few electives in risc programming principles).

gpgpu programming classes aren't offered until the graduate level, in other words the vast majority of comp sci majors will never have a single programming class on general purpose coding on the gpu.

until we start seeing entry level gpgpu classes, for instance available within the first 2 years of a comp sci degree program, most gpu powered apps aimed at the general consumer market will suffer.

as it stands now, if you walk into a barnes and noble there are literally hundreds of books available on c, c++, visual basic, c#, java, ruby, perl, and every single one deals with programming for the x86 architecture; in contrast i could only find one book, and that had to be specially ordered, on cuda programming.

don't blame the hardware, if a graphics card can render realistic 3d scenes, with millions of polygons per second, at speeds greater than 150 fps, then it can be programmed to encode video at very high quality.

check out this gpu powered mpeg-2 encoder to see what a gpu powered encoder is capable of doing:

http://www.gputech.com/gpeg2/

you should note that it uses dx9 for acceleration, not cuda/open cl/dx compute, it's a vfw codec, and if you do a test encode and analyse the stream you will find that it's based on ffmpeg (which means they are violating the gpl by not releasing the source back to the community).

but it does serve as a perfect example of what a gpu powered encoder is capable of.

mariush
3rd February 2011, 22:40
it's more than that, the algorithm argument falls flat because in other types of scenarios, such as seti or folding, data encryption/decryption or dna sequencing, where the algorithms are identical, gpu's still shine, are in some cases many orders of magnitude faster than software based solutions, and don't suffer the same slow downs as the workload is increased.


Can you please stop it already? You keep bringing this up over and over again and people have explained this to you several times.
Encoding is inherently a SERIAL process with SOME parts that can be done in parallel. Seti and folding are all parallel stuff, relatively easy to implement on a GPU.
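
That serial-bottleneck point can be put in Amdahl's-law terms: the serial portion caps the total gain no matter how fast the parallel portion gets. A quick sketch with illustrative numbers (the fractions are made up for the example, not measured encoder profiles):

```python
def overall_speedup(parallel_fraction, parallel_speedup):
    """Amdahl's law: total speedup when only part of the work
    benefits from acceleration."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / parallel_speedup)

# Even a 100x GPU speedup on the parallelizable part:
print(overall_speedup(0.5, 100.0))   # ~2x if half the work is serial
print(overall_speedup(0.9, 100.0))   # ~9x, still nowhere near 100x
```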


want to know the biggest thing holding back gpu powered encoders? it's that programmers don't know how to write gpgpu code properly...yet.


I'd say some smart people need to actually invent some algorithms that can take advantage of gpu encoding - very few people have experience in programming with 200-1000 cores.
Lots of programmers can use the IDEs and libraries freely available, but that doesn't mean they can produce efficient and fast code.

LoRd_MuldeR
4th February 2011, 14:53
2 things come to mind:

1) this is my thread, if i don't care if it goes "off topic" (as conversations normally evolve over their courses) then why should anyone else?

Because at this forum we have rules that were made to keep the discussion worthwhile for everybody. And there's nothing more frustrating than finding a thread that appears to discuss exactly the topic you were after, only to realize that the discussion has gone completely off-topic. Therefore we have the rule that you must give your thread a title that precisely describes the content of your posts and that you must stick to the topic of a thread. Consequently, off-topic posts will be moved to a separate thread (if worth the effort) or simply be deleted. Open a new thread if you want to discuss a different topic!

So back to topic please :)