View Full Version : Any independent test of nVidia Elemental Accelerator?
FredThompson
24th September 2009, 04:18
http://newsletters.creativecow.net/sponsors/2009/09-21/index.html
nVidia claims a 500-1100% speed increase using their Elemental Accelerator with a Quadro FX card.
Does anyone have a copy coming who would volunteer to share some independent tests?
Chengbin
24th September 2009, 04:19
Unless you value speed, I don't recommend it.
BTW, a Quadro FX card is ridiculously expensive. A card easily costs a very decent computer.
aegisofrime
24th September 2009, 04:50
Obviously there's no reason why it will not work on a "normal" gaming card, since they use the same chips and all. But for some reason, nVidia will not offer it for the gaming cards. :/
FredThompson
24th September 2009, 04:59
@Chengbin, does your post seem as silly to you as it does to me? The card is for speed and is 1/5 or less the price of the equivalent processing done on PCs.
You've heard of "apples and oranges"? Your statement is comparing a car to an airplane.
@aegisofrime, decoding is probably the same but methinks it's more likely the "gaming" cards are crippled versions of the same core. I've still got a circuit trace repair pen from when the AMD chips were found to be hackable by connecting surface poles.
Dark Shikari
24th September 2009, 06:24
500-1100% faster than what, Quicktime?
Seriously, these numbers are meaningless. Last time I saw similar claims, the encoder was actually slower than x264, but they just picked the slowest encoder out there to compare it with.
Quote: @Chengbin, does your post seem as silly to you as it does to me? The card is for speed and is 1/5 or less the price of the equivalent processing done on PCs.
I already have a CPU. It cost $0, because I already have it. If I buy a Quadro, I'm buying something I didn't already have. Even if it actually were significantly faster than x264 (which I highly doubt), it's outright lying to say that it's "cheaper".
FredThompson
24th September 2009, 09:38
Chengbin totally missed that the primary thesis is speed, clearly shown by the statement, "Unless you value speed, I don't recommend it."
The second thesis is total cost to operate; that point was missed as well: "BTW, a Quadro FX card is ridiculously expensive. A card easily costs a very decent computer."
You appear to have also missed both. Two clicks and you could have read nVidia's methodology.
Claim is ~$700 to have 500% the encoding power of a quad-core system. The same processing power in the equivalent time would require 5 additional quad-core systems at $140 each, far below their actual cost. Even so, the power, complexity and physical space required would be greater, all of which add to TCO.
You should choose your words better and read for comprehension. This thread is about nVidia's claim and their products, not your personal situation. I most certainly did not lie but you most certainly did slander me. Keep your straw dog to yourself.
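The cost arithmetic in this post can be written out explicitly. The sketch below only restates the figures given above (a card at roughly $700, said to stand in for 5 quad-core systems); the function name is made up for illustration, and the result is only as good as the unverified speed claim it rests on.

```python
def break_even_system_price(card_price, systems_replaced):
    """Price per CPU system at which the card and the systems cost the same."""
    if systems_replaced <= 0:
        raise ValueError("systems_replaced must be positive")
    return card_price / systems_replaced

CARD_PRICE = 700.0   # approximate card price quoted in the post
SYSTEMS = 5          # quad-core systems the post says would be needed instead

# Any quad-core system costing more than this makes the card the cheaper option,
# provided the claimed speedup is real.
print(break_even_system_price(CARD_PRICE, SYSTEMS))  # 140.0
```

If the real speedup turns out smaller, fewer systems are replaced and the break-even price rises accordingly.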
Dark Shikari
24th September 2009, 09:43
Quote: Claim is ~$700 to have 500% the encoding power of a quad-core system. The same processing power in the equivalent time would require 5 additional quad-core systems at $140 each, far below their actual cost. Even so, the power, complexity and physical space required would be greater, all of which add to TCO.
Quote: You should choose your words better and read for comprehension. This thread is about nVidia's claim and their products, not your personal situation.
The entire claim lies on the (likely outright bull) "fact" that the card is "500% faster" than a quad core CPU. Every single properly performed test done so far has shown that GPU encoders are barely--at best--competitive with CPU encoders, so it is highly doubtful that the tables have turned in one or two months.
I suggest you stop believing every word that comes out of corporate marketing departments.
Quote: you most certainly did slander me. Keep your straw dog to yourself.
And now you start insulting other forum members and accusing them of "slander" for doubting the almighty word of nVidia.
This is ridiculous--my claim that nVidia is lying about it being "cheaper" is slander? I mean, even if my claim was false, which it isn't, you should probably look up the definition of the word "slander" before you use it incorrectly.
Furthermore, I don't see any reason you should be standing up for nVidia here and trying to protect them from my "slander" of them. If they want to accuse me of "slander", let them do it themselves.
G_M_C
24th September 2009, 09:49
What about this marketing done by Ati on their launch yesterday ?
Perhaps the most interesting instruction added however is an instruction for Sum of Absolute Differences (SAD). SAD is an instruction of great importance in video encoding and computer vision due to its use in motion estimation, and on the RV770 the lack of a native instruction requires emulating it in no less than 12 instructions. By adding a native SAD instruction, the time to compute a SAD has been reduced to a single clock cycle, and AMD believes that it will result in a significant (>2x) speedup in video encoding.
Would that be beneficial to use through OpenCL / DirectCompute ?
Or the ability to do 1 clock FFT transformations ?
Dark Shikari
24th September 2009, 09:51
Quote: What about this marketing done by Ati on their launch yesterday ?
Quote: Would that be beneficial to use through OpenCL / DirectCompute ?
Quote: Or the ability to do 1 clock FFT transformations ?
That looks quite nice. CUDA has had __usad() for quite some time, but I don't know if that actually translates to a single instruction or not, since nVidia works extremely hard to prevent us from looking at the actual assembly code being sent to the GPU.
The main problem with a fast SAD is that it makes the other parts of the motion estimation process a much larger bottleneck than before.
(This definitely makes motion search on the GPU a bigger option for x264, but there is still the problem that the analysis we've done on the process suggests that it will be very difficult to fully parallelize the process. It's only easy to parallelize if we do it on the source frames instead of the reconstructed frames... which vastly reduces compression.)
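For readers unfamiliar with the terms, here is a minimal sketch of a SAD-based exhaustive motion search. It is not x264's code; the function names and the tiny synthetic frames are made up for illustration. Each candidate SAD is independent of the others, which is the part that maps well to a GPU. The dependency discussed above comes from what is passed in as `ref`: in a real encoder it is the reconstructed previous frame, which only exists once that frame has been encoded.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]

def full_search(cur, ref, top, left, size, search_range):
    """Exhaustive search; returns (best_sad, (dy, dx)) for one block of cur."""
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, get_block(ref, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best

# Synthetic 8x8 reference frame with a distinct value per pixel.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
# Current frame: the reference content moved up one row and right one column.
cur = [[ref[min(y + 1, 7)][max(x - 1, 0)] for x in range(8)] for y in range(8)]

print(full_search(cur, ref, top=3, left=3, size=2, search_range=2))  # (0, (1, -1))
```

Passing the previous source frame as `ref` instead would remove the wait on reconstruction, at the compression cost described above.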
FredThompson
24th September 2009, 10:36
Quote: The entire claim lies on the (likely outright bull)
Conjecture, not proven.
Quote: "fact" that the card is "500% faster" than a quad core CPU. Every single properly performed test done so far has shown that GPU encoders are barely--at best--competitive with CPU encoders, so it is highly doubtful that the tables have turned in one or two months.
Good. We agree you are projecting conjecture, not demonstrating fact. Read the first 2 sentences of the thread. The first states nVidia's claim, the second asks if anyone will be VERIFYING it. Conjecture is not verification.
Quote: I suggest you stop believing every word that comes out of corporate marketing departments.
Second time you slander me.
Quote: And now you start insulting other forum members
Identifying your slander and presentation of conjecture as fact is not an insult. You may find it uncomfortable, true, but it is neither slander nor an insult. Identifying flaws in Chengbin's comment is neither slander nor insult.
Quote: and accusing them of "slander" for doubting the almighty word of nVidia.
Quote: This is ridiculous--my claim that nVidia is lying about it being "cheaper" is slander?
No, I state you slandered me with your comment to me. You didn't quote or disprove nVidia's claims. You didn't even address their methodology. Your statement, "it's outright lying to say that it's "cheaper"." directly relates to your quotation of my statements.
Quote: I mean, even if my claim was false, which it isn't,
That is unprovable conjecture. You don't have the device and haven't tested it. You are projecting that which you expect to be as real.
Quote: you should probably look up the definition of the word "slander" before you use it incorrectly.
Sorry, you're wrong about that, too. Slander is "a malicious, false, and defamatory statement or report" which is what you did in your comment to the portion of my statements which you quoted.
Quote: Furthermore, I don't see any reason you should be standing up for nVidia here and trying to protect them from my "slander" of them. If they want to accuse me of "slander", let them do it themselves.
Another straw dog. I clearly stated you slandered me. I showed you the basic math of how the device would be "cheaper", to use your word, in TCO if nVidia's claim is proven to be true. Their claim is not disproven yet, regardless of your projections and attempts to change my words into something they are not.
I will no longer reply to you regarding this. It is clear you are not thinking clearly now.
G_M_C
24th September 2009, 10:43
Quote: That looks quite nice. CUDA has had __usad() for quite some time, but I don't know if that actually translates to a single instruction or not, since nVidia works extremely hard to prevent us from looking at the actual assembly code being sent to the GPU.
Quote: The main problem with a fast SAD is that it makes the other parts of the motion estimation process a much larger bottleneck than before.
Quote: (This definitely makes motion search on the GPU a bigger option for x264, but there is still the problem that the analysis we've done on the process suggests that it will be very difficult to fully parallelize the process. It's only easy to parallelize if we do it on the source frames instead of the reconstructed frames... which vastly reduces compression.)
It did look promising, that's why I posted this. I've got positive feelings about DirectCompute / OpenCL in general, because it makes using the vast power of a GPU more of an option. There will be all kinds of problems or setbacks, I'm sure of it, but the fact that DirectCompute is part of DirectX makes GPGPU a much better idea.
Before this implementation into DX11, GPGPU felt like good old DOS, where you had to write your game for every graphics chipset separately (Tseng Labs, Ati Mach64; you needed to include drivers for too many of them). Unifying GPGPU in DX11 might help to get things going finally; maybe it also forces Nv to be more open about the assembly instructions used/needed :)
And in that respect it might be a nice idea to put some time/research/brainstorming into :)
Dark Shikari
24th September 2009, 10:53
Quote: I will no longer reply to you regarding this.
You probably should have done that before, not after, repeatedly violating the forum rules. It's quite apparent that you are not actually interested in a technical discussion despite my attempts to respond with technically-oriented posts in this thread (http://forum.doom9.org/showthread.php?p=1328478#post1328478).
At this point it is quite clear that you are here to promote nVidia products; you practically admitted it when you took offense after I called nVidia's claims a lie. Only a shill takes personal offense to claims directed against a corporation.
Also, slander is spoken, libel is written (http://en.wikipedia.org/wiki/Slander).
foxyshadis
24th September 2009, 11:00
It's not just the final codec that's accelerated, it's the whole pipeline, including decoding and heavy filtering. However, Premiere has had GPU-assisted rendering for a decade now; that's nothing new, and completely turning it off is very disingenuous. Still, there's nothing to prove or disprove; the entire claim is pure marketing: there are no hard facts, and plenty of room for people to read in their own ideas.
There is one useful chart:
http://www.nvidia.com/docs/IO/62559/chart2.jpg
But how high is the quality compared to Adobe's high quality preset? Nobody knows, that's not included; they'd just like you to believe they're equivalent. Again, it's regular marketing.
Fred, your hypersensitivity is borderline trolling. Calm down before you are struck.
FredThompson
24th September 2009, 13:35
Quote: It's not just the final codec that's accelerated, it's the whole pipeline, including decoding and heavy filtering. However, Premiere has had GPU-assisted rendering for a decade now; that's nothing new, and completely turning it off is very disingenuous. Still, there's nothing to prove or disprove; the entire claim is pure marketing: there are no hard facts, and plenty of room for people to read in their own ideas.
Quote: There is one useful chart:
Quote: http://www.nvidia.com/docs/IO/62559/chart2.jpg
Quote: But how high is the quality compared to Adobe's high quality preset? Nobody knows, that's not included; they'd just like you to believe they're equivalent. Again, it's regular marketing.
If testing shows results such as those at http://digitalcontentproducer.com/affordablehd/newsletter/test_drive_nvidia_quadro_cx_0126/index2.html, it's not "regular marketing", it's much closer to a lie by omission. That article mentions nVidia's baseline is crippled.
Quote: Fred, your hypersensitivity is borderline trolling. Calm down before you are struck.
The topic of this thread is in the first two sentences. You're looking in the wrong direction and at the wrong topic.
FredThompson
24th September 2009, 13:37
Here's another brief mention translated from German which says nVidia was claiming 2-11 fold increase earlier this year: http://www.slashcam.com/news/single/H-264-Encoding-using-CUDA-with-Elemental-Accelerat-7701.html
G_M_C
24th September 2009, 14:59
Quote: Here's another brief mention translated from German which says nVidia was claiming 2-11 fold increase earlier this year: http://www.slashcam.com/news/single/H-264-Encoding-using-CUDA-with-Elemental-Accelerat-7701.html
nVidia was claiming 2-11 fold the speed of a quadcore CPU. This is clearly stated in that article, so you could also have been clearer in your post, couldn't you? Your "2-11 fold increase" suggests more than is there. Increase ... of what exactly? 11 times zip is still zip. Your post actually says nothing.
2-11 fold the speed of a quadcore CPU is clearer ... GPGPU is supposed to be faster than that, isn't it? 2 x the speed of Nehalem; the upcoming Sandy Bridge will probably be 2 x faster than Nehalem too (and it might be 11 times faster than AMD's slowest X4).
Let Nv and Ati get SAD (and other useful commands) into OpenCL first, so that everybody can use it, and not only those who can shell out for some exotic piece of hardware and some equally expensive piece of software that can use it. As long as it remains that much of a niche product, almost nobody will actually care.
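The "increase of what exactly?" question also applies to the percentages themselves. As a small illustration (the helper names are made up here, and which reading nVidia intended is not stated in this thread), the same figure gives different multipliers depending on whether it is read as "N% of the baseline speed" or "N% faster than the baseline":

```python
def fold_if_percent_of(percent):
    """Read 'N%' as 'N% of the baseline speed'."""
    return percent / 100.0

def fold_if_percent_faster(percent):
    """Read 'N%' as 'N% faster than the baseline'."""
    return 1.0 + percent / 100.0

for pct in (500, 1100):
    print(pct, fold_if_percent_of(pct), fold_if_percent_faster(pct))
# 500 5.0 6.0
# 1100 11.0 12.0
```

Under the first reading, the upper figure of 1100% lines up with the "11 fold" in the article. Neither reading says anything until the baseline encoder and its settings are named.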
nm
24th September 2009, 15:24
Quote: nVidia was claiming 2-11 fold the speed of an Intel quadcore. GPGPU is supposed to be faster than that isn't it ?
It depends on the task. Some calculations can be done faster on a GPU, others can't. As DS said, we have yet to see a GPU encoder that would be significantly faster than software encoding at the same level of quality. Also, GPU encoders tend to have no settings for adjusting the speed/quality tradeoff. Some implementations even choose a bitrate for you.
Here's some previous discussion on the NVIDIA CUDA encoder: http://forum.doom9.org/showthread.php?t=148276
Elemental accelerator may use a completely different encoder implementation, so until we see proper tests...
Edit: Information on the encoder parameters can be found in the user guide: http://www.elementaltechnologies.com/sites/default/files/Elemental%20Accelerator%20User%20Guide-en.pdf
Looks like there's AQ, 2-pass VBR and support for interlaced encoding.
Marketing blog: http://www.elementaltechnologies.com/blog/accelerator
CpT
24th September 2009, 18:02
Quote: 500-1100% faster than what, Quicktime?
Quote: Seriously, these numbers are meaningless. Last time I saw similar claims, the encoder was actually slower than x264, but they just picked the slowest encoder out there to compare it with.
This is true; even my overclocked Q9400 @ only 3.2 GHz gets comparable fps to Badaboom on my GTX 280 when encoding 1280x videos while using med-high x264 settings. And med-high settings with x264 utterly blow away every CUDA encoder that's currently available.
I have to almost double the bitrate with Badaboom or MediaCoder to achieve comparable results to x264.
I really, really hope CUDA encoders evolve into something great, but it's not there yet. :cool:
TheResidentEvil
24th September 2009, 19:16
Same. It's a shame really: so much potential for so long, and it's of little to no use.
FredThompson
24th September 2009, 20:07
Quote: nVidia was claiming 2-11 fold the speed of a quadcore CPU. This is clearly stated in that article; So you also could have been more clear in your post couldn't you ? You suggest more than it is with your "2-11 fold increase". Increase ... of what exactly? 11 times zip is still zip. Your post actually says nothing.
Yes, could have been more clear. Was referring to how their press release through Creative Cow claims 500%+ and a few months ago they were claiming 200%+. That's all I was trying to point out. Should have been obvious, apparently was not. Taken by itself, that particular statement is not grounded, no different than a "50% better" sticker on a shelf-display. 50% better than what? Measured how? My favorite example of that is the Mary Kay window stickers claiming it is "America's best selling brand." It's not the highest volume, so best selling how?
Quote: Let Nv and Ati get SAD (and other useful commands) into OpenCL first, so that everybody can use it; And not only those who can shell out for some exotic piece of hardware, and some equally expensive piece of software that can use it. As long as it remains that much of a niche-product, almost nobody will actually care.
Not a lot of downward price pressure with only 2 competitors. You're right, if the hardware could crunch as much as they claim, they'd move a lot more of them with free software. They appear to use $75/hour as their comparison. May make sense in that situation but not in our free labor hobbyist arena.
popper
24th September 2009, 21:29
What about this marketing done by Ati on their launch yesterday ?
"Quote:
Perhaps the most interesting instruction added however is an instruction for Sum of Absolute Differences (SAD). SAD is an instruction of great importance in video encoding and computer vision due to its use in motion estimation, and on the RV770 the lack of a native instruction requires emulating it in no less than 12 instructions.
By adding a native SAD instruction, the time to compute a SAD has been reduced to a single clock cycle, and AMD believes that it will result in a significant (>2x) speedup in video encoding.
"
Would that be beneficial to use through OpenCL / DirectCompute ?
Or the ability to do 1 clock FFT transformations ?
I assume this is the URL you took this from?
http://www.anandtech.com/video/showdoc.aspx?i=3643&p=5
while the other new options also mentioned there
"Last, here is a breakdown of what a single Cypress SP can do in a single clock cycle:
4 32-bit FP MAD per clock
2 64-bit FP MUL or ADD per clock
1 64-bit FP MAD per clock
4 24-bit Int MUL or ADD per clock
SFU : 1 32-bit FP MAD per clock
Moving up the hierarchy, the next thing we have is the SIMD.
Beyond the improvements in the SPs, the L1 texture cache located here has seen an improvement in speed.
It’s now capable of fetching texture data at a blistering 1TB/sec.
The actual size of the L1 texture cache has stayed at 16KB.
Meanwhile a separate L1 cache has been added to the SIMDs for computational work, this one measuring 8KB.
Also improving the computational performance of the SIMDs is the doubling of the local data share attached to each SIMD, which is now 32KB
"
look like they may also help OpenCL-assisted encoding throughput etc. (if someone writes the POC code patches and commits time and effort to a stable patch cycle). It's a shame you need to spend new money on the latest ATI/AMD rv 770 card revisions at the highest price point to gain these abilities, though...
Regarding purely SW for the many non rv770 chips, has anyone actually written any working test SAD OpenCL kernels for AMD/ATI? I couldn't find anything so far.
G_M_C
24th September 2009, 21:35
i assume this is the URL you took this from ?
http://www.anandtech.com/video/showdoc.aspx?i=3643&p=5
while the other new options mentioned there
"Last, here is a breakdown of what a single Cypress SP can do in a single clock cycle:
4 32-bit FP MAD per clock
2 64-bit FP MUL or ADD per clock
1 64-bit FP MAD per clock
4 24-bit Int MUL or ADD per clock
SFU : 1 32-bit FP MAD per clock
Moving up the hierarchy, the next thing we have is the SIMD.
Beyond the improvements in the SPs, the L1 texture cache located here has seen an improvement in speed.
It’s now capable of fetching texture data at a blistering 1TB/sec.
The actual size of the L1 texture cache has stayed at 16KB.
Meanwhile a separate L1 cache has been added to the SIMDs for computational work, this one measuring 8KB.
Also improving the computational performance of the SIMDs is the doubling of the local data share attached to each SIMD, which is now 32KB
"
look like they may also help OpenCL-assisted encoding throughput etc. (if someone writes the POC code patches and commits time and effort to a stable patch cycle). It's a shame you need to spend new money on the latest ATI/AMD rv 770 card revisions at the highest price point to gain these abilities...
Regarding purely SW for non rv770 chips, has anyone actually written any working test SAD OpenCL kernels for AMD/ATI? I couldn't find anything so far.
Yep there are more additions :)
And yes, at this moment in time, only Ati supports these possibilities. BUT :) they are part of DirectX11, and so they will be available on all future cards that support DX11 or up, including Larrabee and Nv's products. So considering these additions is more of an option than it was before. But still, there are probably more hurdles than I can foresee. And DX11 isn't officially out yet .... so the real advantages remain to be seen ;)
PS: The important line in your quote is this one: a single Cypress SP. How many S(tream) P(rocessors) does it have ... 1600 for the 5870, was it not? Those numbers add up to some power!
yuvi
25th September 2009, 05:25
Quote: Regarding purely SW for the many non rv770 chips, has anyone actually written any working test SAD OpenCL kernels for AMD/ATI? I couldn't find anything so far.
I doubt it; the only OpenCL implementation for Radeons is in Snow Leopard, and from what I've heard it has a host of performance issues and other bugs.
Do note that r700 had a native MAX_INT instruction, so unless that's really slow you only need 3 instructions not 12 for a SAD right now since video doesn't need 33 bit intermediates:
SUB_INT diff, a, b
SUB_INT ndiff, 0, diff
MAX_INT sad, diff, ndiff
Of course, the instruction could be 8-bit SIMD, but I find that unlikely.
Also contrary to the article, OpenCL has an abs_diff() intrinsic, but NVIDIA's compiler fails to take advantage of the accumulator in the PTX SAD instruction with that. Not that NVIDIA necessarily has a native SAD instruction...
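The three-instruction sequence in this post can be modeled one line per instruction. This is only a sketch of the arithmetic in Python, not GPU code; the function names are made up, and Python's unbounded integers do not model register width (for 8-bit video samples the difference fits easily in a 32-bit integer, which is the point about not needing 33-bit intermediates). Note that the three instructions give the absolute difference of one pair of samples; summing over a block still takes an add per sample.

```python
def abs_diff_3op(a, b):
    """Absolute difference built from two subtractions and a max."""
    diff = a - b              # SUB_INT diff, a, b
    ndiff = 0 - diff          # SUB_INT ndiff, 0, diff
    return max(diff, ndiff)   # MAX_INT sad, diff, ndiff

def sad_3op(pixels_a, pixels_b):
    """Accumulate the per-sample absolute differences over two sample rows."""
    return sum(abs_diff_3op(a, b) for a, b in zip(pixels_a, pixels_b))

print(sad_3op([10, 200, 0], [12, 50, 255]))  # 407
```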
Sagittaire
25th September 2009, 12:34
Quote: At this point it is quite clear that you are here to promote nVidia products; you practically admitted it when you took offense after I called nVidia's claims a lie. Only a shill takes personal offense to claims directed against a corporation.
lol ... definitively.
You must work for the FBI (XFiles pathology)? Go out a little and talk to real people. Why not with a real girl?
G_M_C
25th September 2009, 13:04
Quote: I doubt it; the only OpenCL implementation for Radeons is in Snow Leopard, and from what I've heard it has a host of performance issues and other bugs.
[...]
A shameless quote from that Anandtech article I quoted from before. This page goes into this more.
(The whole article (http://www.anandtech.com/video/showdoc.aspx?i=3643) is a good read. I rather like reading up on things on Anandtech's site :) )
DirectCompute, OpenCL, and the Future of CAL
As a journalist, GPGPU stuff is one of the more frustrating things to cover. The concept is great, but the execution makes it difficult to accurately cover, exacerbated by the fact that until now AMD and NVIDIA each had separate APIs. OpenCL and DirectCompute will unify things, but software will be slow to arrive.
As it stands, neither AMD nor NVIDIA have a complete OpenCL implementation that's shipping to end-users for Windows or Linux. NVIDIA has OpenCL working on the 8-series and later on Mac OS X Snow Leopard, and AMD has it working under the same OS for the 4800 series, but for obvious reasons we can’t test a 5870 in a Mac. As such it won’t be until later this year that we see either side get OpenCL up and running under Windows. Both NVIDIA and AMD have development versions that they're letting developers play with, and both have submitted implementations to Khronos, so hopefully we’ll have something soon.
It’s also worth noting that OpenCL is based around DirectX 10 hardware, so even after someone finally ships an implementation we’re likely to see a new version in short order. AMD is already talking about OpenCL 1.1, which would add support for the hardware features that they have from DirectX 11, such as append/consume buffers and atomic operations.
DirectCompute is in comparatively better shape. NVIDIA already supports it on their DX10 hardware, and the beta drivers we’re using for the 5870 support it on the 5000 series. The missing link at this point is AMD’s DX10 hardware; even the beta drivers we’re using don’t support it on the 2000, 3000, or 4000 series. From what we hear the final Catalyst 9.10 drivers will deliver this feature.
Going forward, one specific issue for DirectCompute development will be that there are three levels of DirectCompute, derived from DX10 (4.0), DX10.1 (4.1), and DX11 (5.0) hardware. The higher the version the more advanced the features, with DirectCompute 5.0 in particular being a big jump as it’s the first hardware generation designed with DirectCompute in mind. Among other notable differences, it’s the first version to offer double precision floating point support and atomic operations.
AMD is convinced that developers should and will target DirectCompute 5.0 due to its feature set, but we’re not sold on the idea. To say that there’s a “lot” of DX10 hardware out there is a gross understatement, and all of that hardware is capable of supporting at a minimum DirectCompute 4.0. Certainly DirectCompute 5.0 is the better API to use, but the first developers testing the waters may end up starting with DirectCompute 4.0. Releasing something written in DirectCompute 5.0 right now won’t do developers much good at the moment due to the low quantity of hardware out there that can support it.
benwaggoner
26th September 2009, 09:35
If I might try to shed a little more light and less heat...
One thing to note about the Elemental technology is that they're accelerating the whole pipeline: source decode, preprocessing, and encoding. So there's quite a bit more room for performance improvement in a complete encode than in just the codec itself. And generally speaking, source decode and preprocessing are much easier to parallelize than encoding.
Quality-wise, one shouldn't take Badaboom as representative of their current work (I think the current Badaboom is nearly a year old). I've seen some pretty good H.264 and VC-1 come out of their most recent builds. But it can really shine at stuff like doing a large number of multiple live encodes on a single box, for multibitrate distribution.