Doom9's Forum - View Single Post

LoRd_MuldeR · 6th August 2010, 15:36

Quote:

Originally Posted by neuron2

It's not difficult. I wrote an NV12 to RGB24 conversion (with configurable 601/709 coefficients) plus host transfer in two days. And that was starting with very little knowledge of CUDA. The code is so simple that I'm embarrassed that it took me that long (although a lot of that time was working out the correct YUV->RGB equations and optimizing the implementation).

So speak for yourself!

That really is an extremely simplistic example. And the problem fits CUDA perfectly, as the RGB color value of a pixel depends only on the YUV color value of the same pixel; it does not depend on any other pixels. That's as "local" and "parallelizable" as it can be. I guess you didn't even use shared memory for that. Unfortunately, there aren't many problems like that in the real world.
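To make the "each pixel is independent" point concrete, here is a rough plain-C sketch of such a conversion. This is not neuron2's code. The function names are made up for illustration, the coefficients are the commonly quoted limited-range BT.601 ones (an assumption), and it assumes one U and one V sample per pixel rather than NV12's subsampled chroma, just to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to 0..255 and round to nearest. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert one pixel. Coefficients: limited-range BT.601 (assumed). */
static void yuv_to_rgb_pixel(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    float c = 1.164f * ((float)y - 16.0f);
    float d = (float)u - 128.0f;
    float e = (float)v - 128.0f;
    rgb[0] = clamp_u8(c + 1.596f * e);               /* R */
    rgb[1] = clamp_u8(c - 0.392f * d - 0.813f * e);  /* G */
    rgb[2] = clamp_u8(c + 2.017f * d);               /* B */
}

/* Iteration i reads only input pixel i and writes only output pixel i,
 * so every iteration could run as its own GPU thread with no
 * communication between threads. */
void yuv_to_rgb(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                uint8_t *rgb, size_t n)
{
    assert(y && u && v && rgb);
    for (size_t i = 0; i < n; ++i)
        yuv_to_rgb_pixel(y[i], u[i], v[i], rgb + 3 * i);
}
```

Since no iteration touches another iteration's data, there is nothing to stage in shared memory and nothing to synchronize.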

As soon as you do something slightly more complex and try to do it in a way that runs "fast" on CUDA, things get much uglier. Especially if you need to store intermediate data in shared memory, but shared memory is too small. Also, all the "memory access pattern" issues are very complex: you need to take care of which threads (of a block) run in the same warp and which memory addresses (banks) they access.
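To give a feeling for the bank issue, here is a small plain-C model, not CUDA code. It assumes a simplified scheme where consecutive 32-bit words of shared memory map to consecutive banks and all threads of a 32-thread warp are serviced together; the bank count is a parameter because it differs between GPU generations. The function name is made up for illustration.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Thread t of a warp accesses the 32-bit word at index t * stride.
 * Returns the largest number of threads that land in the same bank,
 * i.e. how many serialized passes the access needs in this simplified
 * model (1 means conflict-free). */
int conflict_degree(int stride, int num_banks)
{
    int hits[WARP_SIZE] = {0};
    int worst = 0;
    assert(stride > 0 && num_banks > 0 && num_banks <= WARP_SIZE);
    for (int t = 0; t < WARP_SIZE; ++t) {
        int bank = (t * stride) % num_banks;  /* word index -> bank */
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

With 32 banks, a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 sends every thread to the same bank. An odd stride such as 33 is conflict-free again, which is why padding arrays by one element is a common trick.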

Again I want to point to this example:
http://developer.download.nvidia.com.../reduction.pdf

(And remember, all they implement is a simple vector reduction! At the end they have a bunch of code that really isn't trivial to understand, while in plain C this would be ~3 lines of code ^^)
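For comparison, a sketch of both shapes in plain C. The first function is the serial "~3 lines" version. The second emulates, sequentially, the tree-shaped pattern a thread block typically uses for a reduction (halving the stride each step). It is only an illustration of the access pattern and not code taken from the linked PDF; the function names are made up, and it assumes the length is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reduction: the "~3 lines" of plain C. */
float sum_serial(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Tree-shaped reduction, emulated on the CPU. Overwrites buf.
 * n must be a power of two (assumption made to keep the sketch short). */
float sum_tree(float *buf, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* On a GPU each tid would be a separate thread, and a barrier
         * would be needed between successive strides. */
        for (size_t tid = 0; tid < stride; ++tid)
            buf[tid] += buf[tid + stride];
    }
    return buf[0];
}
```

Even this emulation leaves out everything that makes the real thing hard: loading from global memory, synchronization, bank conflicts, idle threads and handling sizes that are not a power of two.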