6th August 2010, 15:36   #60
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by neuron2
It's not difficult. I wrote an NV12 to RGB24 conversion (with configurable 601/709 coefficients) plus host transfer in two days. And that was starting with very little knowledge of CUDA. The code is so simple that I'm embarrassed that it took me that long (although a lot of that time was working out the correct YUV->RGB equations and optimizing the implementation).

So speak for yourself!
That really is an extremely simple example. And the problem fits CUDA perfectly, because the RGB color value of a pixel depends only on the YUV color value of the same pixel and not on any other pixels. That's as "local" and "parallelizable" as it can get. I guess you didn't even use shared memory for that. Unfortunately, there aren't many problems like that in the real world.
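
To show what "depends only on the same pixel" means in practice, here is a rough plain-C sketch of such a conversion. This is not neuron2's code. The function names are made up for illustration, the coefficients are the commonly quoted BT.601 limited-range ones, and the row pitch is assumed to equal the width. The point is that the body of the inner loop never looks at any other output pixel, so it could be handed to one GPU thread per pixel as-is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to 0..255 and round to nearest. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* One pixel: BT.601 limited-range YCbCr -> RGB (assumed coefficients). */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    float c = 1.164f * ((float)y - 16.0f);
    float d = (float)u - 128.0f;
    float e = (float)v - 128.0f;
    rgb[0] = clamp_u8(c + 1.596f * e);
    rgb[1] = clamp_u8(c - 0.392f * d - 0.813f * e);
    rgb[2] = clamp_u8(c + 2.017f * d);
}

/* NV12: full-resolution Y plane, then one interleaved U,V pair per
 * 2x2 block of pixels. Pitch is assumed to equal width. */
void nv12_to_rgb24(const uint8_t *y_plane, const uint8_t *uv_plane,
                   uint8_t *rgb, int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            /* Everything below is independent of all other (row, col). */
            size_t pix = (size_t)row * width + col;
            const uint8_t *uv = uv_plane + (size_t)(row / 2) * width
                                         + (size_t)(col / 2) * 2;
            yuv_to_rgb(y_plane[pix], uv[0], uv[1], rgb + pix * 3);
        }
    }
}
```

No iteration reads a result written by another iteration, so there is nothing to synchronize and no intermediate data to keep around.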

As soon as you do something slightly more complex and try to make it run "fast" on CUDA, things get much uglier, especially if you need to keep intermediate data in "shared" memory and shared memory is too small. The "memory access pattern" issues are also very complex: you need to take care which threads (of a block) run in the same warp and which memory addresses (banks) they access.
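
To give an idea of the bank business, here is a small plain-C model. It is only a sketch under assumptions: banks are taken to be 32 bits wide with consecutive words mapped to consecutive banks, the number of banks and of threads accessing together are parameters (16 and 32 are the usual values, but check the documentation for your hardware), and any broadcast special cases are ignored. `bank_conflict_degree` is a hypothetical helper, not part of any CUDA API.

```c
#include <assert.h>

/* Thread t reads the 32-bit word at index t * stride_words.
 * Returns the largest number of threads that land on the same bank:
 * 1 means conflict-free, k means the access is serialized k ways
 * (in this simplified model). */
int bank_conflict_degree(int num_threads, int num_banks, int stride_words)
{
    int hits[64] = {0};
    int worst = 0;
    assert(num_banks > 0 && num_banks <= 64 && stride_words > 0);
    for (int t = 0; t < num_threads; ++t) {
        int bank = (t * stride_words) % num_banks;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

With 32 threads and 32 banks, a stride of 1 gives 1 (no conflict), a stride of 2 gives 2, and a stride of 32 gives 32, i.e. every thread hits the same bank. A stride of 33 is back to 1, which is why padding an array by one word is such a common trick.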

Again I want to point to this example:
http://developer.download.nvidia.com.../reduction.pdf

(And remember, all they implement is a simple vector reduction! In the end they have a bunch of code that really isn't trivial to understand, while in plain C this would be ~3 lines of code ^^)
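
For comparison, a sketch of what I mean. `sum_serial` is the plain-C reduction. `sum_tree` is a serial simulation of the tree-shaped scheme that parallel reductions generally use. It is my own simplified illustration, not the code from the linked document, and it assumes the length is a power of two. Both function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* The serial version: the "~3 lines" of plain C. */
long sum_serial(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Tree-shaped version, simulated serially. n must be a power of two
 * and buf is overwritten. On a GPU, each inner-loop iteration would
 * be one thread, with a barrier between passes of the outer loop. */
long sum_tree(long *buf, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t half = n / 2; half > 0; half /= 2)
        for (size_t i = 0; i < half; ++i)
            buf[i] += buf[i + half];
    return buf[0];
}
```

Even this simulation leaves out everything that makes the real thing hard: splitting the input across blocks, staging it in shared memory, avoiding bank conflicts and idle threads, and combining the per-block results at the end.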
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 6th August 2010 at 15:42.