The key question is what happens inside the loops. If each iteration of the inner loop is completely independent of all the others, then just launch one thread per iteration and you’re done. (Careful optimization aside.)
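As a minimal sketch of the one-thread-per-iteration idea (the kernel and variable names here are made up; I’m assuming an inner loop body of the form `out[i] = k * in[i]`):

```cuda
// Hypothetical example: each thread handles one iteration of the inner loop.
__global__ void scale_kernel(const float *in, float *out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)               // guard: the grid may be larger than n
        out[i] = k * in[i];  // one independent "inner loop iteration"
}

// Host side: round the grid size up so every element gets a thread.
// scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
```

The bounds check matters because the grid size is rounded up to a multiple of the block size, so some threads at the end have nothing to do.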
I put this up in the Linux page because I’m using an 8800 GPU alongside my quad-core running XFCE Ubuntu Linux. Please feel free to move it if this forum is not appropriate: I’m still not 100% familiar with the goings-on round here…
It’s interesting that you mentioned the data in the inner loop needing to be independent. Thanks for articulating that. I understand from browsing through the manual that the data types, and what the data can access, are quite restricted. Something I’m going to have to look at, because my inner loop is calling functions (relatively simple ones).
I presume it’s very difficult (if not impossible) to reference C++ objects from inside the inner loop?
The C++ support in nvcc is very incomplete and accessing objects on the GPU might not work. (On the other hand, templates and operator overloading do work, and are very valuable.)
Even if it did work, the memory architecture makes parallel object access very slow. You’ll see people mention the term “coalesced memory reads” when talking about CUDA. The memory controller is optimized for large contiguous reads. To get anywhere near the theoretical memory bandwidth, you really need consecutive threads to read consecutive elements in arrays of 32-, 64- or 128-bit data types. This makes arrays of structs (and arrays of objects, if that were supported) inefficient in many cases, whereas a structure of arrays would be better.
I call CUDA kernels from C++ code, and during the setup phase I transform my C++ data structures into a more efficient array layout before moving the data to the GPU.
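To make the layout difference concrete, here’s a hedged C++ sketch of the kind of setup-phase transformation I mean; the struct and field names are invented for illustration. With a struct of arrays, thread i reads `xs[i]`, so consecutive threads touch consecutive addresses:

```cpp
#include <vector>

// Array-of-structs: natural on the CPU, but thread i reading particles[i].x
// makes consecutive threads read addresses sizeof(Particle) bytes apart.
struct Particle { float x, y, z; };

// Struct-of-arrays: thread i reads xs[i], so consecutive threads read
// consecutive 32-bit words -- the pattern the memory controller likes.
struct ParticleArrays {
    std::vector<float> xs, ys, zs;
};

// Hypothetical setup-phase transform, done once on the host
// before copying the three arrays to the GPU.
ParticleArrays to_soa(const std::vector<Particle>& ps) {
    ParticleArrays out;
    out.xs.reserve(ps.size());
    out.ys.reserve(ps.size());
    out.zs.reserve(ps.size());
    for (const Particle& p : ps) {
        out.xs.push_back(p.x);
        out.ys.push_back(p.y);
        out.zs.push_back(p.z);
    }
    return out;
}
```

The transform costs one linear pass on the host, which is usually negligible next to the transfer and kernel time it saves.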
Feel free to use C++ constructs. Almost everything works except anything that uses function pointers or virtual functions (i.e. polymorphism).
Structs/classes do sometimes make coalescing harder, but don’t worry about it for now. Just study the coalescing rules and you’ll be able to tell on your own which situations are problematic. (There are many situations where using structures doesn’t affect performance at all, and I don’t think anyone should just abandon these hugely useful things!)
The inner loops don’t have to be 100% independent. You just have to realize that threads will run simultaneously, and if one thread critically depends on the results of another, it will have a tough time. Sometimes the inner loop is only loosely dependent on previous iterations, like when it keeps adding to the same variable. In that case the addition is not really a problem, because additions can be rearranged: each thread can keep its own partial sum, and the combination can be left for the end.
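Here’s a hedged sketch of that reassociation idea in plain C++ (function names are mine; on the GPU the final combine would typically be a shared-memory tree reduction, but the principle is the same):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential version: every iteration adds into the same accumulator,
// so iteration i "depends" on iteration i-1 -- but only through a sum.
double sum_sequential(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) acc += x;
    return acc;
}

// Reassociated version: each of `nchunks` hypothetical "threads" builds a
// private partial sum; the dependent part shrinks to one tiny final loop.
double sum_chunked(const std::vector<double>& v, std::size_t nchunks) {
    std::vector<double> partial(nchunks, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i)
        partial[i % nchunks] += v[i];  // independent accumulators
    // Final combine: the only part that still needs the other results.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

For exact arithmetic both versions give the same answer; with floats the results can differ in the last bits because the additions happen in a different order, which is usually acceptable.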