CUDA coding philosophy

Am I right in saying that in CUDA, to achieve optimal code, we have to experiment with different kinds of optimizations? I am asking because this was much less the case when coding in C, so it was easier to settle on a design before starting to code.

The problem with this trial-and-error approach is that it is very hard to come up with a design before writing any code, and I have struggled because of that.

A few guidelines would be:

  1. Does my design expose enough parallelism to exploit the GPU?
  2. Is coalesced memory access possible?
  3. Is there scope for staging data in shared memory?
  4. Can constant memory or texture memory be exploited?
  5. What is the right block size?

Out of the five, only the last one requires some tuning; for the others, the parallel design itself will provide the answer. The sketches below illustrate what a few of these points look like in practice.
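
For points 2 and 3, the classic illustration is a matrix transpose: a naive transpose has uncoalesced writes, but staging a tile in shared memory makes both the read and the write coalesced. This is a minimal sketch; the kernel name `transposeTiled` and the tile size of 32 are my own choices, not anything from the question.

```cuda
// Shared-memory tiled transpose: both the global read and the global write are coalesced.
// TILE = 32 is an assumption chosen to match the warp size.
#define TILE 32

__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    // Swap the block indices so consecutive threads write consecutive addresses
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Launch with a TILE x TILE block, e.g.:
//   dim3 block(TILE, TILE);
//   dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
//   transposeTiled<<<grid, block>>>(d_out, d_in, width, height);
```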
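For point 4, constant memory pays off when every thread in a warp reads the same small, read-only value, such as filter coefficients. A rough sketch, assuming a hypothetical 9-tap 1D filter (`d_coeffs`, `convolve1D` are illustrative names):

```cuda
// Coefficients shared by all threads: a good fit for __constant__ memory,
// which is cached and broadcast when a warp reads the same address.
__constant__ float d_coeffs[9];

__global__ void convolve1D(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 8) {                          // out needs at least n - 8 elements
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += d_coeffs[k] * in[i + k];   // every thread reads the same coefficient
        out[i] = acc;
    }
}

// Host side: copy the coefficients once before launching the kernel.
//   float h_coeffs[9] = { /* ... */ };
//   cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));
```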
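For point 5, the block size is the one parameter that usually needs measurement, but the runtime can suggest a starting point via the occupancy API. A sketch using a trivial SAXPY kernel (the kernel and sizes here are just placeholders):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime for a block size that maximizes occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);

    // ... allocate x and y with cudaMalloc, copy data, then launch:
    //   saxpy<<<gridSize, blockSize>>>(2.0f, d_x, d_y, n);
    return 0;
}
```

The suggested value is only a starting point; occupancy is not the same as performance, so it is still worth benchmarking a few block sizes around it.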