I’d like to get your general opinion on a contradiction I keep running into.
On the one hand, CUDA is presented as a high-level C language that supports loops, structures, and branches, so the documentation makes it look really easy to write software for the GPU.
However, if you look at the reduction example doc, especially at the last optimization step, the one that actually delivers code faster than optimized CPU code, you realise that:
- you shouldn’t really use loops
- you shouldn’t really use branching
- you shouldn’t really use complex data structures, because memory accesses will then be non-coalesced
- you should always operate on data aligned to at least 16 bytes, or even to powers of 2, so processing general arrays (say, 17x35) is a nightmare
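To make the coalescing point concrete, here is a minimal sketch (the struct names and kernels are hypothetical, just for illustration): an array-of-structs layout with an awkward element size forces each warp into extra memory transactions, while the equivalent struct-of-arrays layout lets consecutive threads read consecutive words.

```cuda
// Hypothetical illustration of coalesced vs. non-coalesced access.

// 12-byte element: thread i reads at a 12-byte stride, so a warp's
// loads do not line up with the aligned segments the hardware fetches.
struct Vec3 { float x, y, z; };

__global__ void aos_read(const Vec3 *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i].x;   // non-coalesced in the general case
}

// Struct-of-arrays: consecutive threads touch consecutive 4-byte
// words, so the warp's loads coalesce into few transactions.
__global__ void soa_read(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];      // coalesced
}
```

The kernels compute the same thing; only the data layout differs, and that layout decision, not the C-level logic, is what decides the performance.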
Optimized CUDA code (like the “reduction” sample, step 7) looks more like a specialized hardware configuration script (closer to FPGA programming) than like real C.
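For reference, this is roughly what I mean, paraphrased from the final step of the reduction sample: the inner loop and the runtime branches are gone, replaced by a hand-unrolled tree whose shape mirrors the warp width, with the `blockSize` branches resolved at compile time via the template parameter.

```cuda
// Paraphrase of the fully unrolled warp reduction (reduction sample,
// step 7). "volatile" keeps the compiler from caching shared-memory
// values in registers between the implicitly warp-synchronous steps.
template <unsigned int blockSize>
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
}
```

Nothing here reads like the loop a C programmer would naturally write; the structure of the code is dictated entirely by the hardware's warp size and scheduling.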
The questions are:
- why support all these loop and branching constructs in the language, if using them is prohibitive for performance?
- so far the GPU wins in terms of hardware cost relative to performance. But what about the cost in man-hours of writing really fast CUDA code, which is “hard to get right” (I am citing the reduction doc), compared to the cost of getting fast code from, say, the Intel MKL library or the optimizing Intel C++ compiler on the CPU? Can you share your experience here?