CUDA SUCKS!!! Why can't <<<blocks, threads>>> be determined automatically?

I do not think the CUDA framework is well designed or mature enough, which makes development cost very high and the platform hard to learn and use.

One example: only with CUDA SDK 6.0 did it start to support Unified Memory. Before that, the programmer still had to spend time and energy deciding how to allocate memory and writing specific code to move variables between CPU and GPU memory. This kind of low-level job should be handled by the CUDA runtime itself, not pushed onto the developer. A mature framework should let the developer focus on program logic, not memory plumbing. I am not sure how many people actually use Unified Memory.

Another issue is even sillier: the developer still has to specify the <<<blocks, threads per block>>> values explicitly, as in the vector-add example VecAdd<<<1, N>>>(A, B, C). You have to state the vector length explicitly, and then CUDA distributes the work across N threads, one mini-add each. WHY can the CUDA runtime NOT determine the vector length by itself??? IS IT SO HARD??? Why does CUDA ask the developer to say how much compute each function needs??? Why can the runtime itself not determine and automatically optimize how many resources (grid/block/thread) to use???
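For what it's worth, nothing stops you from wrapping the launch so the caller never types <<<...>>> at all. A hedged sketch (the wrapper name launchVecAdd and the 256-thread default are my own choices, not anything CUDA mandates), assuming a 1-D problem:

```cuda
#include <cuda_runtime.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // guard: the grid may overshoot N
}

// Hypothetical wrapper: the caller passes only N, and the launch
// configuration is derived from it by ceiling division.
void launchVecAdd(const float* A, const float* B, float* C, int N) {
    const int threadsPerBlock = 256;   // a common, reasonable default
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocks, threadsPerBlock>>>(A, B, C, N);
}
```

Later CUDA versions (6.5 onward) also ship occupancy helpers such as cudaOccupancyMaxPotentialBlockSize that suggest a block size automatically, so the ingredients for this kind of wrapper exist in the toolkit itself.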

This is only a simple example. If you do a complex M×N-dimensional matrix calculation, it is very tedious to work out how many blocks or threads to use…

EVERYTHING in CUDA must be assigned by the developer. Is CUDA an infant, or are the CUDA designers infants?

I just want to know: who are these CUDA designers??? Do you know what should be encapsulated by the CUDA runtime itself and what should be exposed to developers???

BTW, why does it use three '<' and '>'? I have to press the '<' and '>' keys six times. Could you use a shorter and less ugly expression?

Sorry that CUDA is too hard for you.
I understand that basic math is very difficult and boring.
You should stick with something easy and exciting like Visual Basic.

Usually people do not have such problems, so you can always use something else for your development.
So please do that and leave CUDA development for others who do not find it so incredibly complicated.

Depending on your use case, if you find that CUDA itself offers too much detailed programmer control for your tastes, you may want to consider a middleware product that can use CUDA under the hood, such as Matlab, ArrayFire, or PyCUDA. The preceding is not meant to be a complete list or an endorsement, merely a sampling of such products. There are many more CUDA-enabled frameworks (e.g. for machine learning and image processing), so you might want to look around to find one that best fits your needs.

It is indeed hard for a third-rate company to design an easy-to-use framework. If you NVIDIA guys are good at math and know optimization theory, why do you expose those block and thread parameters to the developer??? I HAVE TO LOOK DOWN ON YOU!!!

The original poster does raise some valid points. Launching a kernel with just one linear dimension should be possible, and the CUDA API could figure out the rest; it could then assume all elements along the linear dimension are parallelizable. Just like a hard disk has a linear dimension for the sector number, and figures out the rest itself: cylinder, head, etc.

A bigger worry is backwards compatibility for PTX. PTX seems to be split into two flavors: sm_20, and sm_30 and above, probably because of the new compiler/switch to LLVM. It would be good to streamline PTX so that every PTX version from sm_10 up to sm_53 works on sm_xx and above. Currently this is not the case, so the compiler needs more work. Otherwise developers end up having to compile a PTX kernel for each architecture. There are already 12 different architectures, which is getting a bit much, and it would be 24 in total if 32-bit and 64-bit versions must both be included.

Also it would be nice if nvcc.exe displayed version information when executed without arguments, instead of requiring the --version parameter. Just a little remark; not displaying version information by default feels amateurish ;) :)

Another big issue with the six-dimensionality of the current CUDA API is the 16 instructions and several registers needed to compute a linear index. I hope this could be sped up, perhaps by a special instruction?

I feel like what makes CUDA actually hard is that hardware ignorance is much more punishing than with CPU programming.

Modern CPUs are so good, they let most people become “bad” programmers. CUDA’s hardware is so young in comparison that if you’re a “bad” programmer, it’s obvious. Gaining an understanding of the GPU architecture is tough, man.

Understanding how CUDA works vs managing my own memory and grids is pretty lol-worthy, I gotta say.

I need to continue the bashing!

What sucks even more is that CUDA uses a 3-level index: grid → block → thread. They expose these low-level things to the developer!!! It is like cutting a big hole in your trousers so everyone can see your underpants!!! Why should anyone care about your "underpants"??? WHY does CUDA make the developer use a formula like (x + y·Dx + z·Dx·Dy) to calculate the thread index??? Why can CUDA not handle that by itself???

Can the CUDA designers make the framework more transparent??? It should be a "black box" for developers. Do those CUDA designers have enough sense to build a lightweight framework??? I highly doubt it!!!

I didn’t even know it was possible to have a troll on this board. I’m so happy no part of the internet is safe from trolling.

The last part may’ve sounded sarcastic but I swear, I’m laughing right now.

On the other hand I've got to say that it's way easier to reach high utilization levels on a GPU after X hours of hard work. Programming a multicore CPU with multithreading and hand-coded SSE instructions is actually way harder than CUDA (IMO), because in most cases you can't just write some OpenMP pragmas and hope that the compiler takes care of the rest of the vectorization for you.
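To illustrate the "write some OpenMP pragmas and hope" style being contrasted here, a minimal sketch (compiled with -fopenmp; without that flag the pragma is simply ignored and the loop runs serially, with identical results):

```cpp
#include <vector>

// Element-wise vector add. With OpenMP enabled the iterations are
// split across CPU threads; the loop body is unchanged either way.
std::vector<float> vecAdd(const std::vector<float>& a,
                          const std::vector<float>& b) {
    std::vector<float> c(a.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```

The catch, as the post says, is that this only gets you thread-level parallelism; whether the inner loop additionally vectorizes into SSE instructions is up to the compiler, and reaching peak CPU throughput usually still means hand-written intrinsics.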

The initial threshold to get good at CUDA is definitely high, but once you're over that, it's by far more efficient than multi-core CPU programming.

Great Point!

I must say that your (Jimmy P) CUDA code is particularly easy to follow and I probably learned more from examining your sum reduction code than I learned from the GPU programming course I took years ago.

Not to mention that your sum and min/max reduction code is significantly faster than the equivalent calls in Thrust (at least on Kepler GTX 780 Ti, GTX Titan, and Tesla K20–K40).

I agree with this more than my own point, actually.

Very well-said.

CPU programming definitely has a lower threshold to entry. And what really helps is that CPUs are conducive to introductory algorithms and data structures. You can use an if-statement without worrying.

Granted, I've never done any SSE work in my C/C++ days, but bear this in mind: C++11 got std::thread and lambdas. Multithreading is now ridiculously easy to implement in C++ (not perfect, mind you). Maximizing a CPU has always been hard.
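To back up the "ridiculously easy" claim, a minimal std::thread-plus-lambda sketch that splits a reduction across two threads (compiled with -pthread; the function name parallelSum is my own):

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Sum a vector on two threads: each lambda reduces one half into a
// separate accumulator, and join() makes the results visible here.
long parallelSum(const std::vector<int>& v) {
    std::size_t mid = v.size() / 2;
    long lo = 0, hi = 0;
    std::thread t1([&] { lo = std::accumulate(v.begin(), v.begin() + mid, 0L); });
    std::thread t2([&] { hi = std::accumulate(v.begin() + mid, v.end(), 0L); });
    t1.join();
    t2.join();
    return lo + hi;
}
```

Compare this with pre-C++11 pthread boilerplate: no function-pointer-plus-void* argument packing, and captures replace manual argument structs.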

Also, seeing CudaaduC's post above, links to source code? I wanna learn how to do good CUDA too!

But really though, CUDA’s kind of my new jam.

This is JP's sum reduction:

The person who has the most influence on how I do CUDA is definitely Volkov:

I highly recommend checking out his talks.

Let me give one example of what smart design is:

When you save a file under Windows or Linux, it never asks you which cylinder/sector/track of the hard disk to use. It encapsulates all these low-level things, and users do not need to know these concepts at all.

The first thing in designing a programming framework is to consider where the boundary is. In other words: what is the framework's job, and what is the developer's job?

There is an abnormal value system among techies: only complex and difficult things are advanced; easy and simple things must be cheap. Mystifying things shows off one's capability. When you give a presentation, if you do not include some concepts that other people do not know and find hard to understand, then the presentation must be too shallow.

The above phenomenon is very common: C++ developers look down on Java and C# developers; computer vision guys look down on computer graphics guys, because graphics uses very limited math and always hacks in parameters that cannot be explained mathematically, especially in image rendering algorithms.

My judgment: to see how smart a person is, look at their capacity to simplify a problem, because making things complex is toooooo easy.

It’s already been pointed out to you that there are many higher-level frameworks for programming GPUs that do not expose the programmer to the grid/block/thread hierarchy. CUDA is the lowest level framework currently available to program NVIDIA GPUs for compute tasks. Although it is not assembly language, it is arguably only one step above assembly language. Therefore, it stands to reason that it would have the lowest level of abstraction of underlying hardware concepts. Many low-level concepts are exposed to the programmer.

If you don’t like that, you should take a look at other strategies like OpenACC, ArrayFire, Matlab, etc. None of those expose grid/block/thread hierarchy to the programmer.

Tools are generally suited for a particular type of task, and may not be suited for other task types. If you told an end user or applications programmer that the way to access a hard disk is via a transactional interface (fopen/fread), they would probably be perfectly happy. If you told an operating system programmer that the only way to access a hard disk is via a transactional interface, they might be concerned. Is that the fastest way to handle all cases? What if I want to create hidden structures on the disk? How can I make my hard disk have the same logical abstraction as a network resource? What if I want to format a disk? Mark a bad sector? Create a high-performance RAID array? None of these questions are relevant for an end user or applications programmer, but at a lower level, they may be important.
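The "transactional interface" an applications programmer sees can be sketched in a few lines of stdio; no cylinders or heads appear anywhere (the helper name roundTrip is my own):

```cpp
#include <cstdio>
#include <cstring>

// Write bytes through the stdio transactional interface and read
// them back. Where the bytes physically land on the disk is decided
// entirely by the OS and the drive, not by this code.
bool roundTrip(const char* msg) {
    std::FILE* f = std::tmpfile();   // anonymous scratch file
    if (!f) return false;
    std::fwrite(msg, 1, std::strlen(msg), f);
    std::rewind(f);                  // seek back to the start
    char buf[64] = {0};
    std::fread(buf, 1, sizeof(buf) - 1, f);
    std::fclose(f);
    return std::strcmp(buf, msg) == 0;
}
```

That abstraction is perfect for the end user, and exactly what an OS or RAID developer cannot live with, which is the point being made.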

Criticizing CUDA because it offers exposure to low-level concepts that you don’t need/don’t want ignores that there may be a legitimate purpose for those. If there is a legitimate purpose, then to hide it would be a problem. If CUDA were the only way to access GPUs for compute purposes, then the ecosystem would be considerably weaker (as it was in circa 2007-2008). But today, there are many approaches to tapping into the compute power of GPUs, some of which include fairly high levels of abstraction. You should consider taking a look at some of those.

My judgment: simplicity is not always a desirable goal. Simplicity vs. complexity as a goal depends on the task at hand. Simplicity is desirable for things I don't care about. Complexity is desirable for things that I wish to have a great deal of control over. I'm certain you can find complexity in Microsoft Word. It's just not related to accessing data on the disk. It's more likely to be found in the ways that you can format the written word on the page. Microsoft Word gives you exquisite control over that.

And before you run off and say CUDA should do both, I suggest that one-size-fits-all tools are rarely successful. Do one thing well. By building a low-level framework, and doing it well, with an appropriate level of complexity, CUDA enables higher level frameworks that abstract and simplify various aspects for various needs. The proof is in the pudding: the strategy appears to be working, as there are quite a few other programming languages and frameworks that can now use CUDA as a lower level framework. For a resource-constrained company, doing one thing well is a far more tractable proposition than trying to serve every need from the outset.

Sometimes I wish there were a feature on this site for awarding a 50 point bonus to an answer :-) A few additional quick thoughts.

(1) While CUDA was consciously designed with a C-style low-level abstraction that can serve as the basis for many other frameworks with higher level of abstraction, an important part of CUDA (and the CUDA design philosophy) from day one have also been domain-oriented libraries that hide hardware details (almost) completely. I would claim that this is a substantial part of the reason for CUDA’s success compared to competing approaches.

(2) For every CUDA programmer clamoring for higher levels of abstraction there is another one who desires less in the way of abstraction and more detailed control, from access to native assembly language to scheduling customizations. Usually equally-frequent demands from both extreme ends of a spectrum are indicative of a workable middle-of-the-road approach with wide appeal.

(3) As the past 30 years of parallel programming technology have demonstrated, parallel programming is inherently more complex than sequential programming. Prior to the impending death of Moore's Law, relatively few software engineers were compelled to face that reality. In parallel programming there is often a hierarchy of mechanisms to consider; for example, on CPUs one might typically have SIMD vectorization (often still requiring the use of intrinsics), OpenMP (for thread-based parallelism), and MPI (for inter-node parallelism) in combination. In that context, the complexity of CUDA seems par for the course. And as someone who wrote traditional explicitly-parallel SIMD code for some years, I would claim the implicit parallelism of CUDA's SIMT model makes programming significantly easier.

I thought about it some more… currently CUDA graphics cards are like SSE or SIMD machines, which is not really what I would want, but it's what we have right now. The fact that more expensive cards have multiple processors is interesting. Perhaps using blocks only could be interesting, and threads can be seen more as an SSE-like feature… So there could still be some advantage to this six-dimensionality thing.

But that has little to do with "parallelism" in its true meaning… SSE and SIMD are more of a trick to apply one instruction to a whole bunch of data. It's a limited, almost domain-specific application of a restricted computational model =D.

In some sense it would be weird to call the SSE units of CPUs "parallel units" or a parallel feature… though sometimes that might make sense, sometimes not. The same goes for a CPU processing bits in parallel: would you call a 32-bit AND a parallel architecture?

These "CUDA cores" are in that sense a bit misleading… another form of misdirection by NVIDIA, to some degree. If they were truly parallel compute units, this six-dimensionality would not be needed.
One last argument for six dimensionality would be "multiple-chip designs and such".
However, those chips are not designed in six dimensions.
So the mere fact that this API has six dimensions is weird and a sign of something strange going on, to say it somewhat more nicely ;)

Anyway… there is something else that's bugging me a bit: the event timer in CUDA being limited to milliseconds. This seems very amateurish. Why not microseconds or nanoseconds? Memory works at nanosecond scale, so again… weird.

And of course the launch time of CUDA kernels probably makes CUDA unsuited for small parallel computation tasks… only the bigger problems/data sizes seem worth solving on it… so many limitations in CUDA currently.

(Though I haven't yet had much experience with CUDA kernel launch times… the event timer being limited to milliseconds kind of prevents measuring that accurately, funnily enough… perhaps a synchronize plus the CPU's high-performance timer could shed some more light on it ;))

The cudaEvent timer is not limited to milliseconds. cudaEventElapsedTime happens to return a float quantity, scaled such that 1.0f = 1.0 millisecond. That does not mean the resolution is limited to milliseconds. The stated resolution is around 0.5 microseconds:
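For reference, the usual event-timing pattern looks like this; a hedged sketch (the function name timeRegionMs is my own). The key point is that the returned float is *denominated* in milliseconds but carries sub-millisecond precision:

```cuda
#include <cuda_runtime.h>

// Time a stretch of GPU work with CUDA events. The result is a float
// in milliseconds; a value like 0.042f means roughly 42 microseconds,
// well below one millisecond.
float timeRegionMs() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... launch kernels to be timed here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until 'stop' has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```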

You’ve referred a few times now to 6 dimensions. I guess you mean 3 grid dimensions and 3 block dimensions. In a cc3.0 + device, there is no particular requirement to use 6 dimensions. The first grid dimension is constrained to 2^31 - 1 blocks, and the block dimension works just fine one dimensionally at 1024 threads per block (max). In either case, going to 2 or 3 dimensions is largely a matter of convenience for the programmer, and definitely not a requirement. Many codes are actually simpler looking when the full 2- or 3- dimensional support is used, but it’s by no means mandatory.

The division between threads and blocks arises directly out of the hardware design of the chip. Since this hardware design must be exploited correctly to achieve full performance, it makes sense (at some low level of abstraction) to expose this difference to the programmer.

The API does indeed have a division between grid and block. This arises out of the hardware design of the chip (broken into SMs) and does not indicate anything “strange” going on – other than exposing that to the programmer. Beyond these 2 “dimensions”, the afforded ability to go to 6 dimensions (if desired) is largely at the discretion of the programmer, not mandatory, and in fact helpful for some codes.

I don’t agree with your comparison of SIMD quantities to processing 32-bits “in parallel”. The vast majority of SIMD use cases involve processing quantities in parallel, that are not all part of the same word-space. Most of the time, a 32-bit quantity is not a concatenation of smaller quantities from separate spaces (e.g. separate pixels, separate matrix elements, etc.) In virtually all scientific use cases, the concatenation of quantities into a SIMD vector word does involve separate spaces (e.g. separate pixels, separate matrix elements, etc.) and the number of times people use a SIMD calculation to compute a 128x32 or 256x32 bit single-word quantity is nearly zero, by comparison to other forms of scientific usage. Most of the extant literature, including that from intel, discusses SIMD vectorization as a key methodology to extracting parallelism inherent in a problem (at least for SIMD-capable processors):

“Simply put, Single Instruction Multiple Data, or SIMD, does multiple data computing in one instruction and is a kind of parallel computing technology…”