I’m also having trouble distinguishing between what features I want and what I want to accomplish. To the extent that CUDA is Turing complete, there’s nothing I can’t do in it. All requests ultimately come down to either “Make it easier to do” or “Make it faster.”
That said, there are many algorithms that don’t map well to CUDA. There are several reasons for this.
1. The algorithm may be inherently serial, so that it can’t be split into many parallel threads.
2. The algorithm may be parallelizable, but not to the extent needed to get good performance with CUDA (thousands of parallel threads).
3. The algorithm may require global communication between threads: it can’t be split into independent blocks of threads with no interaction between blocks.
4. Work may be generated in increments that are too small to efficiently use the GPU (e.g. a server application where lots of small jobs are received asynchronously).
5. The algorithm’s memory access patterns may map poorly to the GPU’s memory architecture (e.g. lots of random reads and writes that can’t be coalesced).
Reasons 2, 3, and 5 are the ones that cause the most problems for me. I can suggest lots of specific features to address them. Examples include:
- A lightweight global thread barrier for synchronization between blocks.
- Atomic operations for types other than ints (especially floats).
- The ability to run multiple kernels at once, with each one using a subset of the SMs.
- Reducing the cost of global memory access.
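To make the atomic-float item concrete: on hardware that has no native float atomic add, one common workaround is a compare-and-swap loop over the float's bit pattern. The sketch below is only an illustration under that assumption. The names `atomicAddFloatCAS`, `sumKernel`, and `sumOnesOnDevice` are hypothetical and are not part of CUDA; `atomicCAS`, `__float_as_int`, and `__int_as_float` are the existing intrinsics it relies on.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not a CUDA built-in): emulates an atomic float add
// using the 32-bit integer compare-and-swap that the hardware provides.
__device__ float atomicAddFloatCAS(float* address, float value)
{
    int* address_as_int = reinterpret_cast<int*>(address);
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        // Reinterpret the bits as a float, add, and try to publish the result.
        int desired = __float_as_int(__int_as_float(assumed) + value);
        old = atomicCAS(address_as_int, assumed, desired);
        // If another thread changed the word in between, old != assumed: retry.
        // Comparing as ints (not floats) keeps the loop from spinning on NaN.
    } while (old != assumed);
    return __int_as_float(old);  // value stored before this thread's add
}

// Every thread adds one element into a single global accumulator.
__global__ void sumKernel(const float* in, int n, float* total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAddFloatCAS(total, in[i]);
    }
}

// Hypothetical host wrapper: sums n copies of 1.0f on the device.
// Error checking is omitted to keep the sketch short.
float sumOnesOnDevice(int n)
{
    std::vector<float> host(n, 1.0f);
    float* d_in = nullptr;
    float* d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));  // all-zero bits == 0.0f

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sumKernel<<<blocks, threads>>>(d_in, n, d_total);

    // This blocking copy also waits for the kernel to finish.
    float result = 0.0f;
    cudaMemcpy(&result, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return result;
}
```

Calling `sumOnesOnDevice(1024)` from host code exercises the loop with many threads contending for one address. That contention is the weakness of this approach: every failed compare-and-swap is a retry, so the adds are effectively serialized, which is why a hardware-supported float atomic is the more attractive option.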
But ultimately what it comes down to is that some of the computations I need to do can’t be done efficiently on the GPU, and I’ll take any change that makes them more efficient.
Peter