Differences between GPU and CPU multithreading

I’d like comments on the following statement regarding porting OpenMP code to CUDA:

“The following concepts are largely irrelevant for GPU threads: lock, semaphore, mutex, fork, join, message queue. Therefore ‘porting’ a typical multi-threaded algorithm from OpenMP to CUDA is no easier (and probably somewhat harder) than working from a simple single-threaded prototype of the algorithm. The multi-threading aspect of the OpenMP code will be pretty much completely unrelated to the CUDA implementation.”


A GPU algorithm needs more fine-grained parallelism than something written with OpenMP for multi-core CPUs. But if you have a parallel CPU algorithm, you have at least already identified some coarse-grained parallelism. Apart from the granularity, the most glaring difference on the GPU is that you have very little memory per thread to work with.
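To make the granularity difference concrete, here is a minimal sketch (my own illustration, not taken from the quoted code) of the same SAXPY loop in both models: OpenMP hands a chunk of iterations to each of a handful of CPU threads, while CUDA launches one lightweight thread per element. The grid/block sizes and the use of unified memory are just choices for brevity.

```cuda
#include <cstdio>

// OpenMP version (coarse-grained: one chunk of iterations per CPU thread):
//
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i)
//       y[i] = a * x[i] + y[i];
//
// CUDA version (fine-grained: one loop iteration per GPU thread):
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One thread per element: ~a million threads, vs. a handful in OpenMP.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note there is no lock or mutex anywhere: each GPU thread touches its own element, which is exactly the data-parallel structure you want to carry over from the OpenMP version.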

Therefore I say: first, ‘porting’ is the wrong term in this context. Second, if you have an algorithm that can be implemented efficiently with OpenMP, some of the brain work is already done. Third, an efficient GPU algorithm might still look very different, or might not be possible at all. So I disagree with the quoted statement: while you cannot copy and paste OpenMP code, tinker with it, and expect it to run efficiently (or at all) on the GPU, your task will be easier than starting from a sequential algorithm.

You might want to take a look at this paper: http://cobweb.ecn.purdue.edu/~smin/papers/…gpu-ppopp09.pdf