I have a dual-core CPU and would like to use OpenMP to run two CPU worker threads while a third thread issues the CUDA calls. My performance numbers don't make it clear whether CUDA is actually letting me do this. So I have two questions:
Does kernel execution in either of the recent releases (0.8 or 0.9) block a CPU core? I realize that kernel launches are asynchronous as of v0.9, but I saw no change in my performance numbers between 0.8 and 0.9, so I have to ask.
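In case it clarifies what I mean by "blocking", this is roughly the kind of timing test I have in mind (a minimal CUDA C sketch; myKernel, the sizes, and the launch configuration are just placeholders, not my real code):

```c
// Check whether a kernel launch returns control to the host immediately
// or only after the GPU finishes.
#include <stdio.h>
#include <omp.h>            // used only for omp_get_wtime()
#include <cuda_runtime.h>

__global__ void myKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;              // dummy work
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    double t0 = omp_get_wtime();
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // launch
    double t1 = omp_get_wtime();                    // back on the host
    cudaThreadSynchronize();                        // wait for the GPU
    double t2 = omp_get_wtime();

    // If the launch is truly asynchronous, (t1 - t0) should be tiny and
    // almost all of the kernel time should show up in (t2 - t1).
    printf("launch overhead: %f s, kernel wait: %f s\n", t1 - t0, t2 - t1);

    cudaFree(d_data);
    return 0;
}
```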
I know that my data transfers (the only non-asynchronous part of my CUDA code) take an insignificant amount of time, so I am not sure what could be causing the numbers I see. I use an "omp parallel sections" construct to split my serial code into two threads: one contains only the GPU/CUDA subroutines, the other the CPU routines. The CPU thread splits again with an "omp parallel do," and all threads rejoin at the end of the computation (roughly the structure sketched below).

Without OpenMP, the GPU part takes 24 seconds and the CPU part (1 thread) takes 58 seconds, running serially. With OpenMP, all three threads start simultaneously (I assume); the start-to-finish wall time for the GPU section is 24 seconds, and the start-to-finish wall time for the CPU section is 34 seconds. If the CPU isn't burdened by CUDA at all, both cores are free for CPU work and the 58 seconds of computation should finish in 29 seconds, not 34. But if CUDA blocks, the GPU thread pins one of the two cores for its 24-second run, so the CPU section should run on one thread for 24 seconds and then on two threads for another 17 seconds (the remaining 34 seconds of work split across both cores), totalling 58 seconds of work and 41 seconds of wall-clock time.
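For reference, the thread layout is roughly this (a C-style sketch of the structure, even though my actual code uses the "do" form; gpu_part() and cpu_work() are placeholders for my real routines):

```c
#include <omp.h>

void gpu_part(void);    /* all cudaMemcpy calls and kernel launches live here */
void cpu_work(int i);   /* one chunk of the CPU-side computation              */

void compute(int n_chunks)
{
    omp_set_nested(1);  /* let the CPU section fork again */

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            gpu_part();         /* thread A: GPU/CUDA only */
        }
        #pragma omp section
        {
            /* thread B: CPU work, split again across the remaining cores */
            #pragma omp parallel for
            for (int i = 0; i < n_chunks; i++)
                cpu_work(i);
        }
    }
    /* all threads rejoin here, at the end of the computation */
}
```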
Am I approaching this incorrectly? Have I missed something? Or does CUDA still require a large amount of time from the CPU during kernel calls?