My work uses 4 streams, and I would like them to run concurrently. The code is something like this:
for (int i = 0; i < N; i++)          // batch loop
{
    for (int j = 0; j < 4; j++)
        myCudaCode(stream[j]);       // working code using the specified stream
}
However, the nvvp profiler shows that the streams are actually not running concurrently, because the CPU is fully occupied by the kernel launches. I did not use any cudaDeviceSynchronize. You can see the figure at the following link:
https://www.dropbox.com/s/qkg1d4y8yp6g43c/nvvp_SingleThread.png?dl=0
I understand that all my kernels on the GPU are quite small, comparable in duration to the kernel launch times on the CPU, but for now we do not intend to change them. From the figure above, most kernel launches on the CPU take around 5~10 us, which is considered normal. The whole processing time for one batch is around 0.4 ms (shown in gray).
An intuitive way to optimize the code is to use multi-threading to parallelize the CUDA kernel launches on the CPU. Here is what I did using OpenMP:
for (int i = 0; i < N; i++)
{
    #pragma omp parallel num_threads(4)
    myCudaCode(stream[omp_get_thread_num()]);  // one thread per stream
}
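Spelled out as a compilable sketch (same placeholder kernel and sizes as above, built with nvcc -Xcompiler -fopenmp), the multi-threaded variant I tried is essentially:

```cuda
#include <omp.h>
#include <cuda_runtime.h>

// Placeholder for the real per-stream work.
__global__ void myKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

int main()
{
    const int N = 1000, n = 1024;    // placeholder batch count and size
    cudaStream_t stream[4];
    float *buf[4];
    for (int j = 0; j < 4; j++) {
        cudaStreamCreate(&stream[j]);
        cudaMalloc(&buf[j], n * sizeof(float));
    }
    for (int i = 0; i < N; i++) {
        // Each of the 4 OpenMP threads launches into its own stream.
        #pragma omp parallel num_threads(4)
        {
            int j = omp_get_thread_num();
            myKernel<<<(n + 255) / 256, 256, 0, stream[j]>>>(buf[j], n);
        }
    }
    cudaDeviceSynchronize();
    for (int j = 0; j < 4; j++) {
        cudaFree(buf[j]);
        cudaStreamDestroy(stream[j]);
    }
    return 0;
}
```

Note that the parallel region is opened and closed once per batch iteration, and all four threads call into the CUDA runtime at the same time.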
Now the nvvp profiler shows this:
https://www.dropbox.com/s/drenvzj3syuz9e4/nvvp_MultiThread.png?dl=0
The four streams now appear to run concurrently. However, within each CPU thread, the kernel launches are no longer as compact as before, and they are significantly stretched (typically to 20~30 us). The resulting time for one batch (shown in gray) is now around 0.5 ms, even longer than in the single-thread case.
I also tried the pthread approach; it shows a similar problem.
So I'd like to ask for an effective way to parallelize the kernel launches on the CPU. Ideally, the time should be reduced to about one fourth.