CUDA kernel call in subroutine

Hello.

I know that after a kernel call, control returns to the host immediately.

But if a kernel is called within a subroutine, does the subroutine wait to return until the kernel finishes?

I’m trying to do concurrent host-device calculation, but what I’ve found is that the code becomes slower and slower as I give more work to the host.

My test GPU is not that high-end, so offloading some work to the host should reduce the runtime.

Hi CNJ,

Kernels will run asynchronously with respect to the host, even across host-side subroutine boundaries. However, the host will block if you copy data back, call cudaStreamSynchronize, or do other such things.
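A minimal CUDA Fortran sketch of the behavior described above — the kernel launch returns immediately even when it happens inside a subroutine, and the host only blocks at an explicit synchronization point. The kernel `work_kernel`, the subroutine `run_overlap`, and the host work shown are made up for illustration:

```fortran
module kernels
contains
  attributes(global) subroutine work_kernel(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) * 2.0
  end subroutine
end module

subroutine run_overlap(a_d, b_h, n)
  use cudafor
  use kernels
  implicit none
  integer, intent(in) :: n
  real, device :: a_d(n)
  real :: b_h(n)
  integer :: istat

  ! Asynchronous launch: control returns to the host right away,
  ! and the subroutine itself can return before the kernel finishes.
  call work_kernel<<<(n + 255) / 256, 256>>>(a_d, n)

  b_h = b_h + 1.0                  ! host work, overlapping the kernel

  istat = cudaDeviceSynchronize()  ! the host blocks here, not at the launch
end subroutine
```

The key point is that only the synchronize call (or a device-to-host copy) forces the host to wait; the subroutine boundary itself does not.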

  • Mat

Yes, you are right.

Calling the kernel in a subroutine was not the problem.

https://forums.developer.nvidia.com/t/is-kernel-launch-really-asynchronous-in-cuda-fortran/134894/1

My issue was the same as the one in this topic.

But the concurrent calculation is still slower than the one-sided calculation.

As the host runtime increases, the kernel runtime increases as well.

Actually, if I don’t do anything on the host while the kernel is running, the overall runtime is the shortest.

It seems that the kernel starts only after the host finishes its work, even though the kernel is launched first.

Is there any communication between the host and device while the kernel is running?

I call cudaStreamSynchronize only after the host-side work is finished.

Have you tried profiling your code using PGPROF? The generated timeline should give you clues as to what’s happening.

  • Mat

If I compile my code with a profiling option, the program dies at runtime with:

Error: internal compiler error: invalid thread id.


Does the WDDM driver impose restrictions on concurrent host-device calculation?

Instead of PGPROF, I tried to analyze my code with NVIDIA Nsight. It doesn’t give detailed information, but it became clear that the kernel is launched normally, yet stays idle while the host is doing its work and only starts to run once the host reaches cudaStreamSynchronize.

And I think I found a solution: if I call cudaStreamQuery right after the kernel launch, the concurrent calculation works as expected.
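For reference, here is roughly what that workaround looks like in CUDA Fortran. On the Windows WDDM driver, launches can sit batched in a software command queue, and calling cudaStreamQuery on the stream is a commonly cited, non-blocking way to flush that queue so the kernel actually starts. The kernel name `work_kernel`, the routine `host_work`, and the launch configuration are placeholders:

```fortran
! Sketch, assuming a default-stream launch on a WDDM device.
call work_kernel<<<grid, tBlock>>>(a_d, n)  ! queued; under WDDM it may not start yet
istat = cudaStreamQuery(0)                  ! non-blocking query; flushes the batched queue
call host_work(b_h, n)                      ! host work now genuinely overlaps the kernel
istat = cudaStreamSynchronize(0)            ! finally wait for the kernel to finish
```

The query itself returns immediately (either "done" or "not ready"), so it costs almost nothing on the host side; its useful side effect here is forcing the pending launch to be submitted to the GPU.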

I found this solution on StackOverflow.

What is the exact purpose of cudaStreamQuery? I thought this function was merely for testing whether a stream is finished or not. They say something about flushing, but I’m not familiar with it. Can you explain this feature a bit?

If I compile my code with a profiling option

What options? As of PGI 2016, PGPROF is a sample-based profiler, so it doesn’t need the -Mprof flags.

Does the WDDM driver impose restrictions on concurrent host-device calculation?

Sorry, I don’t know. I only use the Tesla (TCC) driver.

What is the exact purpose of cudaStreamQuery? I thought this function was merely for testing whether a stream is finished or not. They say something about flushing, but I’m not familiar with it. Can you explain this feature a bit?

As far as I know, it just tests whether the stream’s queue is finished. Can you ask your question on StackOverflow, or post a link so I can see the context of the answer?

  • Mat

I’m using PVF 15.10.

And the following is the link.

http://stackoverflow.com/questions/19944429/cuda-performance-penalty-when-running-in-windows

Here is another answer from an official NVIDIA developer, though that topic is about concurrency between kernels, not between the device and host.

https://devtalk.nvidia.com/default/topic/538232/concurrent-kernels/?offset=4

Ok, for PVF 15.10 use NVProf. The older PGPROF doesn’t support CUDA profiling.

Thanks for the links. I’ll note them if another user sees a similar issue.