How to know when a kernel actually starts running in CUDA C++?


Is there a way to know whether a kernel instance has started running?
When we launch a kernel, the CPU sends the instance to the GPU. Because of resource limits, however, the kernel instance may be blocked by other kernel instances and may not start running immediately.
Through the CUDA C++ API, can I find out the actual time at which the kernel starts to run?
Also, can the CPU query GPU resource usage in CUDA C++?


Record an event before and after the kernel launch.
Query the first event with cudaEventQuery to determine whether the kernel has started. Query the second event to determine whether the kernel has finished. If you need precise timing, use cudaEventElapsedTime to get the time between the two events.
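
A minimal sketch of that event approach, assuming a single stream and a placeholder kernel. The names busyKernel and timeKernelWithEvents are made up for illustration and are not part of the CUDA API. Note that completion of the first event only shows that the stream has reached the kernel launch; it does not strictly prove that the kernel's blocks are already resident on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: spins for roughly `cycles` clock ticks.
__global__ void busyKernel(long long cycles)
{
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

// Records events around the kernel in one stream, polls them from the host,
// and returns the time between the two events in milliseconds (negative on error).
float timeKernelWithEvents(long long cycles)
{
    cudaStream_t stream;
    cudaEvent_t started, finished;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0f;
    cudaEventCreate(&started);
    cudaEventCreate(&finished);

    cudaEventRecord(started, stream);
    busyKernel<<<1, 1, 0, stream>>>(cycles);
    cudaEventRecord(finished, stream);

    // cudaEventQuery returns cudaErrorNotReady until the event has completed.
    while (cudaEventQuery(started) == cudaErrorNotReady) { }
    printf("stream has reached the kernel\n");

    while (cudaEventQuery(finished) == cudaErrorNotReady) { }
    printf("kernel has finished\n");

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, started, finished) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(started);
    cudaEventDestroy(finished);
    cudaStreamDestroy(stream);
    return ms;
}
```

Calling timeKernelWithEvents(1000000) from main would print the two status lines and return the elapsed time. The busy-wait loops are only there to show the polling; real host code would usually do other work between queries.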

You can query memory usage with cudaMemGetInfo. For other types of GPU resources, possibly not, but you would have to be specific.
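
For the memory query, a short sketch might look like the following. The wrapper name printDeviceMemory is an assumption for illustration; only cudaMemGetInfo and cudaGetErrorString are CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reports free and total memory on the current device.
// Returns true on success and fills in both byte counts.
bool printDeviceMemory(size_t *freeBytes, size_t *totalBytes)
{
    cudaError_t err = cudaMemGetInfo(freeBytes, totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("free: %zu MiB, total: %zu MiB\n", *freeBytes >> 20, *totalBytes >> 20);
    return true;
}
```

The values describe the device as a whole at the moment of the call, so they can change as other processes or streams allocate and free memory.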

You mean something like this?

    kernel1 <<< ..., stream1 >>> (...);

    cudaEventRecord(start, stream2);
    kernel2 <<< ..., stream2 >>> (...);
    cudaEventRecord(stop, stream2);

    cudaEventSynchronize(stop);  // wait for stop before reading the elapsed time
    cudaEventElapsedTime(&t, start, stop);

In the above case, suppose I launch two kernels at t=0. I deliberately make the second kernel thread-intensive so that it is blocked by the first kernel and cannot start at t=0. However, if I use CUDA events like that, the start event would be recorded right after kernel1 is launched, not when kernel2 actually starts running.

Agreed. In your original description, you did not mention streams and an expectation of concurrency. When you add streams to it, the method I proposed won’t do what you want. If both kernels are launched into the same stream, my method should be instructive.

There might not be a CUDA-runtime-API method to do what you want, although it might be possible with CUPTI:

If this is very important to you, you could use a flag reported in host-pinned memory, and have your host code poll the flag for execution state. The kernel code would set the flag state. See here for an example:
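
Separately from the example referred to above, here is a rough sketch of the flag idea, assuming mapped pinned memory is supported on the device. The names flaggedKernel and launchAndWaitForStart are hypothetical, and the spin loop in the kernel is only a stand-in for real work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the first thread announces that the kernel is executing
// by writing to a flag that lives in mapped, pinned host memory.
__global__ void flaggedKernel(volatile int *startedFlag, long long cycles)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *startedFlag = 1;
        __threadfence_system();  // push the write out so the host can observe it
    }
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }  // stand-in for real work
}

// Launches the kernel and polls the flag from the host.
// Returns true if the host observed the flag being set.
bool launchAndWaitForStart(long long cycles)
{
    int *hostFlag = nullptr;
    if (cudaHostAlloc((void **)&hostFlag, sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return false;
    *hostFlag = 0;

    int *devFlag = nullptr;
    if (cudaHostGetDevicePointer((void **)&devFlag, hostFlag, 0) != cudaSuccess) {
        cudaFreeHost(hostFlag);
        return false;
    }

    flaggedKernel<<<1, 32>>>(devFlag, cycles);
    bool launched = (cudaGetLastError() == cudaSuccess);

    // Poll until the kernel sets the flag. cudaStreamQuery stops the loop if the
    // kernel has already finished or failed, so the host cannot spin forever.
    volatile int *flag = hostFlag;
    while (launched && *flag == 0 && cudaStreamQuery(0) == cudaErrorNotReady) { }

    cudaDeviceSynchronize();
    bool sawStart = launched && (*flag == 1);
    if (sawStart) printf("kernel reported that it started\n");

    cudaFreeHost(hostFlag);
    return sawStart;
}
```

The host could take a CPU timestamp at the moment the flag flips to approximate the kernel's start time. The flag is written by one thread of block 0, so it reports when that block began executing, not when every block did.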

This is probably easier to get working correctly on Windows in TCC mode or on Linux. Windows WDDM mode may present some additional challenges due to command batching.