Hi,
I used Nsight Systems to track kernel execution times for some concurrency analysis. As you can see in the figure below, it seems that four kernels are serialized.
The questions are:
1- To be sure that I am looking at the right information: in the case of concurrency, I expect to see some overlap in the timeline. Am I right?
2- Was this serialization decided by the CUDA runtime or driver, even though there is no strict serialization in the code? Does NVCC add or remove any features related to concurrent kernel execution? Assuming the code and the compilation command are available, is there any way to force or forbid concurrent execution?
Yes. I don’t see any evidence of kernel concurrency in your Nsight Systems output: as far as I can see, none of the kernel bars overlap.
The topic of kernel concurrency has been covered in many places, and there is even a concurrentKernels CUDA sample code that demonstrates how to witness it.
Among the necessary requirements for witnessing kernel concurrency are that the kernels must be launched into separate streams (at approximately the same time in your application), and that the kernels must not fully utilize GPU resources in any way. Most likely, if you are not witnessing kernel concurrency, one of these requirements is not being met. Based on the excerpt of Nsight Systems output you have provided, it might be that you have not used streams properly. The full Nsight Systems timeline view would make it easy to confirm or refute that idea.
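To make those two requirements concrete, here is a minimal sketch (not your code, and not the concurrentKernels sample itself). It launches four tiny kernels back to back, each into its own stream, so none of them comes close to filling the GPU. The names spin_kernel, run_demo and kNumStreams, and the spin duration, are just assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` clock ticks so that
// it is long enough to be clearly visible on a profiler timeline.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launches one small kernel per stream. Returns 0 on success, 1 on error.
int run_demo()
{
    const int kNumStreams = 4;           // assumption: four kernels, as in the question
    cudaStream_t streams[kNumStreams];

    // Report whether device 0 supports concurrent kernel execution at all.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("concurrentKernels supported: %d\n", prop.concurrentKernels);

    // Requirement 1: each kernel gets its own (non-default) stream.
    for (int i = 0; i < kNumStreams; ++i)
        if (cudaStreamCreate(&streams[i]) != cudaSuccess) return 1;

    // Requirement 2: one block of one thread leaves nearly all GPU
    // resources free, so the kernels are able to run side by side.
    for (int i = 0; i < kNumStreams; ++i)
        spin_kernel<<<1, 1, 0, streams[i]>>>(100000000LL);

    if (cudaGetLastError() != cudaSuccess) return 1;
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}

int main()
{
    int rc = run_demo();
    printf(rc == 0 ? "done\n" : "CUDA error\n");
    return rc;
}
```

If you build something like this with nvcc and profile it with nsys, I would expect the four kernel bars to appear in separate stream rows and overlap in time on a GPU that supports concurrent kernels. If you change every launch to use the same stream, or make the grid large enough to fill the GPU, the bars should go back to running one after another.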
This is not something you enable or disable through switches you would normally pass to nvcc. There might be some exceptions to that statement regarding specification of the default stream (for example, the nvcc --default-stream option), but that would probably be a corner case. Answering such questions with no code is necessarily somewhat imprecise; it's hard to cover every possible scenario with general statements.