Does cudaOccupancyMaxActiveBlocks return the number of blocks taking into account other co-running kernels?

Does “cudaOccupancyMaxActiveBlocksPerMultiprocessor” take into account other kernels that are running on the GPU? For instance, if a kernel is running in a separate stream and I then call “cudaOccupancyMaxActiveBlocksPerMultiprocessor”, will the returned number be the remaining blocks, or all the blocks of the GPU?


It doesn’t take anything into account other than the compiled kernel, the type of device you are running on, and any parameters you pass to the function call itself.
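A minimal sketch of how one might check this: the query takes only the kernel function, the block size, and the dynamic shared memory size, so it has no input describing what else is running. The kernel `scaleKernel`, the helper `maxBlocksPerSM`, and the sizes below are hypothetical names and values chosen for illustration. Going by the answer above, one would expect the two printed numbers to match.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only as the subject of the occupancy query.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the maximum number of resident blocks per SM for scaleKernel
// at the given block size and dynamic shared memory size, or -1 on error.
int maxBlocksPerSM(int blockSize, size_t dynamicSmemBytes) {
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, scaleKernel, blockSize, dynamicSmemBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return numBlocks;
}

int main() {
    const int blockSize = 256;
    const int n = 1 << 24;

    // Query while the GPU is idle.
    int before = maxBlocksPerSM(blockSize, 0);

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Put work in flight in a separate stream, then query again.
    scaleKernel<<<(n + blockSize - 1) / blockSize, blockSize, 0, stream>>>(d, 2.0f, n);
    int during = maxBlocksPerSM(blockSize, 0);
    cudaStreamSynchronize(stream);

    printf("blocks/SM before launch: %d, with a kernel in flight: %d\n", before, during);

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

To get a whole-GPU figure, multiply the result by the `multiProcessorCount` field reported by `cudaGetDeviceProperties`. That is still a theoretical maximum for this kernel alone, not a count of currently free slots.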

If it does not return the available resources, why is it executed at every cuBLAS call? Moreover, I have noticed that it seems to have a lock inside: when I start a cudaLaunchKernel from one host thread and run cudaOccupancy from another host thread, the duration of cudaOccupancy increases by more than 10x. Why does it require the lock?

I don’t know that it requires a lock or why it requires a lock. You may wish to switch to CUDA 11.4 or later.

I am trying to understand why cudaLaunchKernel has a much greater execution time when it co-executes with cudaLaunchKernel or cudaOccupancy from different threads. With one thread cudaLaunchKernel requires 6,7us, while with 2 threads it requires more than 40us (see the figures attached).

Which CUDA version are you using?

Which operating system?

This run is with CUDA 11.4 (as you mentioned previously); the same happens with CUDA 11.3. I am using CentOS 7.

The profiler can also add latency in this particular scenario (multithreaded kernel launch). You can assess the CPU thread overhead of the launch latency using standard host thread timing techniques. This is currently a limitation of the profiler as far as I know. I’m actually referring to Nsight Systems, not nvvp as you are using here. nvvp is not recommended for Turing devices and forward (Turing, Ampere, currently).

I know that the profiler adds overhead to cudaLaunchKernel, but the same duration is shown by nvprof (with args -s --csv). Moreover, I see the same behavior in GPU utilization. My application creates N threads that issue kernels to N streams. Without cudaOccupancy, GPU utilization (from nvidia-smi) is 100% even if I increase the number of threads. However, when I add cudaOccupancy (blue line in the figures attached), GPU utilization decreases as the number of threads increases. You can see this in the figures attached.

My explanation is that cudaLaunch execution time increases when it co-executes with another cudaLaunch or cudaOccupancy from another thread. This happens because there is contention either on some kind of lock or in the OS kernel, because I have noticed from perf that cudaLaunch ends up in a system call.

When I look at the nvvp profiler, it seems that with cudaOccupancy the kernels from different threads execute in lock step (one from thread0, the next one from thread1), as you can see in the figure below. However, when I remove cudaOccupancy, it seems that one thread sleeps while the other issues some kernels, then the sleeping thread wakes up and issues some kernels (while the first sleeps), and so on. The result is that when the threads execute in lock step (the scenario with cudaOccupancy) each cudaLaunch takes more time, so GPU utilization decreases because the GPU has less work to execute. In the case without cudaOccupancy, the cudaLaunch time of the thread that issues the sequence of kernels is small, so the GPU has work to do and as a result its utilization is high.

My suggestion would be to develop the smallest/shortest possible complete test case and then file a bug. Based on what I read here, I would file the bug against the occupancy call disrupting the work issuance, but if you want to go after the increase in launch latency, by all means file a bug for that too.
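A possible starting point for such a test case, as a sketch only: N host threads each launch an empty kernel into their own stream, the launch call is timed on the host with `std::chrono` rather than with a profiler, and the occupancy query can be switched on or off. The names `emptyKernel`, `meanLaunchMicros`, and `run`, the thread counts, and the iteration count are all assumptions made for illustration; none of them come from the application discussed in this thread.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical empty kernel: only the launch path is of interest here.
__global__ void emptyKernel() {}

// Launches `iters` kernels into a private stream and returns the mean
// host-side duration of the launch call in microseconds. When
// `queryOccupancy` is true, an occupancy query precedes every launch
// (the query itself is not included in the measured time).
double meanLaunchMicros(int iters, bool queryOccupancy) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    std::chrono::duration<double, std::micro> total{0};
    for (int i = 0; i < iters; ++i) {
        if (queryOccupancy) {
            int blocks = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, emptyKernel, 128, 0);
        }
        auto t0 = std::chrono::steady_clock::now();
        emptyKernel<<<1, 128, 0, stream>>>();
        auto t1 = std::chrono::steady_clock::now();
        total += t1 - t0;
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return total.count() / iters;
}

// Runs the measurement concurrently on `numThreads` host threads.
void run(int numThreads, int iters, bool queryOccupancy) {
    std::vector<double> means(numThreads, 0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&means, t, iters, queryOccupancy] {
            means[t] = meanLaunchMicros(iters, queryOccupancy);
        });
    }
    for (auto& w : workers) w.join();
    for (int t = 0; t < numThreads; ++t) {
        printf("threads=%d occupancy=%d thread=%d mean launch=%.2f us\n",
               numThreads, queryOccupancy ? 1 : 0, t, means[t]);
    }
}

int main() {
    // Warm up so context creation and module loading are not timed.
    cudaFree(0);
    emptyKernel<<<1, 128>>>();
    cudaDeviceSynchronize();

    for (int threads : {1, 2, 4}) {
        run(threads, 2000, false);
        run(threads, 2000, true);
    }
    return 0;
}
```

This would typically be built with something like `nvcc -O2 repro.cu -o repro`; depending on the toolchain, the host compiler may also need a threading flag. Comparing the `occupancy=0` and `occupancy=1` rows at each thread count separates the two effects mentioned above, and the host-side timing keeps profiler overhead out of the numbers.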

I’m not sure if you are trying to add information here. nvprof is “the profiler”. nvvp uses nvprof under the hood to do its work. And I repeat, those are not the recommended profilers for Turing GPUs. So I would definitely reconfirm observations with the recommended profiler first (with respect to the increase in launch latency, not the occupancy question).
