Does “cudaOccupancyMaxActiveBlocksPerMultiprocessor” take into account other kernels that are running on the GPU? For instance, if a kernel is running in a separate stream and I then call “cudaOccupancyMaxActiveBlocksPerMultiprocessor”, will the returned value reflect only the remaining blocks, or all the blocks of the GPU?
No.
It doesn’t take anything into account other than the compiled kernel, the type of device you are running on, and any parameters you pass to the function call itself.
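For illustration, here is a minimal sketch of the call; the kernel and launch parameters are placeholders, not taken from your application:

```cpp
// Sketch: the only inputs to the occupancy query are the compiled kernel,
// a block size, and a dynamic shared memory request.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int numBlocks = 0;
    int blockSize = 256;
    size_t dynamicSmemBytes = 0;

    // The result depends only on the kernel's compiled resource usage
    // (registers, static shared memory), the block size, the dynamic shared
    // memory request, and the properties of the current device. It does not
    // inspect what other kernels or streams are currently executing.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynamicSmemBytes);

    printf("Max active blocks per SM (theoretical): %d\n", numBlocks);
    return 0;
}
```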
If it does not return the available resources, why is it executed at every cuBLAS call? Moreover, I have noticed that it seems to take a lock internally: when I start a cudaLaunchKernel from one host thread and call cudaOccupancy from another host thread, the duration of the cudaOccupancy call increases by more than 10x. Why does it require the lock?
I don’t know that it requires a lock or why it requires a lock. You may wish to switch to CUDA 11.4 or later.
I am trying to understand why cudaLaunchKernel has a much greater execution time when it co-executes with another cudaLaunchKernel or with cudaOccupancy from a different thread. With one thread, cudaLaunchKernel takes about 6.7 us, while with 2 threads it takes more than 40 us (see the figures attached).
which CUDA version are you using?
which operating system?
This run is with CUDA 11.4 (as you mentioned previously), the same happens with CUDA 11.3. I am using CentOS 7.
The profiler can also add latency in this particular scenario (multithreaded kernel launch). You can assess the CPU thread overhead of the launch latency using standard host thread timing techniques. This is currently a limitation of the profiler as far as I know. I’m actually referring to Nsight Systems, not nvvp as you are using here. nvvp is not recommended for Turing devices and forward (Turing, Ampere, currently).
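As an illustration of what is meant by standard host thread timing, here is a minimal sketch that measures the host-side launch overhead with std::chrono and no profiler attached; the empty kernel and iteration count are placeholders:

```cpp
// Sketch: time only the host-side cost of the launch call.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int iterations = 10000;

    // Warm up so context creation is not included in the measurement.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        emptyKernel<<<1, 1>>>();   // asynchronous: only the launch call is timed
    auto stop = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();

    // If the launch queue fills up, later launches may stall, so treat this
    // as an approximation of the per-launch host overhead.
    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    printf("Average host-side launch overhead: %.2f us\n", us / iterations);
    return 0;
}
```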
I know that the profiler adds overhead to cudaLaunchKernel, but the same duration is shown by nvprof (with args -s --csv). Moreover, I see the same behavior in GPU utilization. My application creates N threads that issue kernels to N streams. Without cudaOccupancy, GPU utilization (from nvidia-smi) is 100% even as I increase the number of threads. However, when I add cudaOccupancy (blue line in the figures attached), GPU utilization decreases as the number of threads increases.
My explanation is that cudaLaunch execution time increases when it co-executes with another cudaLaunch or with cudaOccupancy from another thread. This happens because there is contention, either on some kind of lock or in the OS kernel, since I have noticed from perf that cudaLaunch ends up in a system call.
Looking at the nvvp profiler, it seems that with cudaOccupancy the kernels from different threads execute in lock-step (one from thread0, then the next from thread1), as you can see in the figure below. However, when I remove cudaOccupancy, it seems that one thread sleeps while the other issues some kernels, then the sleeping thread wakes up and issues some kernels (while the first sleeps), and so on. The result is that when the threads execute in lock-step (the scenario with cudaOccupancy), cudaLaunch takes more time, so GPU utilization decreases because the GPU has less work to execute. In the case without cudaOccupancy, the cudaLaunch calls of the thread that issues the sequence of kernels are short, so the GPU has work to do and as a result its utilization is high.
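For reference, something along these lines reproduces the pattern I am describing (N host threads, each issuing kernels into its own stream, optionally calling cudaOccupancy before each launch); the kernel, sizes, and counts are only illustrative, not my actual application:

```cpp
// Sketch: N threads, N streams, with an optional occupancy query per launch.
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void work(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 1.0001f + 1.0f;
}

void worker(int launches, int n, bool queryOccupancy)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    int block = 256;
    int grid  = (n + block - 1) / block;

    for (int i = 0; i < launches; ++i) {
        if (queryOccupancy) {
            int numBlocks = 0;
            // Extra runtime call per launch, as in the scenario above.
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, work, block, 0);
        }
        work<<<grid, block, 0, stream>>>(d, n);
    }

    cudaStreamSynchronize(stream);
    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main(int argc, char **argv)
{
    int  numThreads     = (argc > 1) ? atoi(argv[1]) : 4;
    bool queryOccupancy = (argc > 2) && atoi(argv[2]) != 0;

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, 1000, 1 << 20, queryOccupancy);
    for (auto &t : threads)
        t.join();

    cudaDeviceSynchronize();
    printf("done (%d threads, occupancy query %s)\n",
           numThreads, queryOccupancy ? "on" : "off");
    return 0;
}
```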
My suggestion would be to develop the smallest/shortest possible complete test case and then file a bug. Based on what I read here, I would file the bug against the occupancy call disrupting the work issuance, but if you want to go after the increase in launch latency, by all means file a bug for that too.
I’m not sure if you are trying to add information here. nvprof is “the profiler”. nvvp uses nvprof under the hood to do its work. And I repeat, those are not the recommended profilers for Turing GPUs. So I would definitely reconfirm observations with the recommended profiler first (with respect to the increase in launch latency, not the occupancy question).