Gap between some kernel calls

Hi!

I am using a GeForce GTX 770 and CUDA 6.0. I have been experiencing some performance issues and I would like to know if anyone knows what might be happening. When analyzing the performance of some kernel calls with the Visual Profiler, I see a gap between them that repeats over time.

A couple of kernels are called one after the other in a for loop, and a “big gap” (about 0.25 ms) appears from time to time, for instance after 20 kernel executions. However, this does not happen with other kernels. I have only seen the issue with kernels that are quite fast (about 20 µs). Any guesses?
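For context, this is a minimal sketch of the launch pattern I am describing (kernelA/kernelB and the launch configuration are just placeholders, not my actual code):

```cuda
__global__ void kernelA(float *data) { /* ~20 us of work */ }
__global__ void kernelB(float *data) { /* ~20 us of work */ }

void runLoop(float *d_data, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        kernelA<<<64, 256>>>(d_data);  // back-to-back launches on the
        kernelB<<<64, 256>>>(d_data);  // default stream, no sync inside
    }
    cudaDeviceSynchronize();           // synchronize once at the end
}
```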

Here is a screenshot that shows the issue

https://drive.google.com/file/d/0B3mLPf1lM7HtaGhqT1dOUkMxT1E/view?usp=sharing

Thank you!

To my untrained eye it would seem that the kernel launches are getting queued by the driver and then sent off all at once. There is one cudaLaunch API call in between that seems to take a long time.

Do your kernels use a large number of texture references, or big kernel parameter sets? These take some time to initialize and transfer.

Are you on Linux or Windows? I can’t quite tell from the screenshot you posted.

Hi cbuchner1, thanks for your reply

I was guessing something similar; it looks like the cudaLaunch API calls are either sent in batches or are too slow to keep up with the kernels. Each cudaLaunch block appears to have a duration of 25 µs, which is actually longer than the kernel itself (20 µs). I don’t know whether that is a Visual Profiler artifact or whether it actually makes sense.

Actually, if the kernel is faster, the gaps are also wider:
https://drive.google.com/file/d/0B3mLPf1lM7HtaHl6d1BPMjNHNmc/view?usp=sharing

The kernels are quite simple: no textures are used, and the parameters are just pointers and a few integers. Here is the header:

__global__ void calculateRightOcclusionKernel(const unsigned char* disp, unsigned char* occ_map, int ncols, int nrows, int num_disp, int colStepOcc)
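For completeness, this is roughly how such a kernel might be launched (the block and grid dimensions below are illustrative assumptions, not taken from my actual code):

```cuda
// Hypothetical launch configuration for the kernel above: one thread
// per pixel of an ncols x nrows disparity map.
dim3 block(32, 8);
dim3 grid((ncols + block.x - 1) / block.x,
          (nrows + block.y - 1) / block.y);

calculateRightOcclusionKernel<<<grid, block>>>(d_disp, d_occ_map,
                                               ncols, nrows,
                                               num_disp, colStepOcc);
```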

I am working on Windows 8.1 with the GeForce 320.52 driver.

Thank you very much!

OK, I just tried with an empty kernel and got the same behaviour.

Maybe I just have to redesign the kernels to increase the work per launch and so reduce the relative kernel launch overhead. From the experiments I have done, it only seems worth launching a kernel if it runs for at least 25 µs.
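The empty-kernel experiment can be sketched roughly like this (a minimal micro-benchmark, assuming cudaEvent timing; names and sizes are illustrative):

```cuda
#include <cstdio>

__global__ void emptyKernel() {}  // no-op: measures pure launch cost

int main()
{
    const int N = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();       // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();   // N back-to-back empty launches
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch overhead: %.1f us\n", ms * 1000.0f / N);
    return 0;
}
```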

Can anyone back this up with a theoretical explanation?

Thanks!

It’s very likely a WDDM batching effect that you are observing.

In a WDDM system, the GPU is managed by the OS, and the graphics driver must interact with the OS to schedule GPU activity. This presents considerable overhead, at least compared to the behavior e.g. under Linux or under a GPU managed as a TCC device on Windows.

To mitigate the overhead, the NVIDIA GPU driver (in WDDM mode) will collect together sequences of activity intended for the GPU, and deliver them in batches, so that the cost of the communication overhead can be amortized over the batch.

I can’t give you detailed heuristics of the mechanism, but if you search around these forums you’ll find various discussions of WDDM batching including comments from Greg@NV who has given some descriptions, and methods to manipulate the behavior.
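One workaround that comes up in those discussions is issuing a stream query after each launch, which can nudge the WDDM driver to flush its command batch immediately (at the cost of some extra CPU overhead). A sketch, not a guaranteed fix; myKernel and the loop are placeholders:

```cuda
for (int i = 0; i < iterations; ++i) {
    myKernel<<<grid, block>>>(d_data);  // hypothetical kernel launch
    cudaStreamQuery(0);                 // hint the driver to submit the
                                        // queued work on stream 0 now
}
```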

You may be able to tweak a specific interaction, but I’m doubtful that you can yield a net benefit over a variety of interactions from a busy app.

Thank you txbob, that is a great point to consider. I didn’t know there was such a difference between platforms for CUDA. I am strongly considering porting the application to Linux… If I find something interesting on this topic I will write back with the results.

Thank you guys for your help!

Cheers

Hi!

Just in case anyone is interested: I have just ported the application to Linux (Ubuntu 14.04) and this “gap” effect has totally disappeared. Now the kernels are launched one after the other, as they should be. On Windows my application performed one iteration in 0.48 ms; on the same computer, with the same implementation, Linux achieves 0.35 ms. That is quite an important performance boost in my opinion. As txbob noted, it was a WDDM issue.

Just in case anyone is interested in how the Visual Profiler timeline looks now:
https://drive.google.com/file/d/0B3mLPf1lM7HtQmZablFWd1VwcG8/view?usp=sharing

Cheers