Profiling -- large empty gaps when doing a memcpy


I have an application that runs in 5 milliseconds. That is fine, of course, but I’d like to compress it further, since timeline profiling shows large empty gaps.

Since I can’t upload a picture here, I’ll describe the timeline with words…

First, 7 kernels are launched, with the usual 20 µs between them.

Then there is a huge white gap of 1 millisecond, in the middle of which sits a tiny cudaMalloc of 20 µs.

After that, the application resumes with more kernels, and the pattern repeats a few more times.

How can I remove that gap?

PS : I work on stream 1 and never synchronize anything.

Can you move the cudaMalloc to the initialization phase of your application and reuse that piece of memory? (Allocate the maximum amount of memory that you expect your code to use.)

Nope, I don’t know how much memory to allocate before this point.

The size is returned by a call to nppiLabelMarkersGetBufferSize_16u_C1R.

The sequence of NPP calls is:


– the malloc goes here


You might want to collect statistics on the buffer sizes you routinely encounter (mean, distribution) and choose a buffer size that won’t be exceeded in 99.9% of all cases.
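As a rough sketch of how such a size could be picked, assuming you log the value returned by the GetBufferSize call over a representative run: the helper below sorts the logged sizes and returns the nearest-rank percentile. The name `coveringSize` is made up for illustration, and this is plain host-side C++ with nothing NPP-specific in it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the smallest logged size that covers at least
// `fraction` (e.g. 0.999) of the samples, using the nearest-rank percentile.
// Returns 0 when no samples were logged.
std::size_t coveringSize(std::vector<std::size_t> samples, double fraction) {
    if (samples.empty()) return 0;
    std::sort(samples.begin(), samples.end());
    // Nearest-rank: rank = ceil(fraction * N), clamped to [1, N].
    std::size_t rank =
        static_cast<std::size_t>(std::ceil(fraction * samples.size()));
    rank = std::min(std::max<std::size_t>(rank, 1), samples.size());
    return samples[rank - 1];
}
```

Calling `coveringSize(loggedSizes, 0.999)` would then give the initial allocation size; adding some headroom on top is a judgment call.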

Then you only have to call cudaMalloc if your nppiCompressMarkerLabelsGetBufferSize() call returns a value that exceeds that chosen buffer size.

This logic grows the buffer on demand, starting from a conservative initial size estimate.
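A minimal sketch of such a grow-on-demand buffer could look like the following. The class name `ScratchBuffer` and the allocate/release callbacks are made up for illustration; in the real application they would wrap cudaMalloc and cudaFree. They are kept as callbacks here only so that the logic compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Grow-only scratch buffer (hypothetical helper, not part of NPP or CUDA).
// `alloc` and `release` are placeholders for wrappers around
// cudaMalloc and cudaFree.
class ScratchBuffer {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn  = std::function<void(void*)>;

    ScratchBuffer(std::size_t initialBytes, AllocFn alloc, FreeFn release)
        : alloc_(std::move(alloc)), release_(std::move(release)),
          ptr_(alloc_(initialBytes)), capacity_(ptr_ ? initialBytes : 0) {}

    ~ScratchBuffer() { if (ptr_) release_(ptr_); }

    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;

    // Returns a buffer of at least `bytes`. The allocator is only called
    // when the request exceeds the current capacity (the rare slow path).
    // Old contents are NOT preserved when the buffer grows.
    void* reserve(std::size_t bytes) {
        if (bytes <= capacity_) return ptr_;   // fast path: no allocation
        void* bigger = alloc_(bytes);
        if (!bigger) return nullptr;           // keep the old buffer on failure
        if (ptr_) release_(ptr_);
        ptr_ = bigger;
        capacity_ = bytes;
        ++growCount_;
        return ptr_;
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t growCount() const { return growCount_; }

private:
    AllocFn alloc_;
    FreeFn release_;
    void* ptr_;
    std::size_t capacity_;
    std::size_t growCount_ = 0;
};
```

The per-frame code would then call `reserve(sizeFromNpp)` in place of the cudaMalloc, so the allocator is only hit when a frame needs more than any previous one. Watching `growCount()` over a run tells you whether the initial estimate was conservative enough.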


Great idea.

I still have a gap between markerLabels and compressMarkerLabels.
I guess NPP is doing some CPU work there.