Hi,
I have an application that runs in 5 milliseconds. That is fine in itself, but I'd like to compress it further, since timeline profiling shows large empty gaps:
Since I can’t upload a picture here, I’ll describe the timeline with words…
I first have 7 kernels launched, with the usual 20 µs between them.
Then there is a huge 1 ms white gap, in the middle of which sits a tiny 20 µs cudaMalloc.
After that the application resumes with more kernels, and the pattern repeats a few more times.
How can I remove that gap?
PS: I work on stream 1 and never synchronize anything.
Can you move the cudaMalloc to the initialization phase of your application and reuse that piece of memory? (Allocate the maximum amount of memory you expect your code to use.)
Nope, I don’t know how much memory to allocate before this point.
The size is returned by a call to nppiLabelMarkersGetBufferSize_16u_C1R.
The sequence of NPP calls is:
nppiLabelMarkers_16u_C1IR_Ctx
nppiCompressMarkerLabelsGetBufferSize_16u_C1R
– the cudaMalloc goes here
nppiCompressMarkerLabels_16u_C1IR_Ctx
You might want to gather statistics on the buffer sizes you routinely encounter (mean, distribution) and choose a buffer size that likely won’t be exceeded in 99.9% of all cases.
Then you only have to call cudaMalloc when your nppiCompressMarkerLabelsGetBufferSize() call returns a value that exceeds that chosen size.
It’s logic that grows a buffer on demand, starting from a conservative initial size estimate.
Christian
Great idea.
I still have some gap between nppiLabelMarkers and nppiCompressMarkerLabels.
I guess NPP is doing some CPU work there.