Performance measurement


I’ve created several kernel functions and measured the time. I’m interested in the overhead CUDA generates.

I analysed a kernel function with the built-in Nsight analyzer. It reports a kernel duration of 3.2 microseconds. But when I measure with QueryPerformanceCounter around the kernel launch and call cudaThreadSynchronize() to make sure the kernel execution has finished, I get about 230 microseconds.
Does this mean there’s an overhead of roughly 200 microseconds per launch? That would make CUDA inefficient for small chunks of computation. Has anyone else tried to use CUDA for small amounts of “work”?
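In case it helps others reproduce this, here is a minimal sketch of timing the same launch two ways: with CUDA events on the GPU side (which brackets only the kernel execution, like the profiler does) and with a host timer around launch-plus-synchronize (which also includes driver launch overhead). The empty kernel `noop` and the exact timings are assumptions, not the code from my actual tests:

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Trivial kernel so that almost all of the host-measured time is overhead.
__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();            // warm-up: the first launch pays one-time init costs
    cudaThreadSynchronize();

    // GPU-side timing with events: measures only the kernel's execution window
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    noop<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    // Host-side timing: launch overhead + kernel + synchronization cost
    auto t0 = std::chrono::high_resolution_clock::now();
    noop<<<1, 1>>>();
    cudaThreadSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();
    double host_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("event timing: %.1f us, host timing: %.1f us\n",
           gpu_ms * 1000.0, host_us);
    return 0;
}
```

The gap between the two numbers is the per-launch overhead the question is about; std::chrono stands in here for QueryPerformanceCounter so the sketch is portable.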


Hi, I can’t judge whether your measurements are correct or what the actual times are, but 200 us of launch overhead sounds entirely plausible to me. It’s a well-known fact that GPGPU computing is not suited to small jobs.

What operating system are you using? The launch overhead varies between operating systems quite a bit.