I’ve created several kernel functions and measured the time. I’m interested in the overhead CUDA generates.
I analysed a kernel function with the built-in Nsight analyzer. It says the duration of my kernel is 3.2 microseconds. When I put QueryPerformanceCounter around the kernel launch and call cudaThreadSynchronize() to make sure that the kernel execution has finished, I measure about 230 microseconds.
Does this mean that there’s an overhead of approximately 200 microseconds? This would render CUDA inefficient for small chunks of computation. Has anyone else tried to use CUDA for small amounts of “work”?
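For reference, here is a minimal sketch of the kind of measurement I’m describing, using CUDA events instead of a host timer (the kernel name, grid size, and data here are placeholders, not my real code). Events time only the GPU-side execution, so comparing them against a host-side wall clock around launch + synchronize should expose the launch overhead I’m asking about:

```cuda
// Hypothetical toy kernel standing in for my real small workload.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Warm-up launch so one-time context/initialization cost is excluded.
    smallKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // GPU-side timing with CUDA events: measures kernel duration only,
    // similar to what the Nsight analyzer reports.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    smallKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("event time: %.1f us\n", ms * 1000.0f);
    // A host-side timer (e.g. QueryPerformanceCounter) wrapped around the
    // launch plus a synchronize call would additionally include the
    // driver/launch overhead, which is what my 230 us figure contains.

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

(Note: cudaThreadSynchronize() is deprecated in newer toolkits in favour of cudaDeviceSynchronize(); they behave the same for this purpose.)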