cudaDeviceSynchronize takes too long

When I measure the execution time of individual parts of a function that calls a kernel (cudaMalloc, cudaMemcpy from CPU to GPU, the kernel itself, cudaDeviceSynchronize, and the memcpy back from GPU to CPU), I find that all parts except synchronization together take e.g. 0.002042 sec, while synchronization itself takes 0.108922 sec. I therefore assume that the execution time is dominated by cudaDeviceSynchronize. Why is that? Is there a way to minimize the synchronization time? I tried different values for the kernel grid size and block size, and found that when the synchronization time decreases, the measured kernel time increases. So I cannot find a way to decrease the total duration.

It is impossible to tell with certainty from the cursory description provided, but it sounds like your code involves issuing work asynchronously, and that work hasn’t finished yet when the code execution reaches cudaDeviceSynchronize. As a consequence, the time you measure for cudaDeviceSynchronize reflects the time for the API call itself plus the time of all outstanding work it is waiting on to finish.
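To illustrate the effect, here is a minimal sketch, not the original poster's code. The kernel busyKernel, the helper secondsSince, and the sizes and iteration count are all made up for the demonstration. It times the kernel launch and the following cudaDeviceSynchronize separately with a host timer. Since kernel launches return control to the host before the kernel has finished, the first number should typically come out small and the second should absorb the kernel's run time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that only burns time so the effect is visible.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.000001f + 0.5f;
        data[i] = x;
    }
}

// Hypothetical helper: host wall-clock seconds elapsed since t0.
static double secondsSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;          // arbitrary problem size
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));
    cudaDeviceSynchronize();        // make sure setup work is finished

    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d, n, 10000);
    double launchSec = secondsSince(t0);   // launch only, kernel still running

    auto t1 = std::chrono::steady_clock::now();
    cudaError_t err = cudaDeviceSynchronize();  // blocks until the kernel is done
    double syncSec = secondsSince(t1);

    printf("launch returned after %f sec\n", launchSec);
    printf("cudaDeviceSynchronize took %f sec\n", syncSec);

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

The first launch in a process can also include one-time initialization cost, so the launch number may be larger than on later launches. The point is only that the time reported for cudaDeviceSynchronize is mostly the kernel's own execution time.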

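If so, you would want to update your measurement methodology so that the time spent waiting for the kernel is attributed to the kernel rather than to cudaDeviceSynchronize, for example by stopping the kernel timer only after the synchronization has returned. Or try the CUDA profiler if you haven’t already.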
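One possible way to change the measurement is to time the kernel with CUDA events, which are recorded in the GPU's stream rather than on the host clock. The sketch below assumes a placeholder kernel (busyKernel) and arbitrary sizes. It uses cudaEventRecord, cudaEventSynchronize, and cudaEventElapsedTime from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.000001f + 0.5f;
        data[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;          // arbitrary problem size
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                         // enqueued before the kernel
    busyKernel<<<(n + 255) / 256, 256>>>(d, n, 10000);
    cudaEventRecord(stop);                          // enqueued after the kernel

    cudaEventSynchronize(stop);                     // wait until 'stop' is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

With this approach the kernel's duration is reported as the kernel's, and any cudaDeviceSynchronize issued afterwards has nothing left to wait for. Reducing the total duration then comes down to making the kernel itself faster, not the synchronization call.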