cudaDeviceSynchronize takes too long

pbassia · April 28, 2020, 7:18pm

When I measure the execution time of the individual parts of a function that calls a kernel (cudaMalloc, cudaMemcpy from CPU to GPU, the kernel itself, cudaDeviceSynchronize, and the memcpy back from GPU to CPU), I find that all parts except synchronization take e.g. 0.002042 sec, while synchronization itself takes 0.108922 sec. Therefore I assume that the execution time is dominated by cudaDeviceSynchronize. Why is that? Is there a way to minimize the synchronization time? I tried different values for the kernel grid size and block size. I found that when the synchronization time decreases, the actual kernel time increases. So I cannot find a way to decrease the total duration.

njuffa · September 9, 2020, 11:17pm

It is impossible to tell with certainty from the cursory description provided, but it sounds like your code involves issuing work asynchronously, and that work hasn’t finished yet when the code execution reaches cudaDeviceSynchronize. As a consequence, the time you measure for cudaDeviceSynchronize reflects the time for the API call itself plus the time of all outstanding work it is waiting on to finish.
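To illustrate the effect, here is a minimal sketch that uses a host-side clock the way the question appears to. The kernel `busyKernel`, the struct `HostTimings`, and the function `measureLaunchAndSync` are hypothetical names made up for this example, not taken from the original code, and error checking is omitted for brevity. Under these assumptions, the launch itself should return almost immediately, and nearly all of the kernel's run time should show up in the interval around cudaDeviceSynchronize.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel: enough arithmetic per element to take measurable time.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) {
            x = x * 1.0001f + 0.5f;
        }
        data[i] = x;
    }
}

// Host-side wall-clock intervals, in seconds.
struct HostTimings {
    double launch_s;  // time spent in the kernel launch call
    double sync_s;    // time spent in cudaDeviceSynchronize
};

HostTimings measureLaunchAndSync(int n, int iters) {
    using clock = std::chrono::steady_clock;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization cost is not measured.
    busyKernel<<<blocks, threads>>>(d, n, 1);
    cudaDeviceSynchronize();

    auto t0 = clock::now();
    busyKernel<<<blocks, threads>>>(d, n, iters);  // asynchronous: only queues the work
    auto t1 = clock::now();
    cudaDeviceSynchronize();                       // blocks until the kernel has finished
    auto t2 = clock::now();

    cudaFree(d);
    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t1).count() };
}
```

Calling, for example, `measureLaunchAndSync(1 << 20, 10000)` and printing both fields should show the pattern described in the question: a small "kernel" time and a large "synchronization" time, even though the GPU work is the kernel.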

If so, you would want to update your measurement methodology. Or try the CUDA profiler if you haven’t already.
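One common way to change the methodology is to time the GPU work with CUDA events instead of a host clock. The following is only a sketch under assumptions: `scaleKernel` and `kernelTimeMs` are hypothetical names for illustration, the work is issued to the default stream, and error checking is left out. The event calls themselves (cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaEventDestroy) are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real workload.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Returns the elapsed GPU time of one kernel launch in milliseconds,
// measured between two events recorded in the default stream.
float kernelTimeMs(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // marker before the kernel
    scaleKernel<<<blocks, threads>>>(d, n, 2.0f);
    cudaEventRecord(stop);                       // marker after the kernel
    cudaEventSynchronize(stop);                  // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // time between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Because the events are recorded in the same stream as the kernel, the interval between them covers the kernel's execution on the device rather than the time the host spends in the launch call. The same pattern can be placed around the cudaMemcpy calls to time the transfers separately.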
