cudaDeviceSynchronize() doesn't wait for cudaMemcpy to finish?

My code flow looks like the following:
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
cudaDeviceSynchronize();
// ... kernel launch that reads dst ...

The above flow doesn’t work reliably: roughly 50% of calls leave incorrect values in the device pointer, causing my CUDA kernel to fail.

However, when I add a sleep between the cudaMemcpy and the kernel call, it works every time:
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
usleep(200000); // sleep for 0.2 seconds
// ... kernel launch that reads dst ...

dst and src are arrays of int32_t with 117,000,000 elements.
The GPU where this program fails is an NVIDIA Tesla P40.
The same program works fine without usleep or cudaDeviceSynchronize on an NVIDIA GeForce GTX 1080, so it seems to be device-specific, though other factors may be involved.
I did make sure the driver versions were the same on both machines.

I’m not sure why cudaDeviceSynchronize() doesn’t help but usleep() does.

Also, whenever the program fails, I see the same value “1016296637” stored in all the locations on the device side.

Is it possible that cudaDeviceSynchronize doesn’t make the CPU wait for the cudaMemcpy to finish?

Do you check for cuda runtime api errors?
Does cuda-memcheck report any errors?
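For reference, a minimal error-checking pattern looks like the sketch below (the macro name and the kernel usage shown in the comment are illustrative, not part of the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: checks the return value of any CUDA runtime
// call and aborts with a readable message on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice));
//   kernel<<<grid, block>>>(dst);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async execution errors
```

Checking cudaGetLastError() after the launch and the status of cudaDeviceSynchronize() is the only way to see asynchronous kernel failures; an unchecked error from an earlier call can otherwise surface much later and look like corruption.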

I used cuda-memcheck and couldn’t find any errors with memcheck, racecheck, or synccheck.
initcheck gave me this issue:
========= Host API memory access error at host access to 0x7fa4e2e274d8 of size 32792 bytes
========= Uninitialized access at 0x7fa4e2e2d500 on access by cudaMemcopy source.

You would want to fix the issues reported by cuda-memcheck right away.

Note that for device-to-device transfers, cudaMemcpy() may return before the copy is complete with respect to the host: the API synchronization behavior section of the CUDA documentation states that no host-side synchronization is performed for device-to-device copies. However, the copy and a subsequent kernel launch issued to the same (default) stream execute in order on the device, so the kernel cannot start until the copy has finished. A cudaDeviceSynchronize() between them is therefore not needed for correctness.
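As a sketch of that ordering guarantee (the kernel name and launch configuration here are illustrative, assuming n elements and 256 threads per block):

```cuda
#include <cstdint>
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever consumes dst.
__global__ void consume(const int32_t* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... use data[i] ...
    }
}

// Both operations are issued to the default stream, so the GPU runs
// them in order: the kernel does not begin executing until the
// device-to-device copy has completed on the device, even though
// cudaMemcpy may return to the host before the copy finishes.
void copy_then_launch(int32_t* dst, const int32_t* src, size_t n) {
    cudaMemcpy(dst, src, n * sizeof(int32_t), cudaMemcpyDeviceToDevice);
    consume<<<(n + 255) / 256, 256>>>(dst, n);  // no host sync needed here
}
```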