My code flow looks like the following:
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
cudaDeviceSynchronize();
KernelCall
The above flow does not work reliably: it fails on almost 50% of calls, leaving incorrect values in the device pointer, which causes my CUDA kernel to fail.
However, when I add a sleep between cudaMemcpy and the kernel call, it works fine every time.
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
usleep(200000); // sleep for 0.2 seconds.
KernelCall
dst and src are arrays of int32_t of size 117,000,000.
The GPU where this program fails is an NVIDIA Tesla P40.
The same program works fine without usleep or cudaDeviceSynchronize on an NVIDIA GeForce GTX 1080, so it seems to be device-specific, though other factors may be involved.
I did make sure the driver versions were the same on both GPUs.
I’m not sure why cudaDeviceSynchronize doesn’t work but usleep() works.
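One way to narrow this down might be to check the return code of every runtime call in the flow, since a failed copy or an earlier asynchronous error would otherwise go unnoticed. The following is only a minimal sketch under stated assumptions: `CUDA_TRY`, `placeholderKernel`, and `runCheckedFlow` are hypothetical names that are not part of the original program, the kernel is a trivial stand-in for the real one, and the array size is reduced for illustration.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Report a failed CUDA runtime call and return false from the enclosing function.
// (Hypothetical helper; allocations are not freed on the error path, for brevity.)
#define CUDA_TRY(call)                                                   \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,     \
                         __LINE__, #call, cudaGetErrorString(err_));     \
            return false;                                                \
        }                                                                \
    } while (0)

// Hypothetical stand-in for the real kernel: out[i] = in[i] + 1.
__global__ void placeholderKernel(const int32_t* in, int32_t* out, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Same order of operations as in the post, with every call checked.
bool runCheckedFlow(size_t n) {
    const size_t size = n * sizeof(int32_t);
    int32_t *src = nullptr, *dst = nullptr, *out = nullptr;
    CUDA_TRY(cudaMalloc(&src, size));
    CUDA_TRY(cudaMalloc(&dst, size));
    CUDA_TRY(cudaMalloc(&out, size));
    CUDA_TRY(cudaMemset(src, 0, size));

    CUDA_TRY(cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice));
    CUDA_TRY(cudaDeviceSynchronize());

    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((n + threads - 1) / threads);
    placeholderKernel<<<blocks, threads>>>(dst, out, n);
    CUDA_TRY(cudaGetLastError());       // errors detected at launch time
    CUDA_TRY(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_TRY(cudaFree(src));
    CUDA_TRY(cudaFree(dst));
    CUDA_TRY(cudaFree(out));
    return true;
}

int main() {
    // Small illustrative size; the arrays in the post are of size 117,000,000.
    const bool ok = runCheckedFlow(1 << 20);
    std::printf("%s\n", ok ? "all CUDA calls succeeded" : "a CUDA call failed");
    return ok ? 0 : 1;
}
```

If any call reports an error on the P40 but not on the GTX 1080, that would point to something other than timing, for example an allocation failure or an error left over from an earlier kernel.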
Also, whenever the program fails, I see the same value “1016296637” stored in all the locations on the device side.
Is it possible that cudaDeviceSynchronize doesn’t make the CPU wait for the cudaMemcpy to finish?
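A direct way to test that question could be to read dst back to the host right after cudaDeviceSynchronize and compare it with the source data, which separates "the copy is incomplete" from "the kernel misbehaves". This is a sketch under assumptions: `countMismatches` is a hypothetical helper, the source is filled with a simple known pattern rather than the real data, and the size is reduced for illustration.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns the number of elements where dst differs from src after the
// device-to-device copy and synchronization, or -1 if a CUDA call fails.
long long countMismatches(size_t n) {
    const size_t size = n * sizeof(int32_t);
    std::vector<int32_t> hostSrc(n), hostDst(n);
    for (size_t i = 0; i < n; ++i) hostSrc[i] = static_cast<int32_t>(i);

    int32_t *src = nullptr, *dst = nullptr;
    long long mismatches = -1;
    if (cudaMalloc(&src, size) == cudaSuccess &&
        cudaMalloc(&dst, size) == cudaSuccess &&
        cudaMemcpy(src, hostSrc.data(), size, cudaMemcpyHostToDevice) == cudaSuccess &&
        // The sequence under test: device-to-device copy, then synchronize.
        cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice) == cudaSuccess &&
        cudaDeviceSynchronize() == cudaSuccess &&
        cudaMemcpy(hostDst.data(), dst, size, cudaMemcpyDeviceToHost) == cudaSuccess) {
        mismatches = 0;
        for (size_t i = 0; i < n; ++i)
            if (hostDst[i] != hostSrc[i]) ++mismatches;
    }
    cudaFree(src);  // cudaFree on a null pointer performs no operation
    cudaFree(dst);
    return mismatches;
}

int main() {
    // Small illustrative size; the arrays in the post are of size 117,000,000.
    const long long bad = countMismatches(1 << 20);
    std::printf("mismatching elements: %lld\n", bad);
    return bad == 0 ? 0 : 1;
}
```

If this reports zero mismatches on the failing GPU at the full array size, the copy itself is probably completing, and the wrong values would more likely come from somewhere else in the program.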