Unable to understand the unwanted time taken by cudaDeviceSynchronise()

My application is written to do inference for a custom segmentation model. When I profiled the application using Nsight Systems, I found that the cudaMemcpy from device to host was taking around ~33ms, whereas the host-to-device copy was under 1ms. I read some articles which suggested using cudaDeviceSynchronise() to separate out the kernel execution time. When I added that to my code, the device-to-host cudaMemcpy dropped to under 1ms, but cudaDeviceSynchronise() was now taking almost 33ms.

Here’s the timeline view after adding cudaDeviceSynchronise():

Here’s the code:

cv::minMaxLoc(cpu_img, &minVal, &maxVal, &minLoc, &maxLoc);
cpu_img.convertTo(cpu_img, CV_32FC1);
cpu_img = cpu_img / float(maxVal);
cudaMemcpyAsync(buffers[0], cpu_img.data, sizeof(float) * dl_input_size * dl_input_size, cudaMemcpyHostToDevice, stream);
context->enqueueV2(buffers.data(), stream, nullptr);
cudaDeviceSynchronise();
std::vector<float> cpu_output(getSizeByDim(dims[0]) * batch_size);
cudaMemcpyAsync(cpu_output.data(), (float*)gpu_output[1], cpu_output.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
std::vector<float> cpu_output_landmark(getSizeByDim(dims[1]) * batch_size);
cudaMemcpyAsync(cpu_output_landmark.data(), (float*)gpu_output[2], cpu_output_landmark.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

There is no call like that, so I guess that is not actually your code. cudaDeviceSynchronize() is spelled with a z, not an s.

To understand the time that is now taken by cudaDeviceSynchronize(), you need to understand that the work you issued to the GPU does not execute in zero time. It takes some time to run, and because the GPU executes asynchronously with respect to the host, that execution time shows up at the cudaDeviceSynchronize() call.
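
Here is a minimal, self-contained sketch of that effect (not your code; busy_kernel below is just a stand-in for the inference work). The launch call returns almost immediately, and the kernel's run time only becomes visible when the host blocks at the synchronization point:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < iters; ++i)      // burn time to imitate a long-running kernel
            v = v * 1.000001f + 0.000001f;
        data[idx] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));  // contents don't matter for the timing

    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 100000);
    auto t1 = clk::now();                    // launch has returned, kernel still running
    cudaDeviceSynchronize();                 // CPU blocks here until the kernel finishes
    auto t2 = clk::now();

    std::printf("launch call: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("synchronize: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());

    cudaFree(d_data);
    return 0;
}

Running something like this, the launch call typically reports a few microseconds while the synchronize call reports roughly the kernel's full run time, even though the kernel itself did all the work.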

The work you are asking the GPU to do here:

context->enqueueV2(buffers.data(), stream, nullptr);

takes time. It does not execute in zero time. If you immediately follow that call with cudaDeviceSynchronize(), you are telling the host CPU thread to “wait here until the previously issued work on the GPU has finished”, so it waits there, and the execution time of the inference request shows up at that point.
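
If you want to attribute the ~33ms explicitly to the inference rather than to the copies, one option (just a sketch, reusing the context, buffers and stream names from the code you posted) is to bracket the enqueueV2() call with CUDA events on the same stream and read the elapsed time after the stream synchronize:

cudaEvent_t infStart, infStop;
cudaEventCreate(&infStart);
cudaEventCreate(&infStop);

cudaEventRecord(infStart, stream);           // marks the point just before the inference
context->enqueueV2(buffers.data(), stream, nullptr);
cudaEventRecord(infStop, stream);            // marks the point just after the inference

// ... issue the device-to-host copies on the same stream here ...

cudaStreamSynchronize(stream);               // wait for everything queued on the stream

float infMs = 0.0f;
cudaEventElapsedTime(&infMs, infStart, infStop);   // GPU-side inference time in ms
printf("inference took %.2f ms on the GPU\n", infMs);

cudaEventDestroy(infStart);
cudaEventDestroy(infStop);

Because the events are recorded on the stream, the measured interval covers only the inference work, regardless of where the host thread happens to block.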