Very long D2H duration compared to H2D

aven_omega · January 9, 2022, 3:28am

I’m a beginner trying to understand more about cuda with my sample application. Below is the snippet where my kernels are invoked. I load data in, do some operations and reduce it to a single value which I copy back to the host. The memCopy back to the device tends to take around 15 uS but the memCopy from device to host takes far longer, typically around 70 uS. This is odd to me as the device to host only ever transfers one short whereas the host to device is moving over a thousand. I believe both memory addresses memory are pinned as I allocated them using cudaMallocHost.

According to discussions like this one H2D and D2H should be roughly the same speeds. Any ideas for why my D2H takes almost 5 times as long as H2D?

Code:

cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);

construct(gpu_permutation_data, gpu_matrix_UTM, gpu_guide_construction, gpu_constant_permutationSize, matrix_size, rowsPerThread);

difference(gpu_matrix_UTM, gpu_matrix_base, matrix_size, rowsPerThread);

summation(gpu_matrix_UTM, gpu_row_sums, gpu_guide_summation, gpu_constant_matrixSize, gpu_constant_sumReductions, rowsPerThread, helper->summation_threads);

maxima(gpu_row_sums, gpu_guide_maxima, gpu_constant_matrixSize, gpu_constant_maxima, gpu_constant_maxReductions, helper->maxima_threads);

cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);

Hard to see but this is from NSight profiler:

This is a fairly typical cycle, I do hundreds or more of these in a run and they almost all have similar timing.

aven_omega · February 16, 2022, 1:47pm

For anyone else who has this issue - it turns out that the D2H wasn’t actually longer, it was just reporting on host times. The memory transfer API call starts only after all preceding kernels on the default stream finish, but the host will start the timer for the API as soon as it is issued from the host, before the kernels have actually finished.

system · March 2, 2022, 1:48pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.