Very long D2H duration compared to H2D

I’m a beginner trying to understand more about cuda with my sample application. Below is the snippet where my kernels are invoked. I load data in, do some operations and reduce it to a single value which I copy back to the host. The memCopy back to the device tends to take around 15 uS but the memCopy from device to host takes far longer, typically around 70 uS. This is odd to me as the device to host only ever transfers one short whereas the host to device is moving over a thousand. I believe both memory addresses memory are pinned as I allocated them using cudaMallocHost.

According to discussions like this one H2D and D2H should be roughly the same speeds. Any ideas for why my D2H takes almost 5 times as long as H2D?

Code:

cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);

construct(gpu_permutation_data, gpu_matrix_UTM, gpu_guide_construction, gpu_constant_permutationSize, matrix_size, rowsPerThread);

difference(gpu_matrix_UTM, gpu_matrix_base, matrix_size, rowsPerThread);

summation(gpu_matrix_UTM, gpu_row_sums, gpu_guide_summation, gpu_constant_matrixSize, gpu_constant_sumReductions, rowsPerThread, helper->summation_threads);

maxima(gpu_row_sums, gpu_guide_maxima, gpu_constant_matrixSize, gpu_constant_maxima, gpu_constant_maxReductions, helper->maxima_threads);

cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);

Hard to see but this is from NSight profiler:

This is a fairly typical cycle, I do hundreds or more of these in a run and they almost all have similar timing.