Hello, I have an input array hostInputBuffer storing an image of 800x1376 pixels, and an output array output_array that will store the value of a marked output node for TensorRT inference. When I run inference with a TensorRT engine built from a ResNet-50 network, I find that the operation cudaMemcpyAsync(buffers[inputIndex], hostInputBuffer, BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float), cudaMemcpyHostToDevice, stream) is fast, only about 1.5 ms, but the operation cudaMemcpyAsync(score, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream) is very slow, about 40 ms.
Here, INPUT_H = 1376, INPUT_W = 800, BATCH_SIZE = 1, OUTPUT_SIZE = INPUT_H/4 * INPUT_W/4.
Are there other methods to speed up cudaMemcpyDeviceToHost?
What measurement methodology was used? The CUDA profiler should be able to separate out time for kernel execution from time for the async copies.
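For example (the application name here is just a placeholder), running the program under nvprof prints a summary in which [CUDA memcpy HtoD], [CUDA memcpy DtoH], and each kernel appear with their own timings, so the copy times can be read off directly:

nvprof ./trt_app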
To isolate copies from kernel execution when using manual measurements of some kind, insert a call to cudaDeviceSynchronize() just before the timed portion of the code.
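For illustration, a minimal sketch of that isolation, assuming the stream, buffers, outputIndex, score, and size constants from the original code:

// Drain all previously launched GPU work so the timer below measures
// only the device-to-host copy, not kernels that are still pending.
cudaDeviceSynchronize();

auto t0 = std::chrono::high_resolution_clock::now();
cudaMemcpyAsync(score, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   // wait for the copy itself to complete
auto t1 = std::chrono::high_resolution_clock::now();

printf("device-to-host copy: %.3f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count());

(This uses std::chrono from <chrono> and printf from <cstdio>.)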
In general, for fast copies between host/device, make sure a PCIe gen3 x16 link is used for the GPU, which allows for the transfer of 12.0 - 12.5 GB/sec in either direction for large transfers such as this one (~12MB based on the information provided).
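As a rough sanity check against the sizes given above (an illustrative calculation only, assuming ~12 GB/s of effective bandwidth):

#include <cstdio>

int main() {
    const double bandwidth = 12.0e9;                                        // bytes/sec, PCIe gen3 x16
    const double h2d_bytes = 1.0 * 1376 * 800 * 3 * sizeof(float);          // ~12.6 MB input
    const double d2h_bytes = 1.0 * (1376 / 4) * (800 / 4) * sizeof(float);  // ~0.27 MB output
    printf("expected H2D: %.2f ms\n", h2d_bytes / bandwidth * 1e3);         // roughly 1.1 ms
    printf("expected D2H: %.3f ms\n", d2h_bytes / bandwidth * 1e3);         // roughly 0.02 ms
    return 0;
}

On that basis the reported 1.5 ms host-to-device time is plausible, while 40 ms cannot be explained by the device-to-host copy itself, which supports the suspicion below that the measurement, not the copy, is the issue.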
I don’t see anything in the code snippet that measures the timing data reported in the original post. How were those execution times determined?
My working hypothesis is that there is no issue with slow copies from device to host, but that there is an issue with how the reported times (1.5ms, 40ms) were measured.
Can you state, for the record, exactly how much data was copied for the two transfers in question? I don’t see any definitions for BATCH_SIZE, INPUT_H, INPUT_W, OUTPUT_SIZE.
Since GPU kernel launches are asynchronous, I believe your timing measurement is absorbing GPU kernel work that was launched previously.
I think you may be confused about what you are measuring. You’ve now updated the code to something that can be discussed, but haven’t shown the actual output, i.e., the timing measurements reported by that code.
Since output_array is not in pinned host memory, the cudaMemcpyAsync operation there will convert to an ordinary cudaMemcpy operation, i.e., it will be blocking. This will certainly cause that measurement to absorb all previous asynchronous activity.
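For illustration, a sketch of allocating the host output buffer as pinned (page-locked) memory, assuming BATCH_SIZE and OUTPUT_SIZE as defined above:

// Pinned host allocation: a cudaMemcpyAsync into this buffer can return
// asynchronously instead of degrading to a blocking pageable copy.
float* output_array = nullptr;
cudaHostAlloc((void**)&output_array,
              BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
              cudaHostAllocDefault);

// ... enqueue inference and the async device-to-host copy on `stream` ...

cudaFreeHost(output_array);   // release the pinned allocation when done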
I have a strong suspicion you don’t fully understand the nature of CUDA asynchronous execution, its implications for timing, or the general behavior of cudaMemcpyAsync.
@Robert_Crovella
Thanks for your reply. After studying some material on CUDA streams, I now know how to measure the time of a cudaMemcpyAsync operation (e.g., using cudaEvent_t).
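For reference, a minimal cudaEvent_t timing sketch for the device-to-host copy, assuming the stream, buffers, outputIndex, score, and size constants from the original code:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record events on the same stream as the copy, so the measured interval
// covers only the work enqueued between them.
cudaEventRecord(start, stream);
cudaMemcpyAsync(score, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);            // wait until the copy has completed
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("device-to-host copy: %.3f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);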