`cudaMemcpyDeviceToHost` is very slow

Hello, I have an input array `hostInputBuffer` storing an image of 800x1376 pixels, and an output array `output_array` that will store the values of a marked output node for TensorRT inference. When I run inference with a TensorRT engine built from a ResNet-50 network, I find that the host-to-device copy, `cudaMemcpyAsync(buffers[inputIndex], hostInputBuffer, BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float), cudaMemcpyHostToDevice, stream)`, is fast (only about 1.5 ms), but the device-to-host copy, `cudaMemcpyAsync(output_array, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream)`, is very slow (about 40 ms).
Here, INPUT_H = 1376, INPUT_W = 800, BATCH_SIZE = 1, OUTPUT_SIZE = INPUT_H/4 * INPUT_W/4.

Are there other methods to speed up `cudaMemcpyDeviceToHost`?

Linux distro and version:

LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.4.1708 (Core)
Release:	7.4.1708
Codename:	Core

Other environment details:

GPU type: Tesla V100
nvidia driver version: NVIDIA-SMI 396.44
CUDA version: 9.0
CUDNN version: 7.3.0
Python version [if using python]: python2.7
TensorRT version: 5.0.2.6
tensorflow-gpu:1.4.1
gcc>5.3/lib64

Moving topic to CUDA performance for support coverage.

The device->host copy is probably coming after a kernel call, and the time measurement is absorbing the time of the kernel call.

Either that, or your OUTPUT_SIZE is ~30x larger than your INPUT_H * INPUT_W * 3, which doesn’t seem likely.

What measurement methodology was used? The CUDA profiler should be able to separate out time for kernel execution from time for the async copies.

To isolate copies from kernel execution when using manual measurements of some kind, insert a call to cudaDeviceSynchronize() just before the timed portion of the code.
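For example, a minimal sketch of that idea (not the poster's exact code; `dev_buf`, `host_buf`, `nbytes`, and `stream` are placeholders):

```cpp
// Sketch: isolating one async copy in a host-side timing measurement.
// Assumes dev_buf, host_buf, nbytes, and stream already exist.
cudaDeviceSynchronize();   // drain all previously launched GPU work first
auto t0 = std::chrono::high_resolution_clock::now();
cudaMemcpyAsync(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   // wait for the copy itself to finish
auto t1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> copy_ms = t1 - t0;
```

Because of the preceding cudaDeviceSynchronize(), the measured interval now covers only the copy, not kernel work launched earlier.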

In general, for fast copies between host/device, make sure a PCIe gen3 x16 link is used for the GPU, which allows for the transfer of 12.0 - 12.5 GB/sec in either direction for large transfers such as this one (~12MB based on the information provided).

@NVES thanks.

@Robert_Crovella so is there no way to speed up copying data from device to host?

@njuffa I do not understand your meaning. The TensorRT example is as follows:

... // create engine

        // create context
        IExecutionContext* context = engine->createExecutionContext();

        // create GPU buffers
        const int inputIndex = 0, outputIndex = 1; // binding indices of input/output
        buffers = new void*[2];
        ...
        int INPUT_H = 1376, INPUT_W = 800;
        int BATCH_SIZE = 1;
        int OUTPUT_SIZE = INPUT_H / 4 * INPUT_W / 4;
        float* output_array = new float[BATCH_SIZE * OUTPUT_SIZE];
        cudaMalloc(&buffers[inputIndex], BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float));
        cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float));

        // create stream and run inference
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        cudaMemcpyAsync(buffers[inputIndex], hostInputBuffer, BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream);
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        duration<double, std::milli> time_span_2 = t2 - t1;
        std::cout << "cudaMemcpyAsync(buffers[inputIndex], it took me total " << time_span_2.count() << " milliseconds.\n";
        context->enqueue(BATCH_SIZE, buffers, stream, nullptr);
        high_resolution_clock::time_point t3 = high_resolution_clock::now();
        duration<double, std::milli> time_span_3 = t3 - t2;
        std::cout << "context->enqueue(BATCH_SIZE, it took me total " << time_span_3.count() << " milliseconds.\n";
        cudaMemcpyAsync(output_array, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream);
        high_resolution_clock::time_point t4 = high_resolution_clock::now();
        duration<double, std::milli> time_span_4 = t4 - t3;
        std::cout << "cudaMemcpyAsync(output, it took me total " << time_span_4.count() << " milliseconds.\n";
        cudaStreamSynchronize(stream);
        high_resolution_clock::time_point t5 = high_resolution_clock::now();
        duration<double, std::milli> time_span_5 = t5 - t1;
        std::cout << "infer one, it took me total " << time_span_5.count() << " milliseconds.\n";

I don’t see anything in the code snippet that measures the timing data reported in the original post. How were those execution times determined?

My working hypothesis is that there is no issue with slow copies from device to host, but that there is an issue with how the reported times (1.5ms, 40ms) were measured.

Can you state, for the record, exactly how much data was copied for the two transfers in question? I don’t see any definitions for BATCH_SIZE, INPUT_H, INPUT_W, OUTPUT_SIZE.

@njuffa thanks for your reply.

I use high_resolution_clock to measure the elapsed time, and I set:

int INPUT_H=1376, INPUT_W=800; 
int BATCH_SIZE=1;
int OUTPUT_SIZE=INPUT_H/4*INPUT_W/4;

I have updated my post. Please check again.

Looking forward to your advice.

Since GPU kernel launches are asynchronous, I believe your timing measurement is absorbing GPU kernel work launched previously.

I think you may be confused about what you are measuring. You’ve now updated the code to something that can be discussed, but you haven’t shown the actual output, i.e., the timing measurements reported by that code.

Since output_array is not in pinned host memory, the cudaMemcpyAsync operation there will convert to an ordinary cudaMemcpy operation, i.e., it will be blocking. This will certainly cause that measurement to absorb all previous asynchronous activity.
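A minimal sketch of allocating the result buffer in pinned (page-locked) memory instead, using the buffer names and sizes from this thread (error checking omitted):

```cpp
// Allocate pinned host memory so cudaMemcpyAsync can be truly asynchronous.
float* output_array = nullptr;
cudaMallocHost((void**)&output_array, BATCH_SIZE * OUTPUT_SIZE * sizeof(float));

// The device-to-host copy can now return immediately and overlap other work;
// synchronize on the stream before reading output_array on the host.
cudaMemcpyAsync(output_array, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

// Pinned allocations must be released with cudaFreeHost, not delete[].
cudaFreeHost(output_array);
```

Pinned memory also transfers faster than pageable memory, since the driver can DMA directly from it without an intermediate staging copy.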

I have a strong suspicion you don’t understand the nature of CUDA asynchronous execution, its implications for timing, or the general behavior of cudaMemcpyAsync.

This recent thread may be of interest:

https://devtalk.nvidia.com/default/topic/1045138/cuda-programming-and-performance/is-there-any-way-to-copy-data-from-device-to-host-more-efficiently-in-this-case-/

@Robert_Crovella
Thanks for your reply. After studying some materials about CUDA streams, I now know how to measure the time of a cudaMemcpyAsync operation (e.g., using cudaEvent_t).
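For reference, a minimal cudaEvent_t timing sketch along those lines, reusing the buffer names from this thread (error checking omitted):

```cpp
// Sketch: timing a device-to-host copy with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record events into the same stream as the copy, so the interval
// brackets only the copy, regardless of other pending host-side work.
cudaEventRecord(start, stream);
cudaMemcpyAsync(output_array, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);   // wait until the copy and stop event complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "device-to-host copy: " << ms << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Since events execute in stream order, the start event fires only after all earlier work in the stream (e.g., the enqueued kernels) has finished, so the measured interval excludes kernel execution time.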

Thank you very much!