`cudaMemcpyDeviceToHost` is very slow

Hello, I have an input array `hostInputBuffer` storing an image of 800x1376 pixels, and an output array `output_array` that will store the values of a marked output node for TensorRT inference. When I run inference with a TensorRT engine built from a ResNet-50 network, I find that the host-to-device copy, `cudaMemcpyAsync(buffers[inputIndex], hostInputBuffer, BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float), cudaMemcpyHostToDevice, stream)`, is fast (only about 1.5 ms), but the device-to-host copy, `cudaMemcpyAsync(output_array, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream)`, is very slow (about 40 ms).
Here, INPUT_H = 1376, INPUT_W = 800, BATCH_SIZE = 1, OUTPUT_SIZE = INPUT_H/4 * INPUT_W/4.

Are there other methods to speed up `cudaMemcpyDeviceToHost`?

Linux distro and version:

LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.4.1708 (Core)
Release:	7.4.1708
Codename:	Core

Other environment details:

GPU type: Tesla V100
nvidia driver version: NVIDIA-SMI 396.44
CUDA version: 9.0
CUDNN version: 7.3.0
Python version [if using python]: python2.7
TensorRT version: 5.0.2.6
tensorflow-gpu:1.4.1
gcc>5.3/lib64

Moving topic to CUDA performance for support coverage.

The device->host copy is probably coming after a kernel call, and the time measurement is absorbing the time of the kernel call.

Either that, or your OUTPUT_SIZE is ~30x larger than your INPUT_H * INPUT_W * 3, which doesn’t seem likely.

What measurement methodology was used? The CUDA profiler should be able to separate out time for kernel execution from time for the async copies.

To isolate copies from kernel execution when using manual measurements of some kind, insert a call to cudaDeviceSynchronize() just before the timed portion of the code.
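For example, a minimal sketch of that idea (not the poster's exact code; `dev_buf`, `host_buf`, `nbytes`, and `stream` are placeholders):

```cpp
// Sketch: isolating one async copy in a host-side timing measurement.
// Assumes dev_buf, host_buf, nbytes, and stream already exist.
cudaDeviceSynchronize();   // drain all previously launched GPU work first
auto t0 = std::chrono::high_resolution_clock::now();
cudaMemcpyAsync(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   // wait for the copy itself to finish
auto t1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> copy_ms = t1 - t0;
```

Because of the preceding cudaDeviceSynchronize(), the measured interval now covers only the copy, not kernel work launched earlier.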

In general, for fast copies between host/device, make sure a PCIe gen3 x16 link is used for the GPU, which allows for the transfer of 12.0 - 12.5 GB/sec in either direction for large transfers such as this one (~12MB based on the information provided).

@NVES thanks.

@Robert_Crovella so is there no way to speed up copying data from device to host?

@njuffa I do not understand your meaning. The TensorRT example is as follows:

... // create engine

        // create context
        IExecutionContext* context = engine->createExecutionContext();

        // create GPU buffers
        const int inputIndex = 0, outputIndex = 1; // binding indices of input/output
        buffers = new void*[2];
        ...
        int INPUT_H = 1376, INPUT_W = 800;
        int BATCH_SIZE = 1;
        int OUTPUT_SIZE = INPUT_H / 4 * INPUT_W / 4;
        float* output_array = new float[BATCH_SIZE * OUTPUT_SIZE];
        cudaMalloc(&buffers[inputIndex], BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float));
        cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float));

        // create stream and run inference
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        cudaMemcpyAsync(buffers[inputIndex], hostInputBuffer, BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream);
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        duration<double, std::milli> time_span_2 = t2 - t1;
        std::cout << "cudaMemcpyAsync(buffers[inputIndex], it took me total " << time_span_2.count() << " milliseconds.\n";
        context->enqueue(BATCH_SIZE, buffers, stream, nullptr);
        high_resolution_clock::time_point t3 = high_resolution_clock::now();
        duration<double, std::milli> time_span_3 = t3 - t2;
        std::cout << "context->enqueue(BATCH_SIZE, it took me total " << time_span_3.count() << " milliseconds.\n";
        cudaMemcpyAsync(output_array, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream);
        high_resolution_clock::time_point t4 = high_resolution_clock::now();
        duration<double, std::milli> time_span_4 = t4 - t3;
        std::cout << "cudaMemcpyAsync(output, it took me total " << time_span_4.count() << " milliseconds.\n";
        cudaStreamSynchronize(stream);
        high_resolution_clock::time_point t5 = high_resolution_clock::now();
        duration<double, std::milli> time_span_5 = t5 - t1;
        std::cout << "infer one, it took me total " << time_span_5.count() << " milliseconds.\n";

I don’t see anything in the code snippet that measures the timing data reported in the original post. How were those execution times determined?

My working hypothesis is that there is no issue with slow copies from device to host, but that there is an issue with how the reported times (1.5ms, 40ms) were measured.

Can you state, for the record, exactly how much data was copied for the two transfers in question? I don’t see any definitions for BATCH_SIZE, INPUT_H, INPUT_W, OUTPUT_SIZE.

@njuffa thanks for your reply.

I use high_resolution_clock to measure the elapsed time, and I set:

int INPUT_H=1376, INPUT_W=800; 
int BATCH_SIZE=1;
int OUTPUT_SIZE=INPUT_H/4*INPUT_W/4;

I have updated my post. Please check again.

Looking forward to your advice.

Since GPU kernel launches are asynchronous, I believe your timing measurement is absorbing GPU kernel work launched previously.

I think you may be confused about what you are measuring. You’ve now updated the code to something that can be discussed, but you haven’t shown the actual output, i.e., the timing measurements reported by that code.

Since output_array is not in pinned host memory, the cudaMemcpyAsync operation there will convert to an ordinary cudaMemcpy operation, i.e., it will be blocking. This will certainly cause that measurement to absorb all previous asynchronous activity.
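A minimal sketch of allocating the result buffer in pinned (page-locked) memory instead, using the buffer names and sizes from this thread (error checking omitted):

```cpp
// Allocate pinned host memory so cudaMemcpyAsync can be truly asynchronous.
float* output_array = nullptr;
cudaMallocHost((void**)&output_array, BATCH_SIZE * OUTPUT_SIZE * sizeof(float));

// The device-to-host copy can now return immediately and overlap other work;
// synchronize on the stream before reading output_array on the host.
cudaMemcpyAsync(output_array, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

// Pinned allocations must be released with cudaFreeHost, not delete[].
cudaFreeHost(output_array);
```

Pinned memory also transfers faster than pageable memory, since the driver can DMA directly from it without an intermediate staging copy.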

I have a strong suspicion you don’t understand the nature of CUDA asynchronous execution, its implications for timing, or the general behavior of cudaMemcpyAsync.

This recent thread may be of interest:

https://devtalk.nvidia.com/default/topic/1045138/cuda-programming-and-performance/is-there-any-way-to-copy-data-from-device-to-host-more-efficiently-in-this-case-/

@Robert_Crovella
Thanks for your reply. After studying some materials about CUDA streams, I now know how to measure the time of a cudaMemcpyAsync operation (e.g., using cudaEvent_t).
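For reference, a minimal cudaEvent_t timing sketch along those lines, reusing the buffer names from this thread (error checking omitted):

```cpp
// Sketch: timing a device-to-host copy with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record events into the same stream as the copy, so the interval
// brackets only the copy, regardless of other pending host-side work.
cudaEventRecord(start, stream);
cudaMemcpyAsync(output_array, buffers[outputIndex],
                BATCH_SIZE * OUTPUT_SIZE * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);   // wait until the copy and stop event complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "device-to-host copy: " << ms << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Since events execute in stream order, the start event fires only after all earlier work in the stream (e.g., the enqueued kernels) has finished, so the measured interval excludes kernel execution time.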

Thank you very much!