Overhead using cudaMemcpyAsync

Joky · August 27, 2009, 2:43pm

Hello!

I’m trying to use streams in my app. I have to launch the same kernel many times (a few hundred to a few thousand) on different inputs. Before each launch I have to send 2 chunks of data (the first is between a few kB and 3MB, the second is 195kB). The kernel is very small and doesn’t do much computation. Basically it performs 5 memory reads, 1 mul, 3 adds, and a final memory write.

I thought my problem would be a perfect fit for streaming, but I get worse performance with streaming than without.

Of course the memory is pinned with cudaMallocHost, and the device is able to overlap: it’s a GTX295. Suspecting an overhead in cudaMemcpyAsync with small chunks of data, I modified the “n” variable in the simpleStreams SDK example to check that. As a result, streaming was always slower with fewer than 768000 ints (3MB of data).

Here is an extract of the output:

-------------------------------
> array_size   = 716800
memcopy:        0.55
kernel:         0.88
non-streamed:   1.37 (1.44 expected)
4 streams:      1.42 (1.02 expected with compute capability 1.1 or later)
-------------------------------
> array_size   = 768000
memcopy:        0.59
kernel:         0.95
non-streamed:   1.46 (1.55 expected)
4 streams:      1.45 (1.10 expected with compute capability 1.1 or later)
-------------------------------
> array_size   = 819200
memcopy:        0.63
kernel:         1.01
non-streamed:   1.55 (1.63 expected)
4 streams:      1.51 (1.17 expected with compute capability 1.1 or later)
-------------------------------

So, can anyone confirm my experiments: does cudaMemcpyAsync have too much overhead for streams to be useful with small (<3-4MB) chunks of data?

Can I expect better results with zero-copy?

CapJo · August 27, 2009, 6:32pm

I don’t know the exact overhead of the async copy, but I get good results with this architecture for video processing.

I have 4 device buffers: d_data1 is the input buffer, the GPU kernel reads from d_data2 and writes its result to d_data3, and the finished result is copied from d_data4 to host memory. Afterwards I rotate the device buffers: d_data1 becomes d_data2, and so on.

All the work is done simultaneously on the GPU (copy to device, kernel, copy to host), and each call of this filter processes one image. The CPU can do other work during the async execution, e.g. compute other filters. The advantage is that this implementation doesn’t need much CPU time: my async calls take only about 1 ms.

You can see it as a software pipeline. The drawback is a latency of 4 frames, but in my application that doesn’t matter.

	// Synchronization (wait until the work issued for the previous frame has completed)
	cudaStreamSynchronize(stream1);
	cudaStreamSynchronize(stream2);
	cudaStreamSynchronize(stream3);

	// Copy data to pinned memory (necessary for async transfer)
	memcpy((void*) input_pinned, (void*) input->image()->imageData, mem_size);

	// Async transfer of host input data to device memory d_data1 (stream1)
	// Note: the width argument is in bytes, so passing width assumes 1 byte per pixel
	cudaMemcpy2DAsync(d_data1, pitch, input_pinned, input->image()->widthStep, input->image()->width, input->image()->height, cudaMemcpyHostToDevice, stream1);

	// Async kernel call of the filter (stream2): reads d_data2, writes d_data3
	cudaStartKernel(d_data2, d_data3, input->image(), stream2, pitch);

	// Async transfer of device memory d_data4 to host memory output_pinned (stream3)
	cudaMemcpy2DAsync(output_pinned, input->image()->widthStep, d_data4, pitch, input->image()->width, input->image()->height, cudaMemcpyDeviceToHost, stream3);

	// Switch GPU buffers (ring buffer)
	tmp = d_data4;
	d_data4 = d_data3;
	d_data3 = d_data2;
	d_data2 = d_data1;
	d_data1 = tmp;

Joky · August 29, 2009, 1:33pm

Nice! But the question remains the same: what is the size of your buffers? Tens of MB or a few kB?

CapJo · August 31, 2009, 2:27pm

I use a quite small buffer of only 75 kB.

I didn’t measure the “latency overhead” of the async copy, because my problem fits the architecture I described above perfectly.

However, I guess there is some latency overhead, because the CPU must initiate the DMA transfer on the GPU; after that, the memory is copied by the GPU’s DMA controller without any CPU interference.

Joky · August 31, 2009, 3:05pm

Well, in fact, I read your code, and it would probably be even faster without streaming.

Tobi_W · September 1, 2009, 9:09am

I had the same problem some time ago. Streams slowed down my application, and without them it was very fast. The reason was very simple: my buffers were very small and the kernel execution time was very short. The overhead of the streams was just too big to save any runtime; the kernel and the memory copies had finished before any overlap could happen.
