Overhead using cudaMemcpyAsync

Hello !

I’m trying to stream my app. I have to launch the same kernel many times (few hundred to few thousand) on differents inputs. Before each launch I have to send 2 chunks of data (first is between a few kB and 3MB and second is 195kB). The kernel doesn’t do many computations and it’s very small. Basically it compute 5 memory reads, 1 mul, 3 add, and a final memory write.

I thought my problem would perfectly fit in streaming, however I have worst performance with than without streaming.

Of course memory is pinned with cudaMallocHost, and device is able to overlap : it’s a GTX295. Thinking about an overhead with cudaMemcpyAsync and small chunk of data, I just modified “n” var in simpleStreams SDK example to check that. As a result, streaming was always slower with less than 768000 int (3MB data).

Here is an extract of the output :

-------------------------------										  

> array_size   = 716800												  

memcopy:		0.55													 

kernel:		 0.88													 

non-streamed:   1.37 (1.44 expected)									 

4 streams:	  1.42 (1.02 expected with compute capability 1.1 or later)

-------------------------------										  

> array_size   = 768000												  

memcopy:		0.59													 

kernel:		 0.95													 

non-streamed:   1.46 (1.55 expected)									 

4 streams:	  1.45 (1.10 expected with compute capability 1.1 or later)

-------------------------------										  

> array_size   = 819200												  

memcopy:		0.63													 

kernel:		 1.01													 

non-streamed:   1.55 (1.63 expected)									 

4 streams:	  1.51 (1.17 expected with compute capability 1.1 or later)

-------------------------------

Well, can anyone confirm my experiments : cudaMemcpyAsync has to much overhead to allow stream to be used with small (<3-4MB) chunk of data ?

May I expect better result with zero-copy ?

I don’t know the exact overhead of the async copy, but I get good results with this architecture for video processing.

I have 4 Buffer, one input buffer (d_data1), the gpu is working/reading from d_data2 and writing the result to d_data3.

The result of the computation is copied from d_data4 to the host memory. Afterwards I switch the positions of my

device buffers d_data1 becomes to d_data2 and so on.

All the work is done simultaniously on the gpu (copy to device, kernel, copy to host) and on each call of this filter

one image is processed. The CPU can do some other work during the async execution, e. g. calculate some other

filters. The advantage is, that this implementation don’t need much CPU time. My async calls need only about 1 ms and

the CPU can do some other work.

You can see it as a software pipeline. The drawback is, that it has a latency of 4 frames, but in my application it doen’t matter.

// Synchronization (wait until calculations of the frame before are completed)

	cudaStreamSynchronize (stream1);

	cudaStreamSynchronize (stream2);

	cudaStreamSynchronize (stream3);

	// copy data to pinned memory (necessary for async transfer)

	memcpy((void*) input_pinned, (void*) input->image()->imageData, mem_size);

	//Async transfer of host input data to device memory d_data1 (stream1)

	cudaMemcpy2DAsync(d_data1, pitch, input_pinned, input->image()->widthStep, input->image()->width, input->image()->height, cudaMemcpyHostToDevice, stream1);

	// Async Kernel Call Filter (stream2) 

	cudaStartKernel(d_data2, d_data3, input->image(), stream2, pitch);

	//Async transfer of device memeory d_data4 t to host memory output_pinned

	cudaMemcpy2DAsync(output_pinned, input->image()->widthStep, d_data4, pitch, input->image()->width, input->image()->height, cudaMemcpyDeviceToHost, stream3);

	// switch gpu buffers (ringbuffer)

	tmp = d_data4;

	d_data4 = d_data3;

	d_data3 = d_data2;

	d_data2 = d_data1;

	d_data1 = tmp;

Nice ! But the question remain the same : what is the size of your buffers ? Tens of MB or some kB ?

I use a quite small buffer with only 75 kB.

I didn’t measured the “latency overhead” of the async copy, because my problem perfectly fits to the architecture I described below.

However, I guess that there might be an latency overhead, because the CPU must initiate this DMA transfer on the GPU, but afterwards

the memory is copied by the DMA controller on the GPU, without any CPU intererence.

Well in fact, I read your code, and probably it would even faster without streaming.

I had the same problem some time ago. Streams slowed down my application, and without them it was very fast. The reason was very simple: my buffers were very small and the execution time of the kernel, too. The overhead of the streams were just too big to save some runtime, the kernel and memory copies were done before some overlapping could happen.