how to speed up? data transfer

Hi, i have 2 big arrays(each cca. 256MB) and i have to do matrix addition and store it in another array…
Now my problem is that the data transfer is bottleneck since one cudaMemcpy() takes about 45ms and i have 3 of them which is about 135ms just for transfer + about 50ms for kernel execution(GTX 460)…
Is there some way to speed up this? Or the streams and ovelapping are the only way?
Sorry for my english

Hi! I will try to answer your question.

First of all you are saying that you are uploading 256MB@45ms? This means that you have a throughput of approx. 5.7GB/s! Are you already using page-locked memory?

One way to speed up the Memcpy is to use page-locked memory and use async. memcopy calls.

Another possible solution it to use “page-locked host memory” which enables you to give your kernel the CPU pointer of the storage and let the kernel stream in the data over PCI-e as needed and also put it back in the CPU memory when done.

Also you could set the flag to write combined if you now that you only read from certain pointers and write to others.

Check the CUDA programming guide 3.2 on how to use this. (chapter 3.2.5)

Yes i forgot to tell,i do use pinned memory.

Ok i tried with zero-copy access,and found that it is veery slow cca. 1480ms and with old cudaMemcpy() version it takes about 190ms.I am pretty sure that i am wrong somewhere,but i just cant’t figure out where…

Here’s my code so if you could help me…


#define HEIGHT  8*1024

#define WIDTH 8*1024

#define threadsX 16

#define threadsY 16

#define blocksX 512

#define blocksY 512

#include <cuda.h>

#include <stdio.h>

__global__ void add( int *a, int *b, int *c) {

	int x =threadIdx.x + blockIdx.x * blockDim.x; // handle the data at this index

	int y =threadIdx.y + blockIdx.y * blockDim.y;	

	int offset = y + x * HEIGHT;

	while (y < HEIGHT) {

		while (x < WIDTH) {

			c[offset] =a[offset] + b[offset];			

			x += blockDim.x * gridDim.x;

			offset = y + x * HEIGHT;


		x =threadIdx.x + blockIdx.x * blockDim.x;

		y += blockDim.y * gridDim.y;

		offset = y + x * HEIGHT;



int main( void ) {

	int *a, *b, *c;

	int *dev_a, *dev_b, *dev_c;	

	cudaEvent_t start, stop;


	cudaHostAlloc( &a, HEIGHT * WIDTH * sizeof (int), cudaHostAllocMapped);

	cudaHostAlloc( &b, HEIGHT * WIDTH * sizeof (int), cudaHostAllocMapped);

	cudaHostAlloc( &c, HEIGHT * WIDTH * sizeof (int), cudaHostAllocMapped);

	cudaHostGetDevicePointer( &dev_a, a, 0);

	cudaHostGetDevicePointer( &dev_b, b, 0);

	cudaHostGetDevicePointer( &dev_c, c, 0);

	// fill the arrays 'a' and 'b' on the CPU

	for (int i=0; i<WIDTH; i++) {

		for (int j=0; j<HEIGHT; j++) {

			a[i * HEIGHT + j] = -i+j;

			b[i * HEIGHT + j] = i * j;



	// capture the start time

	cudaEventCreate( &start ) ;

	cudaEventCreate( &stop ) ;

	cudaEventRecord( start, 0 ) ; 

	dim3 threads (threadsX, threadsY);

	dim3 blocks (blocksX, blocksY );

	add <<<blocks,threads>>> (dev_a, dev_b, dev_c);



//capture the stop time

	cudaEventRecord( stop, 0 ) ;

	cudaEventSynchronize( stop ) ;

float elapsedTime;

	cudaEventElapsedTime( &elapsedTime,

	start, stop ) ;

	printf( "Time to generate: %3.1f ms\n", elapsedTime );

	cudaEventDestroy( start ) ;

	cudaEventDestroy( stop ) ; 




	return 0;


That kernel looks like a pretty convoluted way to do simple addition. Given the matrices are just stored in linear memory and have the same layout, why not just add corresponding words in each input array.

I made kernel like that so i could add matrices with infinitely number of elements,and not to be restricted by maximum number of threads…

You can do that with a single loop and five lines of kernel code.

How could i do that?

<template typename T>

__global__ void addkernel(const T *a, const T *b, T *c, const size_t n)


	size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;

	size_t stride = blockDim.x * gridDim.x;

	for(size_t i=tidx; i<n; i+=stride) {

		c[i] = a[i] + b[i]; 



launch it with 8 blocks per multiprocessor and as many warps per block as will give full occupancy on the architecture you are using. No limits anywhere except size_t and GPU memory. All reads and writes should be coalesved. Should be faster than what you are using now.

You should be able to overlap the transfer to the device with the transfer from the device. See section 3.1.2 of the “best practices guide”.

I am also facing the same issue of data transfer bottleneck.

I have already used pinned memory(this gave me a significant improvement), tried zero copy, but it did not give any improvement.
So I am trying to use some kind of algorithm for data compression and transfer compressed data.

Has any body done this. please help me. Your ideas and thoughts are warmly welcome.

Unfortunately, that is not possible on the GeForce cards. Only one DMA engine is available.

Thanks!it did job very well…i menaged to get 138ms which is great!

You mean to launch it with 8*7 SM’s=56 blocks , and 1024 threads?I tried that and it takes about 153ms…

I read it but there is only about overlapping data transfer with execution.Did you mean to use 2 streams,one for transfer to device and kernel execution and one for transfer from device?Thanks!

If you need to access all elements once try zero-copy code i pasted,and fix a kernel with a avidday’s

Actually i think it’s possible on CC 2.1 devices…

Oh? Hmm, I have a GTS 450 here. Should test this…

Either that, or just do exactly what the best practices guide says. Transfers from device should be automatically overlapped with transfers to device.

But there’s also a question whether your system memory can handle two streams of 5+ GB/s at the same time.

I was under the impression that all Fermi cards have two DMA engines. And that PCI Express allows you to transfer data in two directions at the same time. I could be wrong.

It is possible though to overlap one transfer via DMA with a transfer via zerocopy (preferably in the opposite direction so they don’t compete for PCIe bandwidth).

The second DMA engine is only enabled for Tesla and Quadro Fermi cards, not for GeForce.

Are you sure that zero copy bypasses the DMA?

pretty much

You’re quite right.

I wrote a test app. On a GeForce 560, I see almost no speedup from streams if I use cudaMemcpy to transfer data both ways. But there’s a substantial speedup from using cudaMemcpy to send data to the device and zero-copy to send data back.