Copy to/from device performance

We’ve just started to look at what performance gains we could get from using CUDA, and the initial results are somewhat dull…

When copying an array to and from the device, the average execution time for the “calc” method is 0.24 ms over 10000 tries.

The size of the data is 320x256. Is this a reasonable time or have I missed something? Source code is provided below:

__global__ void calc(double* input,double* output){


extern "C" __declspec(dllexport) void preAlloc(int rows,int columns,void** cuDataPtr,void** cuResultPtr,int* pitch){

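	// cudaMallocPitch writes a size_t; casting the int* would overrun it on 64-bit builds.
	size_t pitchBytes = 0;

	cudaMallocPitch(cuDataPtr,&pitchBytes,(size_t)(rows*sizeof(double)),(size_t)columns);

	cudaMallocPitch(cuResultPtr,&pitchBytes,(size_t)(rows*sizeof(double)),(size_t)columns);

	*pitch = (int)pitchBytes;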


extern "C" __declspec(dllexport) void calc(void* cuDataPtr,double* input,void* cuResultPtr,double* output,int rows,int columns,int pitch){




0.24 ms seems like a lot of time just to copy 640k of data to and from the device on an x260 card?

((640 KiB * 1 024 bytes/KiB) / 0.24e-3 s) / (1 024^3 bytes / GiB) = 2.54313151 GiB/s

That is a decent utilization of the PCIe link. What more do you expect?

If you allocate the host memory with cudaMallocHost, the speed may bump up a little bit, but 4 GiB/s is the most you can reasonably expect.
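To see what pinned memory buys you on your own machine, something along these lines should work. It is only a sketch: it uses plain cudaMalloc and cudaMemcpy rather than the pitched allocation from the question, the helper average_round_trip_ms and the CHECK macro are names I made up, and it times one upload plus one download per iteration with CUDA events.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper macro).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

constexpr size_t kBytes = 320 * 256 * sizeof(double);  // 640 KiB, as in the question
constexpr int kIters = 10000;                          // same number of tries

// Average time in ms for one host->device copy plus one device->host copy.
float average_round_trip_ms(double* host, double* dev) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < kIters; ++i) {
        CHECK(cudaMemcpy(dev, host, kBytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(host, dev, kBytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float total_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&total_ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return total_ms / kIters;
}

int main() {
    double* dev = nullptr;
    CHECK(cudaMalloc((void**)&dev, kBytes));

    // Ordinary pageable host memory, zero-initialised.
    double* pageable = static_cast<double*>(std::calloc(1, kBytes));
    if (pageable == nullptr) return EXIT_FAILURE;

    // Page-locked (pinned) host memory from the CUDA runtime.
    double* pinned = nullptr;
    CHECK(cudaMallocHost((void**)&pinned, kBytes));
    std::memset(pinned, 0, kBytes);

    std::printf("pageable: %.4f ms per round trip\n",
                average_round_trip_ms(pageable, dev));
    std::printf("pinned:   %.4f ms per round trip\n",
                average_round_trip_ms(pinned, dev));

    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

Pinned memory has to be released with cudaFreeHost, not free. Pinning large amounts of memory takes it away from the OS pager, so only do it for buffers that are actually transferred.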