Really strange memcpy times in the SDK matrixMul sample

I am stuck on this problem. Can you guys help me out? Thanks.

I tried to record the data transfer times (CPU->GPU and GPU->CPU) in the matrixMul project.

There are two memcpys for host->device (matrices A and B) and one memcpy for device->host (matrix C).

I set the same data size for A, B, and C.

Problem is:

  1. The first host->device memcpy takes much less time than the second (the second is 3x or more the first). I have taken care of cudaThreadSynchronize.

  2. The host->device time is much larger than device->host, yet the bandwidthTest result shows similar bandwidth in both directions.

  3. Problems 1 and 2 happen for both single and double precision.

  4. My double-precision kernel takes 7x as long as the single-precision one. Both versions are functionally identical except that one uses float and the other double.

I inserted the timing code like this for host->device, and did the same thing for device->host.

unsigned int timer = 0;
cutilCheckError( cutCreateTimer(&timer) );

// make sure the device is idle before starting the timer
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutStartTimer(timer) );

// copy host memory to device
cutilSafeCall( cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice) );

// wait until the copies have actually finished before reading the timer
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutStopTimer(timer) );

printf("GPU time: %f (ms)\n", cutGetTimerValue(timer));
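For what it's worth, the same copies can also be timed with CUDA events instead of the cutil timer; events are recorded in the GPU's command stream, so they bracket exactly the work you enqueue. A sketch, reusing d_A/h_A and the sizes from the snippet above:

cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // enqueue start marker
cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);                // enqueue stop marker
cudaEventSynchronize(stop);              // wait until the copies have finished

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("GPU time: %f (ms)\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);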


I am developing a kernel right now in which I have to copy data row by row (with some pitch).

Since it has a pitch, I had to copy it using multiple cudaMemcpy calls, and that KILLS the performance…

I wrote a separate application to verify that… And yes, it is a big performance killer.

#include <stdio.h>

#include "PerformanceCounter.h"   // author's HPTimer wrapper

#define NUM_THREADS 512
#define NUM_BLOCKS  512

cudaDeviceProp prop;

int *pinned;
int *gpu, *realGPU;

__global__ void mykernel(int *d)
{
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    d[id] = id;
}

int main(void)
{
    HPTimer profiler;

    SetThreadAffinityMask(GetCurrentThread(), 1);

    cudaGetDeviceProperties(&prop, 0);   // fill 'prop' before checking canMapHostMemory
    if (prop.canMapHostMemory == false)
    {
        printf("Device 0 canNOT map Host memory!\n");
        return -1;
    }

    cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);

    cudaMalloc((void**)&realGPU, NUM_THREADS*NUM_BLOCKS*sizeof(int));
    cudaHostAlloc((void **)&pinned, NUM_THREADS*NUM_BLOCKS*sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&gpu, pinned, 0);

    profiler.Start();   // assuming HPTimer exposes Start(); the original timing calls were lost from the post
    mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (gpu);
    cudaThreadSynchronize();   // kernel writes straight into mapped host memory

    printf("Time taken for %d amount of data with mapped memory = %f\n",
           NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

    profiler.Start();
    mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (realGPU);

    // copy the result back row by row -- this is the performance killer
    for(int i=0; i<NUM_BLOCKS; i++)
    {
        cudaMemcpy(pinned + i*NUM_THREADS, realGPU + i*NUM_THREADS,
                   NUM_THREADS*sizeof(int), cudaMemcpyDeviceToHost);
    }

    printf("Time taken for %d amount of data with normal kernel = %f\n",
           NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

    return 0;
}


The numbers I get on my TESLA C1060 are:

That 0.007041 is because of the multiple cudaMemcpy calls… If I use a single cudaMemcpy instead, I get almost equal times.

That makes it 18x slower… But yeah, that is 512 calls to cudaMemcpy though…


I changed the 512 memcpys to 2 memcpys transferring the same total amount of data…

There was no significant change in time…

Meaning: the 2nd memcpy was as good as the 1st memcpy for the same amount of data.

But then, I am using pinned memory…

Pinning will definitely make a difference, though…

So, I think this could be something to do with your “Matrix Size” and whether the matrices are resident in memory or not…
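As a sanity check on the pinning point, a minimal sketch (buffer size is an arbitrary 32 MB; error checking omitted) that times the same device->host copy from pageable and pinned host memory:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Time one device->host cudaMemcpy into 'dst' using CUDA events.
float timeCopy(void *dst, void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;
    void *d_buf, *pageable, *pinnedBuf;
    cudaMalloc(&d_buf, bytes);
    pageable = malloc(bytes);               // ordinary pageable host memory
    cudaMallocHost(&pinnedBuf, bytes);      // page-locked (pinned) host memory

    printf("pageable: %f ms\n", timeCopy(pageable,  d_buf, bytes));
    printf("pinned:   %f ms\n", timeCopy(pinnedBuf, d_buf, bytes));

    cudaFree(d_buf);
    free(pageable);
    cudaFreeHost(pinnedBuf);
    return 0;
}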

Do these matrices reside completely in memory? (Do you have sufficient RAM?)

The matrix size is very small: 32*32

If you have time, you can run the matrixMul project in the SDK and record the times for the matrix A host->device transfer, the matrix B host->device transfer, and the matrix C device->host transfer. Then you will see what my problem is.

BTW, I am using a Tesla C1060 as well.


If I have time, I will do that tomorrow…

I just went through the code. Could not find anything big…
If possible, try printing the CPU and GPU addresses. Maybe that will give some clue…

Silly question, but why aren’t you using cudaMemcpy2D for transfers with different pitch?
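For reference, a sketch of how the 512-call row-by-row copy above could collapse into a single cudaMemcpy2D (both buffers in that test program are dense, so each pitch here is just the row width in bytes; with a real pitched allocation you would pass the actual pitch instead):

// Copy NUM_BLOCKS rows of NUM_THREADS ints in one call.
// dpitch/spitch are the byte strides between the starts of consecutive rows.
size_t width  = NUM_THREADS * sizeof(int);   // bytes per row
size_t height = NUM_BLOCKS;                  // number of rows
cudaMemcpy2D(pinned,  width,                 // dst, dst pitch
             realGPU, width,                 // src, src pitch
             width, height,
             cudaMemcpyDeviceToHost);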


That means cudaHostAlloc gives a performance improvement over cudaMalloc then.

In my code all the kernels work on outData, which is allocated using cudaMalloc. So, if I replace the allocation with the zero-copy functions (cudaHostAlloc), the remaining code should be left unchanged except for cudaFree() and cudaMemcpy()??? Am I correct here?
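To make the question concrete, here is a sketch of what that conversion might look like (outData is the name from the post; kernel, blocks, threads and size are placeholders; error checking omitted):

// Before: cudaMalloc(&outData, size); kernel writes it; cudaMemcpy back; cudaFree(outData);

// After: zero-copy (mapped pinned) memory
float *h_outData, *d_outData;
cudaSetDeviceFlags(cudaDeviceMapHost);                        // must be set before the context is created
cudaHostAlloc((void**)&h_outData, size, cudaHostAllocMapped); // replaces cudaMalloc
cudaHostGetDevicePointer((void**)&d_outData, h_outData, 0);   // device-side alias of the same memory

kernel<<<blocks, threads>>>(d_outData);   // kernels take the device pointer, otherwise unchanged
cudaThreadSynchronize();                  // after this, results are visible in h_outData; no cudaMemcpy needed

cudaFreeHost(h_outData);                  // replaces cudaFree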