Really strange memcpy time in matrixMul at SDK

duheng00 · July 7, 2009, 5:26pm

I am stucked with this problem. Can you guys help me out? Thanks.

I tried to record the data transfer time CPU->GPU, GPU->CPU in matrixMul project.

There are two memcpy for host->device (for matrix A, B), one memcpy for device->host. (for matrix C)

I set the datasize as the same for A, B, C.

Problem is:

the first memcpy from host ->device is much less than the second host->device. (the second one is 3 times or more than the first.) I have taken care of threadSynchronize.
host->device time is much larger than device->host. I checked the bandwidthTest result, both directions have similar bandwidth.
problem 1 and 2 happen for both single precision and double precision.
The time for my kernel with double precision is 7 times as that with single precsion. Both single precsion and double precision have the same functionality except that one is float, one is double.

I insert the record time code like this for recording time of host->device. I did the same thing for device->host.

unsigned int timer = 0;

   cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError(cutCreateTimer(&timer));

	cutilCheckError(cutStartTimer(timer));

	// copy host memory to device

	cutilSafeCall(cudaMemcpy(d_A, h_A, mem_size_A,

							  cudaMemcpyHostToDevice) );

	

	cutilSafeCall(cudaMemcpy(d_B, h_B, mem_size_B,

							  cudaMemcpyHostToDevice) );

cutilSafeCall( cudaThreadSynchronize() );

	 

 cutilCheckError(cutStopTimer(timer));

	printf("GPU time: %f (ms) \n", cutGetTimerValue(timer));

cutilCheckError(cutDeleteTimer(timer));

Sarnath · July 7, 2009, 6:22pm

I am developing a kernel right now, where in I have to copy data row by row (with some pitch)

Since it has a pitch, I had to copy it using multiple cudaMemcpy – And, that KILLS the performance…

I wrote a separate application to verify that… And, yes, it is a big performance killer.

#include <stdio.h>

#include "PerformanceCounter.h"

#define NUM_THREADS 512

#define NUM_BLOCKS  512

cudaDeviceProp prop;

int *pinned;

int *gpu, *realGPU;

__global__ void mykernel(int *d)

{ 

  int id = blockIdx.x*blockDim.x + threadIdx.x;

d[id] = id;

}

int main(void)

{

   HPTimer profiler;

SetThreadAffinityMask(GetCurrentThread(), 1);

   cudaSetDevice(0);

   cudaGetDeviceProperties(&prop,0);

   if (prop.canMapHostMemory == false)

   {

	 printf("Device 0 canNOT map Host memory!\n");

	 return -1;

   }

   cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);

cudaMalloc((void**)&realGPU, NUM_THREADS*NUM_BLOCKS*sizeof(int));

cudaHostAlloc((void **)&pinned, NUM_THREADS*NUM_BLOCKS*sizeof(int), cudaHostAllocMapped);

   cudaHostGetDevicePointer((void**)&gpu, pinned, 0);

   profiler.start();

   mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (gpu);

   cudaThreadSynchronize();

   profiler.stop();

   printf("Time taken for %d amount of data with mapped memory = %f\n",

		   NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

profiler.start();

   mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (realGPU);

   cudaThreadSynchronize();

   for(int i=0; i<NUM_BLOCKS; i++)

   {

	 cudaMemcpy(pinned + i*NUM_THREADS, realGPU + i*NUM_THREADS,

				NUM_THREADS*sizeof(int), cudaMemcpyDeviceToHost);

   }

   profiler.stop();

   printf("Time taken for %d amount of data with normal kernel = %f\n",

		   NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

}

The numbers I get on my TESLA C1060 are:

That is 0.007041 because of the multiple cudaMemcpy… If I had a single cudaMemcpy, I get almost equal time.

Makes it 18x times slower…But yeah, that is 512 calls to cudaMemcpy though…

Sarnath · July 7, 2009, 6:31pm

btw,

I changed the 512 memcpys to 2 memcpys doing the same amount of data…

There was no significant change in time…

Meaning – the 2nd memcpy was as good as the 1st memcpy for the same amount of data.

but then, I am using pinned memory…

pinning will definitely make a difference, though…

So,I think this could be something to do with your “Matrix Size” and whether they are resident in memory or not…

Do these matrices reside in memory completely? (Do u have sufficient RAM?)

duheng00 · July 7, 2009, 6:41pm

The matrix size is very small: 32*32

If you have time, you can run the matrixMul in SDK project, and record the time for matrix A transfer from host->device, matrix B transfer host->device and matrix C from device → host. Then you will know what my problem is.

BTW, I am using Tesla c1060 as well.

Thanks.

Sarnath · July 7, 2009, 6:51pm

If I have time , I will do that morrow…

I just went through the code. Could not find anything big…
btw,
If possible, try printing the CPU and GPU addresses. May be, that might give some clue…

Nico · July 8, 2009, 6:59am

Silly question, but why aren’t you using cudaMemcpy2D for transfers with different pitch?

N.

CUDAkk · July 9, 2009, 7:41am

I am developing a kernel right now, where in I have to copy data row by row (with some pitch)

Since it has a pitch, I had to copy it using multiple cudaMemcpy – And, that KILLS the performance…

I wrote a separate application to verify that… And, yes, it is a big performance killer.

#include <stdio.h>

#include "PerformanceCounter.h"

#define NUM_THREADS 512

#define NUM_BLOCKS  512

cudaDeviceProp prop;

int *pinned;

int *gpu, *realGPU;

__global__ void mykernel(int *d)

{ 

  int id = blockIdx.x*blockDim.x + threadIdx.x;

d[id] = id;

}

int main(void)

{

   HPTimer profiler;

SetThreadAffinityMask(GetCurrentThread(), 1);

   cudaSetDevice(0);

   cudaGetDeviceProperties(&prop,0);

   if (prop.canMapHostMemory == false)

   {

	 printf("Device 0 canNOT map Host memory!\n");

	 return -1;

   }

   cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);

cudaMalloc((void**)&realGPU, NUM_THREADS*NUM_BLOCKS*sizeof(int));

cudaHostAlloc((void **)&pinned, NUM_THREADS*NUM_BLOCKS*sizeof(int), cudaHostAllocMapped);

   cudaHostGetDevicePointer((void**)&gpu, pinned, 0);

   profiler.start();

   mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (gpu);

   cudaThreadSynchronize();

   profiler.stop();

   printf("Time taken for %d amount of data with mapped memory = %f\n",

		   NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

profiler.start();

   mykernel <<< NUM_BLOCKS, NUM_THREADS >>> (realGPU);

   cudaThreadSynchronize();

   for(int i=0; i<NUM_BLOCKS; i++)

   {

	 cudaMemcpy(pinned + i*NUM_THREADS, realGPU + i*NUM_THREADS,

				NUM_THREADS*sizeof(int), cudaMemcpyDeviceToHost);

   }

   profiler.stop();

   printf("Time taken for %d amount of data with normal kernel = %f\n",

		   NUM_THREADS*NUM_BLOCKS, profiler.TimeInSeconds());

}

The numbers I get on my TESLA C1060 are:

That is 0.007041 because of the multiple cudaMemcpy… If I had a single cudaMemcpy, I get almost equal time.

Makes it 18x times slower…But yeah, that is 512 calls to cudaMemcpy though…

That means cudaHostAlloc gives performance improvement then cudaMalloc.

In my code all the kernels working on outData which is allocated using cudaMalloc .So, if I replace allocation to zero copy functions(cudaHostAlloc ) then remaning code should leave unchange except cudaFree()&cudaMemcpy()??? Am I correct here?

Topic		Replies	Views
Why is there the difference of memory copy speed between cpu>gpu and gpu>cpu CUDA Programming and Performance	3	1278	April 10, 2014
Why does my kernel take too long occasionally? CUDA Programming and Performance	21	8764	October 13, 2010
Handful of Slow Memory Transfers CUDA Programming and Performance	7	813	June 17, 2016
Slow memory transfers CUDA Programming and Performance	7	1990	May 23, 2011
matrix multiplication with its transpose in cuda(cudamemcpy from device to host not working) . CUDA Programming and Performance	6	1746	October 5, 2018
Copying memory from device to Host takes too much time CUDA Programming and Performance	7	3392	October 5, 2010
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16291	January 30, 2011
How to explain the performance difference? CUDA Programming and Performance	7	3506	March 26, 2008
bets way to return a float value sync or assync CUDA Programming and Performance	26	10314	May 7, 2009
A few questions on CUDA performance with pictures! CUDA Programming and Performance	6	3349	January 10, 2009

Really strange memcpy time in matrixMul at SDK

Related topics