Possibly stupid question about cudaMemcpy: cudaMemcpy getting slower over time

Hi there,

I’m working on a GTX 295 (3.0 beta driver) and do something like this:

int *a, *b;

cudaMalloc((void**) &a,  memSize);

cudaMalloc((void**) &b,  memSize);

//some memory initialization on a and b ...

for (int i = 0; i < bigvalue; i++) {

  cudaMemcpy( a, b, memSize, cudaMemcpyDeviceToDevice );

}

I do not free memory in the loop nor do I reallocate any memory.

The copy takes longer and longer every time through the loop. Why could this be?

It starts at about 0.005 ms.

After about 500 runs it already takes 0.147 ms, and there is no end in sight! :(

If my code excerpt is not sufficient, please tell me what I should post…


Do you have the same behaviour if you put a cudaThreadSynchronize right after the cudaMemcpy? (Or at the end of the loop, so that you overlap some computation.) Perhaps you have too many pending transfers?
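To illustrate what I mean (just a sketch; startTimer/stopTimer stand for whatever host-side timer you are using): the synchronize goes between the copy and the timer read-out, because the memcpy call can return before the transfer has actually completed, and then you measure queueing rather than copying.

```cuda
// Sketch: force completion before reading a host-side timer.
// Without the synchronize you may only measure the time to
// enqueue the copy, not the copy itself.
for (int i = 0; i < bigvalue; i++) {
    startTimer();                                         // hypothetical host timer
    cudaMemcpy(a, b, memSize, cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();  // wait until the copy has really finished
    stopTimer();
}
```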

Just my 2 cents,

Hi Cedric,

yes, with a cudaThreadSynchronize after the cudaMemcpy I get the same result…

I must admit I can’t see how that can become so slow. Perhaps you have some small piece of code everyone could try, so we can see whether we get the same behaviour? I’m pretty curious to see what is happening.
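Something along these lines would do as a standalone test (a sketch only; it times each device-to-device copy with CUDA events, which report elapsed milliseconds, and the 1 MB size and iteration count are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t memSize = 1 << 20;           // 1 MB, arbitrary test size
    int *a, *b;
    cudaMalloc((void**)&a, memSize);
    cudaMalloc((void**)&b, memSize);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 1000; i++) {
        cudaEventRecord(start, 0);
        cudaMemcpy(a, b, memSize, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);           // wait for the copy to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (i % 100 == 0)
            printf("iteration %d: %f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the printed time grows with the iteration count here too, the problem is in the driver/runtime; if not, it is something in the surrounding code.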


Hi Cedric,

thanks (in advance) for your help and interest. I have solved the problem with a workaround that would actually be better than the memcpy approach even if memcpy were working fine:

for (int somecounter = 0; somecounter < 119046; ++somecounter) {

  //do much calculation that's time-measured and working kind of quick.

  sharedMemSize = numThreadsPerBlock * sizeof(int);

  dim3 dimBlock(numThreadsPerBlock);

  cudaMemcpy( d_temp1, d_sum, memSize/64, cudaMemcpyDeviceToDevice ); //GETTING WORSE

  for (int s = linescount; s > 1; ) {

    int numBlocks = s / (2 * numThreadsPerBlock);

    if ((s % (2 * numThreadsPerBlock)) != 0)
      numBlocks++;  // round up for the leftover elements

    dim3 dimGrid(numBlocks);

    reduce <<< dimGrid, dimBlock, sharedMemSize >>> (d_temp1, d_temp2, s);

    cudaMemcpy( d_temp1, d_temp2, numBlocks*sizeof(int), cudaMemcpyDeviceToDevice ); //GETTING WORSE

    s = numBlocks;

  }

  PRINT_TIMER("kernel reduce: %f\n", 0);

  //do the rest of the calculation

}
The lines marked with “GETTING WORSE” are the ones that get slower over time.
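As an aside, the inner-loop device-to-device copy can be avoided entirely by ping-ponging the two temp buffers between reduction passes. A sketch of the idea, using the same names as above (after the loop the latest partial sums are reachable through d_temp1, so any code that read d_temp1 afterwards still works):

```cuda
// Sketch: swap pointers instead of copying d_temp2 back to d_temp1.
// After each pass the freshly reduced data is simply relabelled as the
// input for the next pass, so no cudaMemcpy is needed in the loop.
for (int s = linescount; s > 1; ) {
    int numBlocks = s / (2 * numThreadsPerBlock);
    if ((s % (2 * numThreadsPerBlock)) != 0)
        numBlocks++;                          // round up for leftover elements

    dim3 dimGrid(numBlocks);
    reduce<<<dimGrid, dimBlock, sharedMemSize>>>(d_temp1, d_temp2, s);

    int *tmp = d_temp1;                       // pointer swap replaces the copy
    d_temp1 = d_temp2;
    d_temp2 = tmp;

    s = numBlocks;
}
```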

Some more information about data sizes and so on:

  • d_temp1 and d_temp2 are allocated once, outside all the loops, at the beginning of main()

  • no data is ever reallocated in any loop

  • d_temp1 and d_temp2 are never used anywhere other than here

  • d_sum changes in every outer-loop iteration and gets its data from calculations done by other kernels

  • the size of d_sum is 119046*64*sizeof(int)

I can’t reproduce this behaviour in a minimal piece of code, so there must be some weird error that I cannot see. Perhaps you can find an error in this excerpt.

Bye - Julian