I’m working on a GTX295 (3.0beta driver) and do something like this:
int *a, *b;
cudaMalloc((void**) &a, memSize);
cudaMalloc((void**) &b, memSize);
// some memory initialization on a and b ...
for (int i = 0; i < bigvalue; i++) {
    // some code
    cudaMemcpy(a, b, memSize, cudaMemcpyDeviceToDevice);
    // some code
}
I do not free memory in the loop nor do I reallocate any memory.
The copy takes longer and longer every time through the loop. Why could this be?
It starts at about 0.005 ms.
After about 500 iterations it already takes 0.147 ms, and there's no end in sight! :(
If my code excerpt is not enough, please tell me what I should post…
Do you see the same behaviour if you put a cudaThreadSynchronize right after the cudaMemcpy? (Or at the end of the loop, so that you overlap some computation.) Perhaps you have too many pending transfers?
I must admit I can't see how it could become so slow. Perhaps you could post some small piece of code everyone could try, so we can see whether the same thing happens? I'm pretty curious to see what is going on.
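For what it's worth, here is a minimal timing pattern that rules out previously queued work skewing the measurement. It uses CUDA event timing; the variable names are just placeholders, and a and b stand for your device buffers:

```cuda
// Hedged sketch: synchronize before starting the timer so the measurement
// covers only this transfer, not work still pending in the queue.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaThreadSynchronize();            // drain anything still queued
cudaEventRecord(start, 0);
cudaMemcpy(a, b, memSize, cudaMemcpyDeviceToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // wait until the copy has really finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("copy took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

If the per-copy time stays flat with this pattern, the apparent slowdown was earlier asynchronous work being charged to the copy.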
Thanks (in advance) for your help and interest. I have solved the problem with a workaround that would even be better than the memcpy approach if memcpy worked properly:
for (int somecounter = 0; somecounter < 119046; ++somecounter) {
    // do much calculation that is time-measured and runs reasonably quickly
    sharedMemSize = numThreadsPerBlock * sizeof(int);
    dim3 dimBlock(numThreadsPerBlock);
    START_TIMER;
    cudaMemcpy(d_temp1, d_sum, memSize/64, cudaMemcpyDeviceToDevice); // GETTING WORSE
    for (int s = linescount; s > 1; ) {
        int numBlocks = s / (2 * numThreadsPerBlock);
        if ((s % (2 * numThreadsPerBlock)) != 0)
            numBlocks++;
        dim3 dimGrid(numBlocks);
        reduce<<<dimGrid, dimBlock, sharedMemSize>>>(d_temp1, d_temp2, s);
        cudaMemcpy(d_temp1, d_temp2, numBlocks * sizeof(int), cudaMemcpyDeviceToDevice); // GETTING WORSE
        s = numBlocks;
    }
    cudaThreadSynchronize();
    STOP_TIMER;
    PRINT_TIMER("kernel reduce: %f\n", 0);
    RESET_TIMER;
    // do the rest of the calculation
}
The lines marked “GETTING WORSE” are the ones that get slower over time.
Some more information about data sizes and so on:
d_temp1 and d_temp2 are allocated once, outside all loops, at the beginning of main()
no data is ever reallocated in any loop
d_temp1 and d_temp2 are never used anywhere other than here
d_sum changes in every outer-loop iteration and gets its data from calculations done by other kernels
the size of d_sum is 11904664 * sizeof(int)
I can't reproduce this behaviour in minimal code, so there must be some weird error that I cannot see. Perhaps you can find an error in this excerpt.
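Incidentally, the inner device-to-device copy can usually be avoided entirely by ping-ponging the two temp buffers, i.e. swapping the pointers instead of copying the data back. A hedged sketch of the inner reduction loop with that change (same variable names as your excerpt, untested):

```cuda
// Ping-pong variant: reduce from d_temp1 into d_temp2, then swap the
// pointers so the next pass reads the previous pass's output. The initial
// cudaMemcpy from d_sum into d_temp1 is still needed before this loop.
for (int s = linescount; s > 1; ) {
    int numBlocks = s / (2 * numThreadsPerBlock);
    if ((s % (2 * numThreadsPerBlock)) != 0)
        numBlocks++;
    dim3 dimGrid(numBlocks);
    reduce<<<dimGrid, dimBlock, sharedMemSize>>>(d_temp1, d_temp2, s);

    // swap instead of cudaMemcpy(d_temp1, d_temp2, ...)
    int *tmp = d_temp1;
    d_temp1 = d_temp2;
    d_temp2 = tmp;

    s = numBlocks;
}
// After the loop, the final result is in d_temp1[0].
```

Since the swap is a host-side pointer exchange, it costs nothing on the device, and it removes one of the two "GETTING WORSE" lines outright.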