cudaMemcpy slowdown

We have been experiencing some strange behavior when allocating global and texture memory and repeatedly copying from global to texture memory. We have written a small test program that replicates the behavior of our application.

We timed each cudaMemcpyToArray call. For iterations 0-1134 the average time was ~10 microseconds; after that it jumped to ~275 microseconds and stayed there.

We are using a Tesla C1060 with CUDA 2.1.

Does anyone have any insight into why this might be happening?

Thanks!

cudaArray *cuda_arr;
unsigned short *gpu_arr;
unsigned int size_y(1000), stride(1000);

// One 16-bit unsigned channel per element
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned);

cudaMallocArray(&cuda_arr, &channelDesc, stride, size_y);
cudaMalloc((void **)&gpu_arr, stride * size_y * sizeof(unsigned short));

// Repeatedly copy the linear global-memory buffer into the array
for (int i = 0; i < 32000; i++) {
    cudaMemcpyToArray(cuda_arr, 0, 0, gpu_arr, stride * size_y * sizeof(unsigned short), cudaMemcpyDeviceToDevice);
}
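One possibility worth ruling out (an assumption, not something confirmed in this thread) is that the host-side timer measures only how long it takes to enqueue the copy. Device-to-device copies are not synchronized with the host, so early calls may return before the copy has finished. The sketch below times the same loop with CUDA events, which measure elapsed time on the device. The names `bytes`, `start`, `stop`, and `ms` are introduced here for illustration, and the error handling is deliberately minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const unsigned int size_y = 1000, stride = 1000;
    const size_t bytes = stride * size_y * sizeof(unsigned short);

    cudaArray *cuda_arr = NULL;
    unsigned short *gpu_arr = NULL;
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned);

    if (cudaMallocArray(&cuda_arr, &channelDesc, stride, size_y) != cudaSuccess ||
        cudaMalloc((void **)&gpu_arr, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // Events are recorded in the GPU's command stream, so the elapsed
    // time between them reflects device-side execution.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 32000; i++) {
        cudaEventRecord(start, 0);
        cudaMemcpyToArray(cuda_arr, 0, 0, gpu_arr, bytes,
                          cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop, 0);

        // Block the host until the copy and the stop event have completed.
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // Print a sample every 1000 iterations to keep the output short.
        if (i % 1000 == 0) {
            printf("iteration %d: %.1f microseconds\n", i, ms * 1000.0f);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(gpu_arr);
    cudaFreeArray(cuda_arr);
    return 0;
}
```

If the per-iteration time measured this way is roughly constant from the first iteration, the jump seen with a host timer would be consistent with calls queuing up early on and blocking later. Note that `cudaMemcpyToArray` is deprecated in recent CUDA toolkits; it is kept here to match the original test program.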

What OS are you using?

Linux 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:18:33 EDT 2007 x86_64 GNU/Linux