We have been seeing some strange behavior when we allocate global memory and a CUDA array (texture memory) and then repeatedly copy from global memory into the array. We have written a small test program that reproduces the behavior of our application.
We timed each cudaMemcpyToArray call. For iterations 0-1134 the average time was ~10 microseconds; after that it jumped to ~275 microseconds and stayed there.
We are using a Tesla C1060 with CUDA 2.1.
Does anyone have any insight into why this might be happening?
Thanks!
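#include <cuda_runtime.h>

int main() {
    cudaArray *cuda_arr;
    unsigned short *gpu_arr;
    unsigned int size_y(1000), stride(1000);
    // Single-channel, 16-bit unsigned format for the array
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaMallocArray(&cuda_arr, &channelDesc, stride, size_y);
    cudaMalloc((void **)&gpu_arr, stride * size_y * sizeof(unsigned short));
    // Repeatedly copy the linear global-memory buffer into the array
    for (int i = 0; i < 32000; i++) {
        cudaMemcpyToArray(cuda_arr, 0, 0, gpu_arr, stride * size_y * sizeof(unsigned short), cudaMemcpyDeviceToDevice);
    }
    cudaFreeArray(cuda_arr);
    cudaFree(gpu_arr);
    return 0;
}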
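One thing worth checking is what the timer actually measures. The CUDA runtime documentation notes that device-to-device copies perform no host-side synchronization, so a host timer wrapped around the call may capture only the time to enqueue the copy rather than the time the GPU spends on it. The sketch below, which is an illustration and not the program we run, times each copy with CUDA events and waits for every copy to finish, so each sample reflects GPU execution time. Assumptions: the helper name `time_copies` is hypothetical, and the copy is expressed with `cudaMemcpy2DToArray` using a tightly packed pitch, which moves the same bytes as the `cudaMemcpyToArray` call above (newer toolkits deprecate `cudaMemcpyToArray`).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: times each device-to-device copy into a cudaArray with
// CUDA events. Returns per-iteration GPU times in microseconds, or an empty
// vector if allocation fails.
std::vector<float> time_copies(unsigned int width, unsigned int height, int iters) {
    std::vector<float> us;
    const size_t row_bytes = width * sizeof(unsigned short);
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray *arr = nullptr;
    unsigned short *buf = nullptr;
    if (cudaMallocArray(&arr, &desc, width, height) != cudaSuccess) return us;
    if (cudaMalloc((void **)&buf, row_bytes * height) != cudaSuccess) {
        cudaFreeArray(arr);
        return us;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start, 0);
        // Same copy as the original loop: height rows of row_bytes each,
        // source pitch equal to the row size (no padding).
        cudaMemcpy2DToArray(arr, 0, 0, buf, row_bytes, row_bytes, height,
                            cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // block until this copy has actually finished
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        us.push_back(ms * 1000.0f);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    cudaFreeArray(arr);
    return us;
}

int main() {
    // Same 1000 x 1000 array of unsigned shorts as the test program above.
    std::vector<float> us = time_copies(1000, 1000, 2000);
    for (size_t i = 0; i < us.size(); i += 100)
        std::printf("iter %zu: %.1f us\n", i, us[i]);
    return 0;
}
```

Because this version synchronizes after every copy, no backlog of queued work can build up. If these per-copy GPU times stay roughly constant from the first iteration, that would suggest the jump seen with a host timer comes from when the calls start blocking rather than from the copies themselves becoming slower.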