Timeout in CUDA on GTX 465, WinXP

I am using CUDA SDK 3.1 with MS VS2005 and a GTX 465 (1 GB) GPU. I have the following kernel function:

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)


int holo_x = blockIdx.x*20 + threadIdx.x;

int holo_y = blockIdx.y*20 + threadIdx.y;

float k = 2.0f*3.14f/0.000000054f;

if (firstTime[0]==1.0f)




for (int i=0; i<pointsNumber[0]; i++)






and this is the function that calls the kernel:

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)


dim3 blockGridRows(MAX_FINAL_X/20, MAX_FINAL_Y/20);

dim3 threadBlockRows(20, 20);

CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);

CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");

CUDA_SAFE_CALL( cudaThreadSynchronize() );


I call this function in a loop, loading all the parameters on each iteration (for example, 4096 elements per parameter per iteration). In total I want to run this kernel over 32768 elements per parameter across all loop iterations.

MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
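To make the launch size concrete, here is a small host-side sketch in plain C++ (no CUDA calls) that works out the grid implied by these values. The helper names divUp, gridX, gridY, threadsPerLaunch and innerIterationsPerLaunch are made up for illustration, and the work estimate assumes that every thread loops over all points, as the for loop in the kernel suggests.

```cpp
// Constants taken from the post.
const int MAX_FINAL_X = 1920;
const int MAX_FINAL_Y = 1080;
const int BLOCK_EDGE  = 20;   // threads per block edge (20 x 20 = 400 threads)

// Hypothetical helper: round-up division, so a partial tile would still get a block.
inline int divUp(int n, int d) { return (n + d - 1) / d; }

// Number of blocks along each axis of the grid.
inline int gridX() { return divUp(MAX_FINAL_X, BLOCK_EDGE); }
inline int gridY() { return divUp(MAX_FINAL_Y, BLOCK_EDGE); }

// Total threads in one launch (one thread per output pixel here).
inline long long threadsPerLaunch() {
    return static_cast<long long>(gridX()) * gridY() * BLOCK_EDGE * BLOCK_EDGE;
}

// Assumption: each thread iterates over every point, so the work in a
// single launch is roughly threads * points.
inline long long innerIterationsPerLaunch(int points) {
    return threadsPerLaunch() * points;
}
```

Both dimensions divide evenly by 20, so the grid is 96 x 54 blocks and covers 2,073,600 threads. Under the assumption above, 2048 points means roughly 4.2 billion loop iterations in one launch, so the point count per launch directly scales how long the kernel runs.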

When I start the algorithm, the first iteration runs very fast, but after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX 260 and, as far as I remember, it did better there.

Could you help me? Maybe I am making some mistake in this algorithm with respect to the new Fermi architecture?

I solved the timeout problem. I had a mistake in loading the pointsNumber parameter (it was growing with each iteration), and I have now limited the number of points in each iteration to 2048. Is there a way to upload more points without hitting the timeout?
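One common approach is to keep each launch short and issue more launches, because on a GPU that also drives the display the timeout applies to each kernel launch separately. Below is a minimal host-side sketch of the bookkeeping in plain C++. PointChunk and planChunks are hypothetical names, not part of the CUDA SDK, and the idea only works if the kernel accumulates into its output across launches instead of overwriting it (which is presumably what the firstTime flag is for).

```cpp
#include <algorithm>
#include <vector>

// A contiguous range of points handled by one kernel launch.
struct PointChunk {
    int offset;  // index of the first point in this chunk
    int count;   // number of points in this chunk
};

// Hypothetical helper: split totalPoints into chunks of at most maxPerLaunch.
// Returns an empty plan for non-positive arguments.
std::vector<PointChunk> planChunks(int totalPoints, int maxPerLaunch) {
    std::vector<PointChunk> chunks;
    if (totalPoints <= 0 || maxPerLaunch <= 0) return chunks;
    for (int offset = 0; offset < totalPoints; offset += maxPerLaunch) {
        PointChunk c;
        c.offset = offset;
        // The last chunk may be shorter than maxPerLaunch.
        c.count = std::min(maxPerLaunch, totalPoints - offset);
        chunks.push_back(c);
    }
    return chunks;
}
```

For 32768 points and a cap of 2048 per launch this gives 16 chunks. The host loop would then copy one chunk of X, Y, Z and pIntensity at a time, set pointsNumber to that chunk's count, and call go2 once per chunk. The cap itself would need tuning on the actual card.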