Strange problem with matrix in Global Memory

Hello.
Basically I’m working with 2 matrices. I copy them to global memory (the code for one of them is shown):

A_h = (float *) malloc(N * M * sizeof(float));
cudaMalloc((void **) &A_d, N * M * sizeof(float));
cudaMemcpy(A_d, A_h, N * M * sizeof(float), cudaMemcpyHostToDevice);

…kernel_call…

int gid = blockIdx.x * blockDim.x + threadIdx.x;

if (gid < M) {
    for (int i = 0; i < N; i++) {
        A[i * M + gid] = 0.0f;   // for example
    }
}
… kernel return…

cudaMemcpy(A_h, A_d, N * M * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(A_d);
free(A_h);

Well, when N = M = 3000, this last cudaMemcpy fails. If I remove the write to matrix A in the kernel, the program works correctly (I can copy matrix A from device memory to host memory). What’s happening?
I don’t use shared memory, my device has 512 MB of GDDR3, and two 3000 × 3000 float matrices are only about 69 MB! I have tried on a Tesla with 1.5 GB and I get the same problem.
If N = M < 3000 I don’t have any problem, and when I check the result of my algorithm it’s correct. I’ve also tried with -deviceemu and everything works fine.
Of course, I’ve restarted the computer many times.
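
Just to spell out the arithmetic (a plain C sketch using the dimensions from above; nothing here is CUDA-specific):

size_t bytesPerMatrix = (size_t)3000 * 3000 * sizeof(float);   /* 36,000,000 bytes                */
size_t totalBytes     = 2 * bytesPerMatrix;                     /* 72,000,000 bytes, roughly 69 MB */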

Any help?
Thank you so much.

Hello,

I can’t see any coding mistakes either. Are there any other memory allocations?

Try a function like…

void checkCudaError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "CUDA error> %s %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

… to get more information from your device. Call it after every CUDA command, because it only reports the most recent error.
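
For example, wrapped around the calls from your post it could look like this (just a sketch; myKernel, numBlocks and blockSize stand in for your real kernel and launch configuration):

cudaMalloc((void **) &A_d, N * M * sizeof(float));
checkCudaError("cudaMalloc A_d");

cudaMemcpy(A_d, A_h, N * M * sizeof(float), cudaMemcpyHostToDevice);
checkCudaError("cudaMemcpy host to device");

myKernel<<<numBlocks, blockSize>>>(A_d, N, M);
cudaThreadSynchronize();   /* kernel launches are asynchronous, so synchronize before checking */
checkCudaError("kernel launch");

cudaMemcpy(A_h, A_d, N * M * sizeof(float), cudaMemcpyDeviceToHost);
checkCudaError("cudaMemcpy device to host");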

Best regards

OK, now I’m using cutil to check the errors.
The kernel returns “the launch timed out and was terminated”…
Watchdog error? I’m using 2.1 beta!
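
If it really is the display watchdog, one workaround I’m looking at is splitting the loop over the rows into several shorter kernel launches, roughly like this (only a sketch; the kernel name, launch configuration and chunk size are made up):

// Hypothetical kernel that only clears a chunk of rows per launch.
__global__ void myKernel(float *A, int row0, int rows, int M)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < M) {
        for (int i = row0; i < row0 + rows; i++)
            A[i * M + gid] = 0.0f;
    }
}

// Host side: several short launches instead of one long one.
int rowsPerLaunch = 256;   // made-up chunk size
for (int row0 = 0; row0 < N; row0 += rowsPerLaunch) {
    int rows = (N - row0 < rowsPerLaunch) ? (N - row0) : rowsPerLaunch;
    myKernel<<<numBlocks, blockSize>>>(A_d, row0, rows, M);
    cudaThreadSynchronize();
    checkCudaError("chunked kernel launch");
}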