Hello.
Basically I’m working with 2 matrixes. I copy them to Global Memory:
(Code of 1)
A_h = (float ) malloc(N * M * sizeof(float));
cudaMalloc((void **) &A_d, NM * sizeof(float)));
cudaMemcpy(A_d, A_h, N*M * sizeof(float), cudaMemcpyHostToDevice));
…kernel_call…
int gid = (blockIdx.x * blockDim.x + threadIdx.x);
if (gid < M){
for (i=0; i < N; i++){
…
A[i*M + gid] = 0.0f //For example
…
}
}
… kernel return…
cudaMemcpy(A_h, A_d, N*M * sizeof(float), cudaMemcpyDeviceToHost)
cudaFree(A_d);
free(A_h);
Well. When N and M=3000, this last cudaMemcpy fails. If I delete the matrix A instruction in the kernel, the program works correctly (I can copy the matrix A from device memory to host memory). What’s happening?
I don’t use shared memory, my device has 512 MB GDDR3 and 3000x3000*sizeof(float) * 2 matrixes = 69 MB! I have tried in a Tesla with 1.5GB and I have the same problem.
If N=M < 3000 I don’t have any problem. I check the result of my algorithm and it’s correct. I’ve tried with -deviceemu and everything works ok.
Of course, I’ve restarted the computer many times.
Any help?
Thank you so much.