I have code which calculates an expression for each element of a 1000x1000 matrix using all the other matrix elements. It works fine in emulation mode, and also when I use fewer than 6400 threads (dimBlock = 16x16, dimGrid = 5x5). But strange things happen when I launch the full-size calculation with dimGrid = 62x62: the execution time stays the same, and all the results I get are zeros. More than that: if I copy my initial matrix back from device to host, it also contains only zeros. If I do the same with dimGrid = 5x5, the matrix element values are valid.
...
/* Copy data to device. */
cudaMemcpy(therm_d, therm, size, cudaMemcpyHostToDevice);

/* Process data in parallel. */
unsigned int blocksize = 16;
unsigned int nblocks = 62;
dim3 dimBlock(blocksize, blocksize);
dim3 dimGrid(nblocks, nblocks);
compute_therm_cuda_kernel<<<dimGrid, dimBlock>>>(x_max, y_max, z, therm_d, final_result_d);

/* Retrieve result from device. */
cudaMemcpy(final_result_h, final_result_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(therm1, therm_d, size, cudaMemcpyDeviceToHost);
...
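Since the snippet above never checks the CUDA error status, a first step would be to test whether the kernel actually launched and ran to completion. This is a minimal sketch (not my original code; the kernel name and arguments are copied from above, and it assumes the usual CUDA runtime API):

```cpp
compute_therm_cuda_kernel<<<dimGrid, dimBlock>>>(x_max, y_max, z, therm_d, final_result_d);

/* Catch launch-configuration errors (invalid grid/block size, etc.). */
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
}

/* Block until the kernel finishes; this catches errors that occur
   during execution, e.g. a long-running kernel being killed by the
   display driver's watchdog. */
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
}
```

If the synchronize call reports an error only for the 62x62 grid, that would explain why the copies back to the host return zeros.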
What could be the reason for this?
P.S. I'm running this on Arch Linux with X started. Maybe this is the problem?
P.P.S. Sorry for my bad English.
UPD: If I skip the actual calculation and just assign some values to final_result_d, everything is OK.