I have code which calculates an expression for each element of a 1000x1000 matrix using all the other matrix elements. It works fine in emulation mode, and also when I use fewer than 6400 threads (dimBlock = 16x16, dimGrid = 5x5). But strange things happen when I launch the full-size calculation with dimGrid = 62x62: the execution time stays the same, and all the results I get are zeros. More than that: if I copy my initial matrix back from device to host, it also contains only zeros. If I do the same with dimGrid = 5x5, the matrix element values are valid.
...
/* Copy data to device. */
cudaMemcpy(therm_d, therm, size, cudaMemcpyHostToDevice);

/* Process data in parallel. */
unsigned int blocksize = 16;
unsigned int nblocks = 62;
dim3 dimBlock(blocksize, blocksize);
dim3 dimGrid(nblocks, nblocks);
compute_therm_cuda_kernel<<<dimGrid, dimBlock>>>(x_max, y_max, z, therm_d, final_result_d);

/* Retrieve result from device. */
cudaMemcpy(final_result_h, final_result_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(therm1, therm_d, size, cudaMemcpyDeviceToHost);
...
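Since the snippet above never checks the CUDA error status, a first step would be to test whether the kernel actually launched and ran to completion. This is a minimal sketch (not my original code; the kernel name and arguments are copied from above, and it assumes the usual CUDA runtime API):

```cpp
compute_therm_cuda_kernel<<<dimGrid, dimBlock>>>(x_max, y_max, z, therm_d, final_result_d);

/* Catch launch-configuration errors (invalid grid/block size, etc.). */
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
}

/* Block until the kernel finishes; this catches errors that occur
   during execution, e.g. a long-running kernel being killed by the
   display driver's watchdog. */
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
}
```

If the synchronize call reports an error only for the 62x62 grid, that would explain why the copies back to the host return zeros.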
What could be the reason for this?
P.S. I'm running this on Arch Linux with X started. Maybe this is the problem?
P.P.S. Sorry for my bad English.
UPD: If I skip the actual calculation and just assign some values to final_result_d, everything is OK.