Strange behavior when the number of threads increases

I have code which calculates an expression for each element of a 1000x1000 matrix using all the other matrix elements. It works fine in emulation mode and when I use no more than 6400 threads (dimblock = 16x16, dimgrid = 5x5). But strange things begin to happen if I try to launch the full-size calculation with dimgrid = 62x62: the execution time stays the same and all the results I get are zeros. More than that: if I copy my initial matrix back from the device to the host, it contains only zeros. If I do the same with dimgrid = 5x5, the matrix element values are valid.

...

	/* Copy data to device. */
	cudaMemcpy(therm_d, therm, size, cudaMemcpyHostToDevice);

	/* Process data in parallel. */
	unsigned int blocksize = 16;
	unsigned int nblocks = 62;
	dim3 dimBlock(blocksize, blocksize);
	dim3 dimGrid(nblocks, nblocks);
	compute_therm_cuda_kernel <<< dimGrid, dimBlock >>> (x_max, y_max, z, therm_d, final_result_d);

	/* Retrieve result from device. */
	cudaMemcpy(final_result_h, final_result_d, size, cudaMemcpyDeviceToHost);
	cudaMemcpy(therm1, therm_d, size, cudaMemcpyDeviceToHost);

...

What could be the reason for this?

P.S. I'm running this on Arch Linux with X running. Maybe that is the problem?

P.P.S. Sorry for my bad English.

UPD. If I do no actual calculations in the kernel and just assign some values to final_result_d, everything is OK.

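If you have more than 512 threads per block (the limit on compute capability 1.x devices), the kernel will fail to launch and do nothing. With a 16x16 = 256 thread block, you are fine. A 62x62 = 3844 thread block would not work. Note, though, that in the code you posted 62x62 is the grid size and the block is still 16x16, so this limit by itself would not explain the failure.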
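Rather than relying on a remembered number, the limits can be read from the device itself. The following is a minimal sketch, assuming device 0 is the GPU in question and a CUDA toolkit recent enough to provide these runtime calls; it prints the per-block and per-grid limits and checks the 16x16 block and 62x62 grid from the question against them. It also prints the `kernelExecTimeoutEnabled` property, which reports whether a run-time limit on kernels is in effect; that may be relevant to the question about X, but it is only a hint, not a diagnosis.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumption: the GPU of interest is device 0.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("device: %s\n", prop.name);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("kernel run-time limit enabled: %d\n", prop.kernelExecTimeoutEnabled);

    // Launch configuration taken from the question.
    dim3 dimBlock(16, 16);
    dim3 dimGrid(62, 62);

    // A block is valid only if its total thread count and each dimension fit.
    unsigned int threadsPerBlock = dimBlock.x * dimBlock.y * dimBlock.z;
    bool blockOk = threadsPerBlock <= (unsigned int)prop.maxThreadsPerBlock
                && dimBlock.x <= (unsigned int)prop.maxThreadsDim[0]
                && dimBlock.y <= (unsigned int)prop.maxThreadsDim[1];
    bool gridOk = dimGrid.x <= (unsigned int)prop.maxGridSize[0]
               && dimGrid.y <= (unsigned int)prop.maxGridSize[1];

    printf("threads per block: %u (%s)\n", threadsPerBlock,
           blockOk ? "within limits" : "too large");
    printf("grid %u x %u (%s)\n", dimGrid.x, dimGrid.y,
           gridOk ? "within limits" : "too large");
    return 0;
}
```

If both checks report "within limits", the launch configuration is not what is rejecting the kernel, and the error codes returned by the runtime are the next thing to look at.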

Just check for errors:

// check whether kernel execution generated an error
CUT_CHECK_ERROR("Kernel execution failed");
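CUT_CHECK_ERROR comes from the CUDA SDK's cutil utilities. If those are not available, roughly the same check can be done with the runtime API alone. The following is a sketch under stated assumptions: `dummy_kernel` and the `check` helper are hypothetical stand-ins (the real kernel from the question is not shown), and `cudaDeviceSynchronize` assumes a reasonably recent toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: writes 1.0f to every element.
__global__ void dummy_kernel(float *out, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n)              // guard threads that fall past the matrix edge
        out[y * n + x] = 1.0f;
}

// Hypothetical helper: reports a CUDA error and returns false on failure.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1000;              // matrix side, as in the question
    const size_t size = (size_t)n * n * sizeof(float);

    float *out_d = NULL;
    if (!check(cudaMalloc((void **)&out_d, size), "cudaMalloc"))
        return 1;

    dim3 dimBlock(16, 16);
    // Round up so the grid covers the whole matrix: (1000 + 15) / 16 = 63.
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x,
                 (n + dimBlock.y - 1) / dimBlock.y);

    dummy_kernel<<<dimGrid, dimBlock>>>(out_d, n);

    // Launch-time errors (for example an invalid configuration) show up here.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors during execution show up once the host waits for the kernel.
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(out_d);
    return ok ? 0 : 1;
}
```

Checking the return value of every `cudaMemcpy` in the same way would show whether the copies in the question are failing too. As a side note, 62 blocks of 16 threads cover only 992 elements per side, so a 1000x1000 matrix needs 63 blocks per side together with a bounds check like the one in the kernel above.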