Why is the result all zeros?

I am seeing a strange phenomenon: when this program iterates 1024 times or more, the global data in the Upper matrix (2048 * 2048 int) are all zero.
The program is below:
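// Upper is in global memory
// This is an implementation of Floyd's algorithm
// Note: __syncthreads() only synchronizes the threads within one block
__global__ void UpperGPU2withoutWR(type *Upper)
{
    int r, c;
    unsigned int k = 0;
    unsigned int i = 0;
    type temp;
    for (k = 0; k < HERE!!!; k++) // the problem appears when HERE is 1024 or more
    {
        // each thread handles elements i, i + blockDim.x, ... of the N * N matrix
        for (i = threadIdx.x; i < N * N; i += blockDim.x)
        {
            r = i / N;      // row of element i
            c = i - r * N;  // column of element i
            temp = Upper[r * N + k] + Upper[k * N + c];
            if (Upper[i] > temp)
                Upper[i] = temp;
        }
        __syncthreads();
    }
}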
The input matrix Upper looks like this:
9999 1 9999 9999 9999 9999 9999 9999
1 9999 3 9999 9999 9999 9999 9999
9999 3 9999 1 9999 9999 9999 9999
9999 9999 1 9999 3 9999 9999 9999
9999 9999 9999 3 9999 1 9999 9999
9999 9999 9999 9999 1 9999 3 9999
9999 9999 9999 9999 9999 3 9999 1
9999 9999 9999 9999 9999 9999 1 9999
PS: 9999 means unreachable.
It seems that when it iterates too many times, reading global memory fails, maybe.
I know nothing about this. How can I solve this problem?
Thanks

Does your kernel call last for ~5s? Do you get a “launch timed out and was terminated” error after the call? Are you even checking for errors after the kernel call?

See FAQ Programming questions #33
http://forums.nvidia.com/index.php?showtopic=36286

No such error occurred, and I have used CUT_CHECK_ERROR() after the kernel.

I don’t think that is the problem. Do you have any other ideas? Thanks

Did you compile in debug mode, though? CUT_CHECK_ERROR does nothing if NDEBUG is #defined.
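
To rule that out, one option is to check the runtime API return codes directly, so the check works the same in debug and release builds. The following is only a minimal sketch under some assumptions: checkCuda and dummyKernel are hypothetical names made up for illustration (dummyKernel stands in for UpperGPU2withoutWR), and cudaDeviceSynchronize() is the call in current toolkits (older toolkits used cudaThreadSynchronize() for the same purpose).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of cutil): reports a CUDA error
// regardless of whether NDEBUG is defined. Returns true on success.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for the real one (assumption).
__global__ void dummyKernel(int *data)
{
    data[threadIdx.x] = threadIdx.x;
}

int main()
{
    int *d_data = NULL;
    if (!checkCuda(cudaMalloc((void **)&d_data, 64 * sizeof(int)), "cudaMalloc"))
        return 1;

    dummyKernel<<<1, 64>>>(d_data);

    // Errors detected at launch time (e.g. a bad configuration) show up here.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");

    // Kernel launches are asynchronous, so errors that happen while the
    // kernel runs (including a launch timeout) are only reported once the
    // host waits for the kernel to finish.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

If either check prints a message in a release build, the all-zero result is more likely a failed kernel than a global memory read problem.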