I’m trying to do a basic matrix calculation on a square matrix. My kernel works fine for matrices up to about 1000x1000. I’m using CUDA 2.3 on a GTX 295.

However, as soon as I bump it up to 10000x10000, the kernel just returns an equally large matrix of zeros, and the same happens for every larger size. I’m using a block size of 512 threads, which gives me a grid of 202,450 blocks (well below the limit of 65535^2 = 4.29e9).
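One thing worth checking about the grid: the 65535 limit applies to each grid dimension separately, so the 65535^2 total is only reachable when the blocks are spread over both x and y. Below is a minimal host-side sketch of that split. The names `Grid2D`, `fitsInOneDimension` and `splitGrid` are hypothetical helpers for illustration and are not part of the CUDA API.

```cpp
#include <cstdint>

// Per-dimension grid limit quoted above. The product 65535^2 is only
// reachable when the blocks are spread over both x and y.
constexpr std::int64_t kMaxGridDim = 65535;

struct Grid2D {
    std::int64_t x;
    std::int64_t y;
};

// True if `blocks` can be launched as a one-dimensional grid.
constexpr bool fitsInOneDimension(std::int64_t blocks) {
    return blocks <= kMaxGridDim;
}

// A 2D grid (x filled first) with x * y >= blocks.
constexpr Grid2D splitGrid(std::int64_t blocks) {
    return fitsInOneDimension(blocks)
        ? Grid2D{blocks, 1}
        : Grid2D{kMaxGridDim, (blocks + kMaxGridDim - 1) / kMaxGridDim};
}
```

For 202,450 blocks, `splitGrid` gives a 65535 x 4 grid. A kernel launched that way would need to rebuild its linear block id as `blockIdx.y * gridDim.x + blockIdx.x` and make surplus threads return early, since x * y can exceed the number of blocks actually needed.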

Memory-wise, the 10kx10k matrix of doubles takes up 1e8 x 8 bytes = 800 megabytes. It IS currently running on my display device, which has 1.7GB in total, so I have assumed there is enough free memory.
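The footprint is easy to double-check on the host. This is a minimal sketch, assuming a dense n x n matrix of 8-byte doubles. `matrixBytes` and `matrixMegabytes` are hypothetical helper names.

```cpp
#include <cstdint>

// Bytes needed for one dense n x n matrix of doubles, computed in 64-bit
// so the multiplication cannot overflow for large n.
constexpr std::int64_t matrixBytes(std::int64_t n) {
    return n * n * static_cast<std::int64_t>(sizeof(double));
}

// Same figure in decimal megabytes (1 MB = 1,000,000 bytes).
constexpr double matrixMegabytes(std::int64_t n) {
    return static_cast<double>(matrixBytes(n)) / 1.0e6;
}
```

For n = 1000 this gives 8,000,000 bytes per matrix, and for n = 10000 it gives 800,000,000 bytes per matrix. A separate output buffer of the same size would double that.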

I’m not breaking out of the int range (even the 10kx10k only has 1e8 elements, well below the 2.1e9 limit of a basic signed int), so where does this problem stem from?
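The int-range reasoning can be checked the same way. This is a minimal sketch, assuming a 32-bit `int`, a flattened row-major index and 8-byte elements. The helper names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Number of elements in an n x n matrix, computed in 64-bit so the
// check itself cannot overflow.
constexpr std::int64_t elementCount(std::int64_t n) {
    return n * n;
}

// True if the largest flattened element index fits in a signed int.
constexpr bool indexFitsInInt(std::int64_t n) {
    return elementCount(n) - 1 <= std::numeric_limits<int>::max();
}

// True if the largest byte offset (index * sizeof(double)) also fits,
// which matters if any offset arithmetic is done in bytes with int.
constexpr bool byteOffsetFitsInInt(std::int64_t n) {
    return (elementCount(n) - 1) * static_cast<std::int64_t>(sizeof(double))
        <= std::numeric_limits<int>::max();
}
```

For n = 10000 both checks pass, so a plain int index is not the limiting factor at this size. The element index would first overflow somewhere above n = 46340.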

Is there something else I need to be considering in all this?