Matrix Size Limitation

I’m trying to do a basic matrix calculation on a square matrix. My kernel works great up to around 1000x1000 matrices. I’m using CUDA 2.3 on a GTX 295.

However, the moment I bump it up to 10000x10000, my kernel merely returns an equally large matrix of zeros, and it does the same for all sizes above that. I’m using a block size of 512 threads, which gives me a grid of 202,450 blocks (well below the limit of 65535^2 ≈ 4.29e9).
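
For reference, the launch looks roughly like this (a simplified sketch; the kernel and variable names are placeholders, not my actual code):

    #include <cuda_runtime.h>

    // Stand-in for the real kernel: one thread per element.
    __global__ void processKernel(double *out, const double *in, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = in[idx] * 2.0;        // placeholder calculation
    }

    int main()
    {
        const int N = 10000;
        const int n = N * N;                 // 1e8 elements
        const int threadsPerBlock = 512;
        const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        double *d_in = 0, *d_out = 0;
        cudaMalloc((void**)&d_in,  n * sizeof(double));
        cudaMalloc((void**)&d_out, n * sizeof(double));

        // All blocks in a single 1D grid (~200k of them).
        processKernel<<<numBlocks, threadsPerBlock>>>(d_out, d_in, n);
        cudaThreadSynchronize();
        return 0;
    }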

Memory-wise, even the 10k x 10k matrix of doubles only takes up 8 megabytes. It IS currently running on my display device, but I’m sure the device has at least 10MB free out of its 1.7GB.

I’m not breaking out of the int range either (even the 10k x 10k matrix only has 1e8 elements, well below the ~2.1e9 range of a basic signed int), so where does this problem stem from?

Is there something else I need to be considering in all this?

I think your maths is slightly off: 10k x 10k = 100e6 elements * 8 bytes per double = 800e6 bytes, or roughly 763MB. Your card only has 896MB of RAM per GPU, and you presumably need at least an input and an output matrix resident at once, so it seems pretty likely you are running out of memory.
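
If you want to check rather than guess, query the device directly. A quick sketch (assuming your runtime has cudaMemGetInfo; older toolkits only expose the equivalent driver-API call cuMemGetInfo):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);   // memory on the current device
        printf("free: %lu MB / total: %lu MB\n",
               (unsigned long)(freeBytes >> 20), (unsigned long)(totalBytes >> 20));

        // One 10k x 10k matrix of doubles:
        size_t matrixBytes = 10000UL * 10000UL * sizeof(double);        // 800e6 bytes
        printf("one matrix: %lu MB\n", (unsigned long)(matrixBytes >> 20)); // ~763MB
        return 0;
    }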

Argh, I caught that a little later than I should have. The zeros weren’t a memory problem, though… the launch itself was failing because a single grid dimension is capped at 65,535 blocks, and I was asking for ~200,000 in one dimension. I got around it by finally learning how to arrange the blocks into a 2D grid.
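
For anyone who hits the same wall, here is a sketch of what I ended up with (same placeholder names as my earlier snippet): spread the blocks over two grid dimensions so neither exceeds 65,535, then flatten the block index back to a linear element index inside the kernel.

    // Kernel indexing when the blocks are laid out as a 2D grid.
    __global__ void processKernel2D(double *out, const double *in, int n)
    {
        int block = blockIdx.y * gridDim.x + blockIdx.x;  // flatten the 2D block index
        int idx   = block * blockDim.x + threadIdx.x;     // linear element index
        if (idx < n)                                      // the last grid row overshoots
            out[idx] = in[idx] * 2.0;
    }

    // Host side (d_in, d_out, n as before):
    const int threadsPerBlock = 512;
    const int totalBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const int gridX = 1024;                               // anything <= 65535 works
    dim3 grid(gridX, (totalBlocks + gridX - 1) / gridX);
    processKernel2D<<<grid, threadsPerBlock>>>(d_out, d_in, n);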

So NOW I’m memory limited (20k x 20k doubles is 3.2GB, so I can’t get there), but I suppose I can just split the matrix and tackle each quadrant separately. Thanks for the help!
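
For the record, here’s roughly what I have in mind for the split (a sketch assuming an element-wise, in-place operation on a row-major host array h_mat of N x N doubles; anything with cross-quadrant dependencies, like a full matrix multiply, would need more care). cudaMemcpy2D does the strided copy of one quadrant at a time:

    const int N = 20000;                    // full matrix edge
    const int H = N / 2;                    // quadrant edge: 10k
    const size_t rowBytes = H * sizeof(double);

    double *d_quad = 0;
    cudaMalloc((void**)&d_quad, (size_t)H * H * sizeof(double)); // ~800MB per quadrant

    for (int qy = 0; qy < 2; ++qy)
        for (int qx = 0; qx < 2; ++qx)
        {
            // Top-left element of this quadrant in the row-major host matrix.
            double *origin = h_mat + (size_t)qy * H * N + (size_t)qx * H;

            // Gather H rows of H doubles; host rows are N doubles apart.
            cudaMemcpy2D(d_quad, rowBytes,
                         origin, N * sizeof(double),
                         rowBytes, H, cudaMemcpyHostToDevice);

            // ... run the kernel on the H x H quadrant in d_quad ...

            cudaMemcpy2D(origin, N * sizeof(double),
                         d_quad, rowBytes,
                         rowBytes, H, cudaMemcpyDeviceToHost);
        }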