On page 64 of the Programming Guide 1.0, it says "The maximum number of blocks that can run concurrently on a multiprocessor is 8" and "The maximum number of warps that can run concurrently on a multiprocessor is 24". Does that mean the maximum number of blocks is 8 * 16 = 128 on an 8800 GTX (which has 16 multiprocessors)? If so, it greatly limits the size of the matrix multiplication on pages 59-61.
Using that algorithm, one block is 16x16 threads (which is also the number of elements of one submatrix), so even A: 800x800 times B: 800x800 would be too much, because it needs (800/16) x (800/16) = 2500 blocks.
I tried it on the GPU anyway, and it still produces the right answer. However, if I try 8000x8000 multiplied by 8000x8000, it produces all zeros and never stops.
I want to measure the FLOPS of the GPU, so I need a big matrix multiplication. With the constraints mentioned above, I can only make wA bigger, something like 128x800000 multiplied by 800000x96. In that case it produces the right answer, but if I change wA to 960000, I get a segmentation fault. Any idea why?