Hi,

On page 64 of programming guide 1.0, it says “The maximum number of blocks that can run concurrently on a multiprocessor is 8” and “The maximum number of warps that can run concurrently on a multiprocessor is 24”. Does that mean the maximum number of blocks is 8*16=128 on an 8800 GTX? If so, it greatly limits the size of the matrix multiplication on pages 59-61.

Using that algorithm, one block is 16*16 threads (which is also the number of elements in a submatrix), so even A: 800*800 and B: 800*800 would be too much, because that needs (800/16)*(800/16) = 2500 blocks.
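
To make the block count above concrete, here is a small sketch of the arithmetic. It assumes one 16*16 thread block per 16*16 tile of the result matrix, as in the guide's example; `grid_blocks` is just a name I made up for illustration, not anything from CUDA.

```python
def grid_blocks(rows_c, cols_c, block_size=16):
    """Thread blocks needed when each block computes one
    block_size x block_size tile of the result matrix C."""
    # Round up so that partial tiles at the edges still get a block.
    blocks_y = (rows_c + block_size - 1) // block_size
    blocks_x = (cols_c + block_size - 1) // block_size
    return blocks_x * blocks_y


# 8 concurrent blocks per multiprocessor * 16 multiprocessors (my reading of page 64)
concurrent_limit = 8 * 16

for rows_c, cols_c in [(800, 800), (8000, 8000), (128, 96)]:
    n = grid_blocks(rows_c, cols_c)
    print(rows_c, "x", cols_c, "->", n, "blocks; above 128:", n > concurrent_limit)
```

For the 800*800 case this gives 2500 blocks, and for a 128*96 result it gives 48.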

I tried it on the GPU, and it still produces the right answer. However, if I try multiplying 8000*8000 by 8000*8000, it produces all zeros and never stops.

I want to measure the FLOPS of the GPU, so I need a big matrix multiplication. But with the constraint mentioned above, I can only make wA bigger, something like 128*800000 multiplied by 800000*96. In that case it produces the right answer, but if I change wA to 960000, I get a segmentation fault. Any idea why?
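
In case the sizes matter, here is a rough sketch of how much memory these cases need. It assumes 4-byte float elements and that A, B and the result C are all allocated in full; `matmul_bytes` is a hypothetical helper name, and I have not confirmed that memory is what causes the failures.

```python
def matmul_bytes(hA, wA, wB, elem_size=4):
    """Total bytes for A (hA x wA), B (wA x wB) and C (hA x wB),
    assuming elem_size bytes per element (4 for float)."""
    a_bytes = hA * wA * elem_size
    b_bytes = wA * wB * elem_size
    c_bytes = hA * wB * elem_size
    return a_bytes + b_bytes + c_bytes


cases = {
    "128*800000 by 800000*96": (128, 800000, 96),
    "128*960000 by 960000*96": (128, 960000, 96),
    "8000*8000 by 8000*8000": (8000, 8000, 8000),
}

for label, (hA, wA, wB) in cases.items():
    print(label, "->", matmul_bytes(hA, wA, wB), "bytes")
```

Under those assumptions the working case needs 716,849,152 bytes and the wA = 960000 case needs 860,209,152 bytes. The 8800 GTX is a 768 MB card, so this may be relevant.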

Many thanks,

Timtimac