max matrix size in the matrix multiplication example in the programming guide


On page 64 of Programming Guide 1.0, it says “The maximum number of blocks that can run concurrently on a multiprocessor is 8” and “The maximum number of warps that can run concurrently on a multiprocessor is 24”. Does that mean the maximum number of blocks is 8×16 = 128 on an 8800 GTX? If so, it greatly limits the size of the matrix multiplication on pages 59-61.

Using that algorithm, one block is 16×16 threads (which is also the number of elements of a submatrix), so even A: 800×800 times B: 800×800 would be too much, because it needs (800/16)×(800/16) = 2500 blocks.

I tried it on the GPU, and it still produces the right answer. However, if I try 8000×8000 multiplied by 8000×8000, it produces all zeros and never stops.

I want to test the FLOPS of the GPU, so I need a big matrix multiplication. But with the constraints mentioned above, I can only make wA bigger, something like 128×800000 multiplied by 800000×96. In that case it produces the right answer, but if I change wA to 960000, I get a segmentation fault. Any idea why?

Many thanks,


The only limitation on the number of blocks that can be run is the grid dimensions (65535 × 65535)… though I wouldn’t want to try to run such a big grid. The limit you quote from the guide is the maximum number that can run concurrently. That is, all of those warps are on a multiprocessor and being time-sliced. The blocks that aren’t currently running sit in a scheduler pool or something and are started up on a multiprocessor when there is a slot.

For the matrix multiply example, you should be able to push the dimensions as big as there is memory for.

If you want to benchmark SGEMM, use the cublas library.
It is faster and the size of the matrices is limited only by the amount of memory on the card. For best performance, the size should be a multiple of 32.

About “pushing dimensions as big as there is memory for”: since a thread block of 16×16 would use around 16×16×4×2 = 2 KB of shared memory, and there is only 16 KB per multiprocessor, does that mean I can only have (16/2) × 16 = 128 blocks? Or could I have more than that, because the blocks sitting in the scheduler pool don’t count against shared memory? (But then we won’t know how long the blocks in the pool will wait until the ones being time-sliced all terminate?)

On page 60 of the programming guide:

__global__ void Muld(...) {
    ...
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        ...
    }
}

Does it matter whether I move the shared memory float arrays As and Bs before the for loop?

What I don’t understand here is that it seems to me every thread declares As and Bs many times, instead of it being a per-block declaration. Since shared memory is shared by all threads within the same block, does anyone know how nvcc handles this kind of declaration?

I think that’s not completely true. You don’t load the whole matrices into shared memory; only then would you have that limitation. That’s the point of the submatrices, I think.

Cheers, Jordy

Jordy, I don’t think I understand what you mean.

A submatrix of C (calculated by a thread block of 16×16) needs a 1 KB submatrix of A and a 1 KB submatrix of B loaded into shared memory. So with 16 KB of shared memory per multiprocessor on the GTX, and 16 multiprocessors, does that mean matrix C can only have 16/(1+1) × 16 = 128 blocks? Each block covers only 16 by 16 elements, which would mean the size of the matrix multiplication is very limited. Am I right?

Many thanks,