max matrix size in the matrix multiplication example in the programming guide


On page 64 of Programming Guide 1.0, it says “The maximum number of blocks that can run concurrently on a multiprocessor is 8” and “The maximum number of warps that can run concurrently on a multiprocessor is 24”. Does that mean the maximum number of blocks is 8×16 = 128 on an 8800 GTX? If so, it greatly limits the size of the matrix multiplication on pages 59-61.

Using that algorithm, one block is 16×16 threads (which is also the number of elements of a submatrix), so even A: 800×800 times B: 800×800 would be too much, because it needs (800/16)×(800/16) = 2500 blocks.

I tried it on the GPU, and it still produces the right answer. However, if I try 8000×8000 multiplied by 8000×8000, it produces all zeros and never stops.

I want to test the FLOPS of the GPU, so I need a big matrix multiplication. But with the constraints mentioned above, I can only make wA bigger, something like 128×800000 multiplied by 800000×96. In that case it produces the right answer, but if I change wA to 960000, I get a segmentation fault. Any idea why?

Many thanks,


The only limitation on the number of blocks that can be run is the grid dimensions (65535 × 65535)… though I wouldn’t want to try to run such a big grid. The limit you quote from the guide is the maximum number that can run concurrently. That is, all of those warps are on a multiprocessor and being time-sliced. The blocks that aren’t currently running sit in a scheduler pool or something and are started up on a multiprocessor when there is a slot.

For the matrix multiply example, you should be able to push the dimensions as big as there is memory for.

If you want to benchmark SGEMM, use the cublas library.
It is faster and the size of the matrices is limited only by the amount of memory on the card. For best performance, the size should be a multiple of 32.

About “pushing dimensions as big as there is memory for”: since a thread block of 16×16 would use around 16×16×4×2 = 2 KB of shared memory, and there is only 16 KB per multiprocessor, does that mean I can only have (16/2) × 16 = 128 blocks? Or could I have more than that, because the blocks sitting in the scheduler pool don’t count against shared memory? (But then we won’t know how long the blocks in the pool will wait until the ones being time-sliced all terminate?)

On page 60 of the programming guide:

__global__ void Muld(...) {
    ...
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        ...
    }
}

Does it matter whether I move the shared memory float arrays As and Bs before the for loop?

What I don’t understand here is that it seems to me every thread declares As and Bs many times, instead of it being a per-block declaration. Since shared memory is shared by all threads within the same block, does anyone know how nvcc handles this kind of declaration?

I think that’s not completely true. You don’t load the whole matrices into shared memory; only then would you have that limitation. That’s the point of the submatrices, I think.

Cheers, Jordy

Jordy, I don’t think I understand what you mean.

A submatrix of C (calculated by a thread block of 16×16) needs a 1 KB submatrix of A and a 1 KB submatrix of B loaded into shared memory. So with 16 KB of shared memory per multiprocessor on the GTX, and 16 multiprocessors, does that mean matrix C can only have 16/(1+1) × 16 = 128 blocks? Each block covers only 16 by 16 elements, which would mean the size of the matrix multiplication is very limited. Am I right?

Many thanks,