Problems in deciding Gridsize & Blocksize for kernel

Hi Experts,

I have planned to buy a GTX 285 graphics card. I have read somewhere that the GTX 285 is capable of processing 32k threads concurrently. But in my application I use a kernel to multiply two matrices of width = 512 and height = 1000, so in total I need 512k threads to process my kernel concurrently.

So I want to know: do I need to split my matrices and process them with multiple kernels, or can I use a single kernel to process them?

Note: If I can use a single kernel, then how should I choose the grid size & block size for that kernel? :unsure:

Thanks in Advance,


Actually the GTX 285 is capable of executing up to 30×8192=245,760 threads concurrently (depending on the resources needed per thread). And if that turns out to be insufficient, the remaining threads will automatically be launched once the first wave of threads has finished. The maximum number of threads in one kernel launch is 65535×65535×512=2,198,956,147,200, so your needs are easily catered for.

Having said that, there are good reasons performance-wise for using fewer threads. Matrix multiplication is memory bound, so you will want to make optimal use of shared memory, which implies that you want to maximize the amount of work done within each block.
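As a minimal CPU-side sketch of that tiling idea (the `TILE` size and function name are illustrative, not from the thread): each tile of the output is accumulated from small sub-tiles of A and B, so every loaded element is reused `TILE` times. On the GPU, those sub-tiles are what you would stage in shared memory, one block per output tile.

```c
#include <string.h>

#define TILE 16  /* illustrative tile edge; on the GPU this would match blockDim */

/* C = A * B for n x n float matrices, computed tile by tile.
   On a GPU each TILE x TILE sub-tile of A and B would be staged in
   shared memory, so every element is reused TILE times. */
static void matmul_tiled(const float *A, const float *B, float *C, int n) {
    memset(C, 0, sizeof(float) * n * n);
    for (int bi = 0; bi < n; bi += TILE)
        for (int bj = 0; bj < n; bj += TILE)
            for (int bk = 0; bk < n; bk += TILE)
                /* multiply-accumulate one TILE x TILE sub-block */
                for (int i = bi; i < bi + TILE && i < n; i++)
                    for (int j = bj; j < bj + TILE && j < n; j++) {
                        float acc = C[i * n + j];
                        for (int k = bk; k < bk + TILE && k < n; k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```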

Thank you tera.

I understood that 30 is the number of multiprocessors, but I didn't understand where the number 8192 came from. Could you please clarify it? I have read in some documents that "The maximum number of threads that can run concurrently on a multiprocessor is 768." :omg:


Matrix multiplication is limited by the amount of memory available on the card.

The CUDA run-time is smart enough to schedule more blocks (in a queue-like arrangement), surpassing the physical limits…

All you need to look at are the limits for "blockDim.x, .y, .z" and "gridDim.x, .y, .z"…


Oops, my bad. I seem to have copied from the wrong table row. Although for the GTX 285, which has compute capability 1.3, the maximum number of active threads per multiprocessor is actually 1024 (768 is for compute capability 1.0 and 1.1 devices).

Anyway, the important parts of the answer were that kernels can have more threads than can execute concurrently, and that you should think about how you use shared memory to improve performance (which will then tell you how many blocks and threads you want).

OK. Thank you, tera. I think I have to read more documentation about how to use shared memory efficiently.

OK, thank you, Mr. Sarnath.

Only now am I aware of this memory limitation. So the maximum matrix size depends on the memory available on the card. So please tell me: if I use a 1GB graphics card, how big a matrix can I process? (Consider matrix elements of float data type.) Is it limited to 1GB / 4 bytes?

Anyone, please tell me: :ermm:

if I use a 1GB graphics card, how big a matrix can I process? (Consider matrix elements of float data type.) Is it limited to 1GB / 4 bytes?

Yes, roughly. A bit less, as some memory is needed for the code, local and constant memory, the input and result matrices, the primary surface (screen buffer), and other things the driver might need.

Well, if you are multiplying two matrices and storing the product in a third matrix, all in core, the limit is (available memory / 12) elements per matrix, since three square float matrices take 3 × 4 × N² = 12N² bytes. Available memory is anywhere from about 50MB less than the total memory on the card down to zero, depending on whether the card is being shared with a display manager and what operating system you are using.

If you think slightly laterally, an "out-of-core" solution can be done by decomposing the product row-wise. I have successfully multiplied two matrices together to form a 6Gb product matrix using a card with about 800Gb of free memory.
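A host-side sketch of that row-wise decomposition (function name and strip size are illustrative; in a real out-of-core version each strip of A and C would be copied across PCIe as one kernel launch, while B stays resident on the card):

```c
/* Compute C = A * B one horizontal strip of rows at a time.
   With strip_rows rows of A (and C) per strip, only strip_rows * n floats
   of A and C need to be on the device at once; B (n x n) stays resident.
   This is a host-side sketch of the decomposition, not actual CUDA code. */
static void matmul_strips(const float *A, const float *B, float *C,
                          int rows, int n, int strip_rows) {
    for (int r0 = 0; r0 < rows; r0 += strip_rows) {
        int r1 = r0 + strip_rows < rows ? r0 + strip_rows : rows;
        /* on the GPU, this strip would be one upload + kernel + download */
        for (int i = r0; i < r1; i++)
            for (int j = 0; j < n; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }
}
```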

Thank you :rolleyes:

OK. Thank you, Mr. avidday.

I will come back to you after gaining some more knowledge of CUDA programming.

I want one! ;)


I will happily sell it to you, but be warned, the price will probably contain as many orders of magnitude of error as the memory size…