Example of matrix multiplication (max. block_size)

Hi all!

The more I learn, the more questions I have… :rolleyes:
I have studied the official example of matrix multiplication. To test the performance at different data tiling sizes, I changed BLOCK_SIZE to 1, 2, 4, 8, 16 and 32. The program crashes when the tiling size is 32 (32×32 = 1024 threads in a block). I have a Quadro FX 1700 graphics card. My questions are:

  1. Where can I find out the maximum tiling size per block that my graphics card supports? (How much data can I copy from device memory to shared memory at once?)
  2. Is this right: parallel computation in CUDA happens only within a block (thread parallelism), while the blocks themselves run sequentially. That is, one block runs on the GPU only after another block has completely finished its work.
  3. What is the relation between blocks and the GPU architecture? A figure in the programming guide shows blocks running in parallel. Does that contradict statement 2 (if it is right)?

  4. If I have a larger matrix than the one in the example, how can I adapt the example?

Here are the specs of the FX 1700:

Quadro FX 1700
Memory Size 512MB
Memory Interface 128-bit
Graphics Memory Bandwidth 12.8 GB/sec.
Graphics Bus PCI Express 2.0
CUDA Parallel Processor Cores 32

Thx a lot!!

Shared memory is 16 KB per SM. If the tile size is 32×32, then the two shared-memory arrays As[32][32] and Bs[32][32] (floats) need 8 KB, so only one thread block can be put into an SM. (Not two thread blocks per SM, since the parameters of the kernel function also occupy shared memory, so you can use slightly less than 16 KB.)

However, the maximum number of threads per thread block on your card is 512, and 32×32 = 1024 exceeds that. That is why your program crashes: the kernel launch fails.

The basic unit of thread parallelism is a warp (32 threads). A thread block is divided into several warps, and the warp scheduler of the SM selects one warp at a time to execute on the 8 SPs, round-robin.

So you cannot say "the calculation of blocks is still sequential": each SM runs its own blocks, and a GPU with several SMs runs several blocks in parallel. But once a thread block is dispatched to an SM, it does not leave before its work is complete.

Thanks a lot for your help! :shifty:

There are more conditions for a speed-up than I thought :rolleyes:

Can I say that the important benchmarks when selecting a GPU for GPGPU are:

1. the bandwidth of device memory

2. the peak FLOPS

  • number of SMs in the GPU (16 in G80)

  • max. number of threads per SM

3. the size of shared memory per block (16 KB?)

4. the number of clock cycles needed to dispatch an instruction to the threads of a warp

best regards