I have two questions concerning matrix-matrix multiplication (using either the matrixMul example or the CUBLAS routine directly). The tests were performed on randomly generated matrices with dimensions ranging from 16x16 to 4096x4096.
When the results of either the CUBLAS routine or the matrixMul example are compared with the computeGold "simple" implementation, we see an average error of ca. 10e-4 and an error percentage between 40 and 50%. How can this be explained?
When we take the peak performance of the G80 into consideration (345 GFLOPS for the 8800 GTX and 230.4 GFLOPS for the 8800 GTS, respectively), the measured performance is a bit disappointing: 65.5 GFLOPS with CUBLAS sgemm, or even less, 30.2 GFLOPS, with the matrixMul example. Why is that?
The performance of the matmul example is very suboptimal because not all of the available shared memory (and register file) is used.
In blocked algorithms one should read two MxM sub-blocks into the fastest memory, then do 2*M^3 FLOPs on them. Reading 2*M^2 elements thus enables 2*M^3 operations, so the larger M is, the larger the FLOPs-per-read ratio (it grows as M).
In the matmul example only BLOCK_SIZE*BLOCK_SIZE elements are read per tile, and BLOCK_SIZE is limited by the thread count per block (512 on G80). So the maximum possible BLOCK_SIZE is 22 (22*22 = 484), but this value is suboptimal because the hardware wants multiples of 16.
With BLOCK_SIZE=16, matrixMul uses only 16*16*4*2 = 2 KB (of 16 KB) of shared memory per thread block, and 3 blocks can run in parallel (limited by 768 threads per multiprocessor). Also, each thread block uses 16*16*4 = 1 KB of register file for Csub (of course, registers are used for other variables as well).
So we have 48 KB of fast memory per multiprocessor (16 KB of shared memory and 32 KB of register file) and use only 9 KB of it (6 KB of shared memory and 3 KB of register file).
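For reference, here is a condensed sketch of the tiled scheme being described. The As/Bs/Csub names follow the SDK's matrixMul, but this is a simplified illustration assuming square NxN matrices with N a multiple of BLOCK_SIZE, not the SDK code verbatim:

```
#define BLOCK_SIZE 16

// Simplified tiled multiplication C = A * B for square N x N matrices,
// N assumed to be a multiple of BLOCK_SIZE. Each 16x16 thread block
// computes one 16x16 tile of C, staging tiles of A and B in shared memory.
__global__ void matrixMulTiled(float* C, const float* A, const float* B, int N)
{
    // Two BLOCK_SIZE x BLOCK_SIZE staging buffers: 2 * 16*16*4 = 2 KB of shared memory.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    float Csub = 0.0f;  // one accumulator per thread: 16*16*4 = 1 KB of registers per block

    // March the two tiles across A's row band and down B's column band.
    for (int t = 0; t < N / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();

        // 2 * BLOCK_SIZE FLOPs per thread per tile, all operands from shared memory.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = Csub;
}
```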
On my machine (G80 + Asus P5N32SLI Premium + 2.13 GHz C2D) I'm getting 95-100 GFLOPS with CUBLAS' sgemm function for multiplication of square matrices ranging from 512x512 to 4096x4096.
The reason sgemm does not reach peak GFLOPS is that matrix multiplication is memory intensive (the ratio of memory accesses to arithmetic operations is high). You can reach higher GFLOPS with functions that perform more computation per memory access (the same is true for CPUs as well).
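As a rough back-of-envelope (assuming the 8800 GTX's ~86.4 GB/s of peak global-memory bandwidth, and r FLOPs performed per 4-byte float brought in from global memory):

GFLOPS ceiling ~= (86.4 GB/s / 4 B per float) * r = 21.6 * r

With 16x16 tiling (r ~= 16) that ceiling is about 345 GFLOPS on paper; redundant traffic, uncoalesced accesses, and synchronization overhead eat into it quickly, which is part of why measured sgemm numbers land far lower.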
This is only partly true: the number of thread blocks that can run in parallel is determined not only by the block size but also by the registers per thread and the shared memory per block, according to the occupancy calculator.
In the case of matrixMul the number of registers per thread (14) is the limiting factor: it caps the number of thread blocks per multiprocessor at 2!
However, if one tries to save time by putting more threads in a block, choosing a block size of 32x16, the number of active thread blocks per multiprocessor drops to 1 while the number of active warps per multiprocessor stays the same, resulting in the same occupancy ratio.
The question here is whether the warp scheduling (Section 3.2, Execution Model, CUDA Programming Guide) needs less time than the time-slicing method for blocks.
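The arithmetic behind that 2, assuming the G80's 8192 registers per multiprocessor and matrixMul's 16x16 = 256-thread blocks: 256 * 14 = 3584 registers per block, and floor(8192 / 3584) = 2 blocks, since 3 * 3584 = 10752 > 8192.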
Does this mean that multiplying rectangular matrices cannot be as optimal as multiplying square ones? If the application requires that a rectangular block size be chosen, is there a way around the latency you have mentioned above?
How do you compute the error? The matrixMul sample from SDK 0.8.1 uses L2-norm and the error is below 1e-6.
matrixMul has been written for clarity of exposition to illustrate basic CUDA principles, not for performance (it’s actually the same code as Chapter 7 of the programming guide).
CUBLAS’ SGEMM is written for performance and will be improved in the next release as mentioned by Massimiliano.
As explained in this thread, performance for matrixMul can be increased by using a 32x32 block and having each thread process more matrix elements.
I am calculating the absolute differences between the gold and kernel versions. For bigger matrices I was concerned that squaring and extracting the root might take too long.
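For comparison, here is a minimal host-side sketch of both metrics; the function names and the tolerance parameter are illustrative, not from the SDK:

```
#include <math.h>

// Relative L2 error: ||ref - res||_2 / ||ref||_2 (what an L2-norm check reports).
float relativeL2Error(const float* ref, const float* res, int n)
{
    double diff2 = 0.0, ref2 = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = (double)ref[i] - (double)res[i];
        diff2 += d * d;
        ref2  += (double)ref[i] * (double)ref[i];
    }
    return (float)sqrt(diff2 / ref2);
}

// Fraction of elements whose absolute difference exceeds a tolerance.
// With a tight tolerance, mere reordering of floating-point sums can push
// this percentage up even when the relative L2 error is tiny.
float fractionExceeding(const float* ref, const float* res, int n, float tol)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (fabsf(ref[i] - res[i]) > tol)
            ++count;
    return (float)count / (float)n;
}
```

Note that the square root runs once on the host over the whole matrix, so its cost is negligible next to the O(n) accumulation loop.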
A block with dimensions 32x32 would exceed the maximum allowed number of threads per block: 32x32 = 1024 > 512 = 32x16, which is the physical limit for the G80 as stated in the occupancy calculator?
According to the occupancy calculator, three factors determine occupancy: shared memory per block, registers per thread, and threads per block.
In the case of matrixMul, the shared memory allocation and the number of threads do not seem to restrict occupancy; the number of registers used does.
So how could I increase performance other than by changing the block dimensions, or is that the only way?
Sorry, I was ambiguous in my previous post: By “32x32 block”, I meant “32x32 submatrix”.
And if you use a 32x32 submatrix, indeed, you can no longer maintain a one-to-one correspondence between threads and elements in the submatrix, as matrixMul does to keep the code simple. You need to have each thread process more matrix elements: you could have 256 threads that read and process 4 elements each, for example.
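A hedged sketch of that idea (not the actual CUBLAS or SDK code): a 32x32 tile served by 16x16 = 256 threads, each covering a 2x2 footprint of the tile, assuming N is a multiple of 32:

```
#define TILE 32          // submatrix dimension
#define THREADS 16       // 16x16 = 256 threads per block

// Each block computes one 32x32 tile of C with only 256 threads:
// every thread loads and accumulates 4 elements (one per 16x16 quadrant).
__global__ void matrixMul32(float* C, const float* A, const float* B, int N)
{
    __shared__ float As[TILE][TILE];   // 4 KB
    __shared__ float Bs[TILE][TILE];   // 4 KB

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // upper-left quadrant coordinates
    int col = blockIdx.x * TILE + tx;

    float c00 = 0.f, c01 = 0.f, c10 = 0.f, c11 = 0.f;  // 4 accumulators per thread

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages its 2x2 footprint of the A and B tiles.
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j) {
                As[ty + i*THREADS][tx + j*THREADS] =
                    A[(row + i*THREADS) * N + t*TILE + tx + j*THREADS];
                Bs[ty + i*THREADS][tx + j*THREADS] =
                    B[(t*TILE + ty + i*THREADS) * N + col + j*THREADS];
            }
        __syncthreads();

        // 4 output elements per thread => 4x the FLOPs per tile loaded.
        for (int k = 0; k < TILE; ++k) {
            float a0 = As[ty][k],  a1 = As[ty + THREADS][k];
            float b0 = Bs[k][tx],  b1 = Bs[k][tx + THREADS];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        __syncthreads();
    }

    C[row * N + col]                       = c00;
    C[row * N + col + THREADS]             = c01;
    C[(row + THREADS) * N + col]           = c10;
    C[(row + THREADS) * N + col + THREADS] = c11;
}
```

With 8 KB of shared memory per block, likely only one such block fits per multiprocessor, but each float staged from global memory now feeds twice as many FLOPs (ratio 32 instead of 16).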
If one increases the submatrix dimensions, fewer transfers between shared and global memory are needed and the data staged in shared memory is reused more.
However, this cannot be seen in the occupancy calculator, as the number of registers per thread is the limiting factor in the case of matrixMul. :wacko: