I have two questions concerning matrix-matrix multiplication (using either the matrixMul example or the CUBLAS routine directly). The tests were performed on randomly generated matrices with dimensions ranging from 16x16 to 4096x4096.
When the results of either the CUBLAS routine or the matrixMul example are compared with the computeGold "simple" implementation, we see an average error of ca. 10e-4 and an error percentage between 40 and 50%. How can this be explained?
When we take the peak performance of the G80 into consideration (345 GFLOPS for the 8800 GTX and 230.4 GFLOPS for the 8800 GTS, respectively), the measured performance is a bit disappointing: 65.5 GFLOPS with CUBLAS sgemm, or even less, 30.2 GFLOPS, with the matrixMul example. Why is that?
The performance of the matmul example is very suboptimal because not all of the available shared memory (and register file) is used.
In blocked algorithms one should read two MxM sub-blocks into the fastest memory, then do 2*M^3 FLOPs on them. Reading 2*M^2 elements thus enables 2*M^3 operations, so the larger M is, the larger the FLOPs-per-read ratio (it grows as M).
In the matmul example only BLOCK_SIZE*BLOCK_SIZE elements are read per tile, and BLOCK_SIZE is limited by the thread count per block (512 on G80). So the maximum possible BLOCK_SIZE is 22 (22*22 = 484), but this value is suboptimal because the hardware wants multiples of 16.
With BLOCK_SIZE=16, matrixMul uses only 16*16*4*2 = 2 KB (of 16 KB) of shared memory per thread block, and 3 blocks can run in parallel (limited by 768 threads per multiprocessor). Also, each thread block uses 16*16*4 = 1 KB of register file for Csub (of course, registers are used for other variables as well).
So we have 48 KB of fast memory per multiprocessor (16 KB of shared memory and 32 KB of register file) and use only 9 KB of it (6 KB of shared memory and 3 KB of register file).
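For reference, here is a condensed sketch of the tiled scheme being described. The As/Bs/Csub names follow the SDK's matrixMul, but this is a simplified illustration assuming square NxN matrices with N a multiple of BLOCK_SIZE, not the SDK code verbatim:

```
#define BLOCK_SIZE 16

// Simplified tiled multiplication C = A * B for square N x N matrices,
// N assumed to be a multiple of BLOCK_SIZE. Each 16x16 thread block
// computes one 16x16 tile of C, staging tiles of A and B in shared memory.
__global__ void matrixMulTiled(float* C, const float* A, const float* B, int N)
{
    // Two BLOCK_SIZE x BLOCK_SIZE staging buffers: 2 * 16*16*4 = 2 KB of shared memory.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    float Csub = 0.0f;  // one accumulator per thread: 16*16*4 = 1 KB of registers per block

    // March the two tiles across A's row band and down B's column band.
    for (int t = 0; t < N / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();

        // 2 * BLOCK_SIZE FLOPs per thread per tile, all operands from shared memory.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = Csub;
}
```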
On my machine (G80 + Asus P5N32SLI Premium + 2.13 GHz C2D) I'm getting 95-100 GFLOPS with CUBLAS' sgemm function for multiplication of square matrices ranging from 512x512 to 4096x4096.
The reason sgemm does not reach peak GFLOPS is that matrix multiplication is memory intensive (the ratio of memory accesses to arithmetic operations is high). You can reach higher GFLOPS with functions that perform more computation per memory access (the same is true for CPUs as well).
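As a rough back-of-envelope (assuming the 8800 GTX's ~86.4 GB/s of peak global-memory bandwidth, and r FLOPs performed per 4-byte float brought in from global memory):

GFLOPS ceiling ~= (86.4 GB/s / 4 B per float) * r = 21.6 * r

With 16x16 tiling (r ~= 16) that ceiling is about 345 GFLOPS on paper; redundant traffic, uncoalesced accesses, and synchronization overhead eat into it quickly, which is part of why measured sgemm numbers land far lower.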
This is only partly true: the number of thread blocks that can run in parallel is determined not only by the block size but also by the registers per thread and the shared memory per block, according to the occupancy calculator.
In the case of matrixMul the number of registers per thread (14) is the limiting factor: it caps the number of thread blocks per multiprocessor at 2!
However, if one tries to save time by putting more threads in a block, choosing a block size of 32x16, the number of active thread blocks per multiprocessor drops to 1 while the number of active warps per multiprocessor stays the same, resulting in the same occupancy ratio.
The question here is whether the warp scheduling (Section 3.2, Execution Model, CUDA Programming Guide) needs less time than the time-slicing method for blocks.
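The arithmetic behind that 2, assuming the G80's 8192 registers per multiprocessor and matrixMul's 16x16 = 256-thread blocks: 256 * 14 = 3584 registers per block, and floor(8192 / 3584) = 2 blocks, since 3 * 3584 = 10752 > 8192.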
Does this mean that multiplying rectangular matrices cannot be as optimal as multiplying square ones? If the application requires that a rectangular block size be chosen, is there a way around the latency you have mentioned above?
How do you compute the error? The matrixMul sample from SDK 0.8.1 uses L2-norm and the error is below 1e-6.
matrixMul has been written for clarity of exposition to illustrate basic CUDA principles, not for performance (it’s actually the same code as Chapter 7 of the programming guide).
CUBLAS’ SGEMM is written for performance and will be improved in the next release as mentioned by Massimiliano.
As explained in this thread, performance for matrixMul can be increased by using a 32x32 block and having each thread process more matrix elements.
I am calculating the absolute differences between the gold and kernel versions. For bigger matrices I was concerned that squaring and extracting the root might take too long.
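For comparison, here is a minimal host-side sketch of both metrics; the function names and the tolerance parameter are illustrative, not from the SDK:

```
#include <math.h>

// Relative L2 error: ||ref - res||_2 / ||ref||_2 (what an L2-norm check reports).
float relativeL2Error(const float* ref, const float* res, int n)
{
    double diff2 = 0.0, ref2 = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = (double)ref[i] - (double)res[i];
        diff2 += d * d;
        ref2  += (double)ref[i] * (double)ref[i];
    }
    return (float)sqrt(diff2 / ref2);
}

// Fraction of elements whose absolute difference exceeds a tolerance.
// With a tight tolerance, mere reordering of floating-point sums can push
// this percentage up even when the relative L2 error is tiny.
float fractionExceeding(const float* ref, const float* res, int n, float tol)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (fabsf(ref[i] - res[i]) > tol)
            ++count;
    return (float)count / (float)n;
}
```

Note that the square root runs once on the host over the whole matrix, so its cost is negligible next to the O(n) accumulation loop.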
A block with dimensions 32x32 would exceed the maximum allowed number of threads per block: 32x32 = 1024 > 512 = 32x16, which is the physical limit for the G80 as stated in the occupancy calculator?
According to the occupancy calculator, three factors determine occupancy: shared memory per block, registers per thread, and threads per block.
In the case of matrixMul, the shared memory allocation and the number of threads do not seem to restrict occupancy; the number of registers used does.
So how could I increase performance other than by changing the block dimensions, or is that the only way?
Sorry, I was ambiguous in my previous post: By “32x32 block”, I meant “32x32 submatrix”.
And if you use a 32x32 submatrix, indeed, you can no longer maintain a one-to-one correspondence between threads and elements in the submatrix, as matrixMul does to keep the code simple. You need to have each thread process more matrix elements: you could have 256 threads that read and process 4 elements each, for example.
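A hedged sketch of that idea (not the actual CUBLAS or SDK code): a 32x32 tile served by 16x16 = 256 threads, each covering a 2x2 footprint of the tile, assuming N is a multiple of 32:

```
#define TILE 32          // submatrix dimension
#define THREADS 16       // 16x16 = 256 threads per block

// Each block computes one 32x32 tile of C with only 256 threads:
// every thread loads and accumulates 4 elements (one per 16x16 quadrant).
__global__ void matrixMul32(float* C, const float* A, const float* B, int N)
{
    __shared__ float As[TILE][TILE];   // 4 KB
    __shared__ float Bs[TILE][TILE];   // 4 KB

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // upper-left quadrant coordinates
    int col = blockIdx.x * TILE + tx;

    float c00 = 0.f, c01 = 0.f, c10 = 0.f, c11 = 0.f;  // 4 accumulators per thread

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages its 2x2 footprint of the A and B tiles.
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j) {
                As[ty + i*THREADS][tx + j*THREADS] =
                    A[(row + i*THREADS) * N + t*TILE + tx + j*THREADS];
                Bs[ty + i*THREADS][tx + j*THREADS] =
                    B[(t*TILE + ty + i*THREADS) * N + col + j*THREADS];
            }
        __syncthreads();

        // 4 output elements per thread => 4x the FLOPs per tile loaded.
        for (int k = 0; k < TILE; ++k) {
            float a0 = As[ty][k],  a1 = As[ty + THREADS][k];
            float b0 = Bs[k][tx],  b1 = Bs[k][tx + THREADS];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        __syncthreads();
    }

    C[row * N + col]                       = c00;
    C[row * N + col + THREADS]             = c01;
    C[(row + THREADS) * N + col]           = c10;
    C[(row + THREADS) * N + col + THREADS] = c11;
}
```

With 8 KB of shared memory per block, likely only one such block fits per multiprocessor, but each float staged from global memory now feeds twice as many FLOPs (ratio 32 instead of 16).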
If one increases the submatrix dimensions, fewer transfers between shared and global memory are needed and the data staged in shared memory is reused more.
However, this cannot be seen in the occupancy calculator, as the number of registers per thread is the limiting factor in the case of matrixMul. :wacko: