Determining the most effective tile size for dense linear-algebra kernels (e.g. matrix multiplication)

Hi,

Assume a simple matrix multiplication of square dense (non-sparse) matrices, e.g. A*B = C with n = m (n, m being the matrix dimensions).


The attached image shows the performance gain obtained by "chopping" the input matrices into tile blocks. I am not sure how to interpret this: what is the underlying mechanism?
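Since the post does not include the kernel itself, a minimal sketch of the usual shared-memory tiling scheme may help frame the question (names such as `TILE` and `matmul_tiled` are placeholders, and divisibility of n by TILE is assumed). The point of tiling is data reuse: each block stages a TILE x TILE sub-tile of A and B in fast on-chip shared memory, so every global-memory element is reused TILE times instead of being re-fetched for each output element.

```cuda
#define TILE 16  // tile width; the sweep in the attached plot varies this value

// One thread computes one element of C. Each block cooperatively loads
// matching TILE x TILE sub-tiles of A and B into shared memory, then all
// threads in the block accumulate partial dot products from the staged tiles.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x read consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tile fully staged before any thread uses it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all threads done with this tile before overwriting
    }
    C[row * n + col] = acc;
}
```

With this structure, the tile width simultaneously sets the block size (TILE*TILE threads), the shared-memory footprint per block, and the reuse factor, which is why a single sweep over tile sizes couples several hardware effects at once.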

Why does a division into 2x2 or 4x4 tiles perform similarly to the naive (that is, untiled) multiplication? Most importantly, why is there such a peak in the gain for the 16x16 tiling?

I understand that a 16x16 tile needs 16*16/32 = 8 warps (an integer number), but an 8x8 tiling also results in an integer number of warps (8*8/32 = 2), yet with approximately 3 times less gain in GFLOP/s compared to the 16x16 case.

Could this be related to the stride used, whenever it is an exact multiple of the warp size? Any pointers to literature would be much appreciated.

Thanks for any feedback!

N

ps. "the asymptotic time complexity of all dense direct methods is O(n^3) for the (say LU) factorization and O(n^2) for solving the system based on the precomputed factorization"
