We developed a version of matrix multiplication that exploits the bandwidth of constant memory.
The constant cache can reach about 2.4 TB/s, provided accesses follow the patterns it serves well: broadcast (every thread in a warp reads the same address) and sequential.
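As a rough illustration (this is my own sketch, not the shipped mm_const kernel; `c_tile`, `dot_tile`, and the indexing are all made up), a loop in which every thread of a warp reads the same constant-memory element per iteration hits both patterns at once:

```cuda
// Illustrative sketch only -- not the actual mm_const kernel.
__constant__ float c_tile[128 * 16];   // one 128x16 tile of A (8 KB)

__global__ void dot_tile(const float *B, float *C, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    float acc = 0.0f;
    // On each iteration every thread in the warp reads the SAME c_tile[k]
    // (broadcast), and k advances by one (sequential) -- the two access
    // patterns the constant cache serves at full rate.
    for (int k = 0; k < 128; ++k)
        acc += c_tile[k] * B[k * n + col];
    C[col] = acc;
}
```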
The algorithm computes A*B=C, where A and B are the input matrices and C is the output matrix.
We partition A into multiple 128x16 tiles and place these tiles in constant memory.
Because of the constant memory size limit (64 KB), we use only four 128x16 constant buffers and run 4 kernels concurrently.
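A minimal host-side sketch of that staging scheme (my own illustration, assuming one constant-memory slice per tile and one stream per kernel; the names and the B indexing are hypothetical, not the shipped code):

```cuda
#include <cuda_runtime.h>

#define TILE_ELEMS (128 * 16)               // 8 KB per tile; 4 tiles = 32 KB,
                                            // well under the 64 KB constant limit
__constant__ float c_tileA[4][TILE_ELEMS];  // hypothetical layout

__global__ void mm_tile(const float *B, float *C, int n, int which)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < TILE_ELEMS; ++k)    // broadcast + sequential reads
        acc += c_tileA[which][k] * B[(size_t)k * n + col];
    C[col] += acc;
}

void launch_four(const float *h_tiles, const float *d_B, float *d_C,
                 int n, cudaStream_t s[4])
{
    for (int i = 0; i < 4; ++i) {
        // Stage tile i into its slice of constant memory, then launch a
        // kernel that reads only that slice; the four launches overlap.
        cudaMemcpyToSymbolAsync(c_tileA, h_tiles + i * TILE_ELEMS,
                                TILE_ELEMS * sizeof(float),
                                (size_t)i * TILE_ELEMS * sizeof(float),
                                cudaMemcpyHostToDevice, s[i]);
        mm_tile<<<(n + 255) / 256, 256, 0, s[i]>>>(d_B, d_C, n, i);
    }
}
```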
We also optimized the code for specific input sizes (the width of B is 4K, 5K, ..., 8K) (mm_const_xK).
To try the code, put mm_const into "NVIDIA_GPU_Computing_SDK/C/src/", compile it with "make clean; make maxregisters=32", and run "../../bin/linux/release/mm_const".
Some results on a GTX 480 (CUBLAS 3.2 is a great improvement over CUBLAS 3.1):

A (128, 1024),   B (131072, 128),  C (131072, 1024):  cublas3.2 = 739 GFLOPS,  const = 1013 GFLOPS
A (1024, 1024),  B (131072, 1024), C (131072, 1024):  cublas3.2 = 726 GFLOPS,  const = 952 GFLOPS
A (7168, 7168),  B (7168, 7168),   C (7168, 7168):    cublas3.2 = 823 GFLOPS,  const = 904 GFLOPS
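For context, the arithmetic behind those figures is easy to check with a short script (my own sketch, not part of mm_const; it just counts 128x16 tiles and the usual 2*m*k*n flops for a matrix multiply):

```python
# Sketch: tile and flop counts for the benchmark shapes above.
TILE_ROWS, TILE_COLS = 128, 16          # constant-memory tile shape used for A

def num_tiles(rows, cols):
    """How many 128x16 tiles partition an A of shape (rows, cols)."""
    return (rows // TILE_ROWS) * (cols // TILE_COLS)

def total_flops(m, k, n):
    """Flops for C(m, n) = A(m, k) * B(k, n), one multiply + one add each."""
    return 2 * m * k * n

# First result row: A (128, 1024), B (131072, 128), C (131072, 1024)
print(num_tiles(128, 1024))             # tiles of A
print(total_flops(131072, 128, 1024))   # total flops for that multiply
```

Dividing that flop total by the measured runtime gives the GFLOPS numbers quoted in the table.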