speedy CGEMM reaches 448 Gflop/s

I would like to report our CGEMM routine for GT200 GPUs. This work extends our SGEMM routine, which can be downloaded from
[url=“http://forums.nvidia.com/index.php?showtopic=159033”]http://forums.nvidia.com/index.php?showtopic=159033[/url] .

In this work, we confirm the dual-issue effect of the following remarkable instruction pattern:
“MOV reg, [smem]”
“MAD dest, src1, src2, src3”
“MAD dest, src1, src2, src3”
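For readers wondering why MAD throughput matters so much for CGEMM: one complex multiply-accumulate c += a*b decomposes into four real multiply-adds, so the inner loop is dominated by MAD instructions. The sketch below shows only that decomposition in plain C. The type name `cfloat` and the function `cmad` are hypothetical, chosen for illustration; this is not the kernel from the post, and how the MADs are actually scheduled against shared-memory MOVs is described in the technical report, not here.

```c
/* Hypothetical helper type: a single-precision complex number. */
typedef struct { float re, im; } cfloat;

/* Returns c + a*b, written as four real multiply-adds
 * (two for the real part, two for the imaginary part). */
static cfloat cmad(cfloat a, cfloat b, cfloat c)
{
    c.re =  a.re * b.re + c.re;  /* MAD 1: real part, first term   */
    c.re = -a.im * b.im + c.re;  /* MAD 2: real part, second term  */
    c.im =  a.re * b.im + c.im;  /* MAD 3: imag part, first term   */
    c.im =  a.im * b.re + c.im;  /* MAD 4: imag part, second term  */
    return c;
}
```

For example, with a = 1 + 2i, b = 3 + 4i and c = 0, `cmad` returns -5 + 10i.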

Figure 1 shows the performance (Gflop/s) of our method on the Tesla C1060, GTX285 and GTX295. The baseline is Volkov’s code on the Tesla C1060 (black dashed line).
Our method reaches 448 Gflop/s on the Tesla C1060, whereas CUBLAS (CUDA 2.3) reaches only 277.7 Gflop/s.
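As a rough guide to reading these numbers, here is a minimal sketch of how a CGEMM Gflop/s figure is usually derived from a timing. It assumes the conventional count of 8*m*n*k real flops for a complex m x k by k x n product (4 multiplies and 4 adds per complex multiply-add); the post does not state which count was used, so treat that as an assumption. The function names `cgemm_gflops` and `speedup` are hypothetical.

```c
/* Gflop/s for a complex matrix product of sizes m x k and k x n,
 * ASSUMING the conventional count of 8*m*n*k real flops
 * (this count is not stated in the post). */
static double cgemm_gflops(double m, double n, double k, double seconds)
{
    return 8.0 * m * n * k / seconds / 1e9;
}

/* Ratio of two throughput figures, e.g. ours vs. a baseline. */
static double speedup(double ours_gflops, double baseline_gflops)
{
    return ours_gflops / baseline_gflops;
}
```

With the figures quoted above, `speedup(448.0, 277.7)` is roughly 1.6, i.e. about 1.6 times the CUBLAS (CUDA 2.3) throughput on the same card.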


Figure 1: Comparison of CGEMM performance between our method and Volkov’s code

The technical report and source code can be downloaded from:
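(1) technical report: [url=“http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/HandTunedCgemm_2010_v1.pdf”]http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/HandTunedCgemm_2010_v1.pdf[/url]
(2) source code: [url=“http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/lsc_cgemm_v2.zip”]http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/lsc_cgemm_v2.zip[/url]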

Hope that you can give me some comments and criticism.

Lung-Sheng Chien
HandTunedCgemm_2010_v1.pdf (1.6 MB)

“CUBALS” :)

Interesting read!