Possible cuBLAS DGEMM regression in CUDA 5 (vs. CUDA 4)

I ran the testing_dgemm test from magma-1.2.1, which reports DGEMM throughput in GFLOP/s for both MAGMA and cuBLAS. The node has three Tesla M2090 GPUs (compute capability 2.0). The reports are as follows:

FOR CUDA4:


device 0: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
device 1: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
device 2: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0

Usage:
testing_dgemm [-NN|NT|TN|TT] [-N 1024]

Testing transA = N transB = N
   M     N     K   MAGMA GFLop/s   CUBLAS GFlop/s   error

1024  1024  1024          365.34           389.96   0.000000e+00
1280  1280  1280          375.16           399.27   0.000000e+00
1600  1600  1600          369.64           401.18   0.000000e+00
2000  2000  2000          362.72           397.91   0.000000e+00
2500  2500  2500          362.19           349.54   0.000000e+00
3125  3125  3125          375.99           348.08   0.000000e+00
3906  3906  3906          368.68           337.74   0.000000e+00
4882  4882  4882          374.77           374.42   5.684342e-14
6102  6102  6102          377.86           377.03   5.684342e-14

FOR CUDA5:


device 0: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
device 1: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0
device 2: Tesla M2090, 1301.0 MHz clock, 5375.4 MB memory, capability 2.0

Usage:
testing_dgemm [-NN|NT|TN|TT] [-N 1024]

Testing transA = N transB = N
   M     N     K   MAGMA GFLop/s   CUBLAS GFlop/s   error

1024  1024  1024          366.40           390.66   0.000000e+00
1280  1280  1280          374.29           397.83   0.000000e+00
1600  1600  1600          369.01           400.59   0.000000e+00
2000  2000  2000          362.09           396.47   0.000000e+00
2500  2500  2500          361.64           347.95   0.000000e+00
3125  3125  3125          375.37           346.35   0.000000e+00
3906  3906  3906          368.10           338.53   0.000000e+00
4882  4882  4882          374.17           294.34   0.000000e+00
6102  6102  6102          377.27           303.73   0.000000e+00

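MAGMA reports essentially the same GFLOP/s under CUDA 4 and CUDA 5 (within about 0.3% at every size). cuBLAS also matches up to N = 3906 (within about 0.5%), but at the two largest sizes it is clearly slower under CUDA 5: 294.34 vs. 374.42 GFLOP/s at N = 4882 (about 21% lower) and 303.73 vs. 377.03 GFLOP/s at N = 6102 (about 19% lower).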
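
To make the comparison easy to re-check, here is a small Python sketch that computes the relative change per matrix size from the numbers in the two reports. The function names (percent_change, compare) are just illustrative helpers; the data is copied by hand from the tables above.

```python
# (N, MAGMA GFLOP/s, cuBLAS GFLOP/s), copied from the CUDA 4 report
cuda4 = [
    (1024, 365.34, 389.96), (1280, 375.16, 399.27), (1600, 369.64, 401.18),
    (2000, 362.72, 397.91), (2500, 362.19, 349.54), (3125, 375.99, 348.08),
    (3906, 368.68, 337.74), (4882, 374.77, 374.42), (6102, 377.86, 377.03),
]
# Same columns, copied from the CUDA 5 report
cuda5 = [
    (1024, 366.40, 390.66), (1280, 374.29, 397.83), (1600, 369.01, 400.59),
    (2000, 362.09, 396.47), (2500, 361.64, 347.95), (3125, 375.37, 346.35),
    (3906, 368.10, 338.53), (4882, 374.17, 294.34), (6102, 377.27, 303.73),
]


def percent_change(old, new):
    """Relative change from old to new in percent (negative means slower)."""
    return (new - old) / old * 100.0


def compare(before, after):
    """Return (N, MAGMA change %, cuBLAS change %) for each matrix size."""
    rows = []
    for (n, magma_old, cublas_old), (n2, magma_new, cublas_new) in zip(before, after):
        assert n == n2, "reports must list the same sizes in the same order"
        rows.append((n,
                     percent_change(magma_old, magma_new),
                     percent_change(cublas_old, cublas_new)))
    return rows


if __name__ == "__main__":
    for n, d_magma, d_cublas in compare(cuda4, cuda5):
        print(f"{n:5d}  MAGMA {d_magma:+6.2f}%  cuBLAS {d_cublas:+6.2f}%")
```

Only the last two rows should show a cuBLAS change of more than a fraction of a percent.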

I also have other tests in my own package that call cuBLAS dgemm, and they show a similar regression. In addition, the profiling output suggests that the dtrmm routine in cuBLAS 5 involves some calls from MAGMA.
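
For anyone comparing their own dgemm timings against the figures above, this is a minimal sketch of the conversion between wall-clock time and GFLOP/s. It assumes the conventional 2*M*N*K operation count for a matrix multiply; whether testing_dgemm uses exactly this count is an assumption, and both function names are hypothetical helpers.

```python
def dgemm_gflops(m, n, k, seconds):
    """GFLOP/s for one M x K times K x N multiply.

    Assumes the conventional 2*M*N*K flop count (one multiply and one add
    per inner-product term).
    """
    return 2.0 * m * n * k / seconds / 1e9


def dgemm_seconds(m, n, k, gflops):
    """Inverse of dgemm_gflops: time implied by a reported GFLOP/s rate."""
    return 2.0 * m * n * k / (gflops * 1e9)


if __name__ == "__main__":
    # Time per call implied by the cuBLAS rates reported at N = 6102
    n = 6102
    for label, rate in (("CUDA 4", 377.03), ("CUDA 5", 303.73)):
        print(f"{label}: {dgemm_seconds(n, n, n, rate):.3f} s per dgemm")
```

Converting the rates back to seconds shows how much longer each call takes, which can be easier to match against profiler output than GFLOP/s.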

Thanks for any comments.