Poor cgemm performance with CUDA 3.0

Has anyone else experienced abysmal cgemm performance with CUDA 3.0? We’ve been using CUDA 2.2 for a while now in our application and recently upgraded to try 3.0. In one of our unit tests we run a cgemm benchmark on a large rectangular matrix. With CUDA 2.2 we usually saw ~315 GFLOPS on a Tesla C1060; with 3.0 we now see something like 80 GFLOPS. Obviously something is wrong, but right now it’s a head-scratcher. BTW, we are running on CentOS 5.4 x86_64 with the 195.36.15 drivers.
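
For reference, the benchmark boils down to something like the sketch below (legacy CUBLAS API from the 2.x/3.x toolkits; the dimensions, alpha/beta, and the single timed call are placeholders rather than our exact test code):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas.h>        /* legacy CUBLAS API shipped with CUDA 2.x/3.x */

int main(void)
{
    /* Placeholder sizes -- the real test uses a large rectangular matrix. */
    int m = 4096, n = 4096, k = 4096;
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cuComplex *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(m * k, sizeof(cuComplex), (void **)&dA);
    cublasAlloc(k * n, sizeof(cuComplex), (void **)&dB);
    cublasAlloc(m * n, sizeof(cuComplex), (void **)&dC);
    /* Contents are left uninitialized; that is fine for a timing-only run. */

    /* Warm-up call so one-time setup is not included in the timing. */
    cublasCgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasCgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* A complex GEMM performs roughly 8*m*n*k floating point operations. */
    double gflops = 8.0 * (double)m * n * k / (ms * 1.0e6);
    printf("cgemm: %.2f ms, %.1f GFLOPS\n", ms, gflops);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```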

In the CUDA 3.0 release thread it was mentioned that there is a performance bug affecting 64-bit platforms (suboptimal code generation), which is going to be fixed in CUDA 3.1. Not sure if this is causing your troubles, but it’s definitely a hot candidate.

I wonder how bugs like this make it into such important major releases. Maybe a second beta would have been in order?

A hotfix wouldn’t be a bad idea, either. I am sticking with CUDA 2.3 for a while, even backported an updated SDK sample recently.

I should have mentioned that we are using the CUBLAS cgemm, not a custom one, so if it’s a performance bug, it affected the CUBLAS library. Interestingly enough, CUBLAS sgemm performance is still good. Sometimes I think cgemm gets no love.

It might be interesting to profile your application under both CUDA 2.3 and CUDA 3.0 if you can. Just about all of the CUBLAS functions contain several kernel versions, which are selected on the basis of problem size and data pitch to give optimal performance. It might well be that changes in design or tuning factors made to accommodate compute 2.0 cards mean that your cgemm calls now run on a different code path than before, one which is a lot slower on your hardware. The profiler can show you the kernels and execution parameters being used, and you might be able to spot the regression there.
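
If it helps, the command-line profiler can be enabled without rebuilding anything. A minimal sketch, assuming the CUDA_PROFILE / CUDA_PROFILE_LOG environment variables recognized by the 2.x/3.x toolkits (setting them in the shell before launching works just as well):

```c
#include <stdlib.h>
#include <cublas.h>   /* legacy CUBLAS API */

int main(void)
{
    /* Assumption: the CUDA 2.x/3.x command-line profiler picks these
     * variables up when the context is created, so they have to be set
     * before the first CUDA/CUBLAS call. Every kernel CUBLAS launches then
     * appears by name in the log together with its GPU time. */
    setenv("CUDA_PROFILE", "1", 1);
    setenv("CUDA_PROFILE_LOG", "cgemm_profile.log", 1);

    cublasInit();
    /* ... run the cgemm calls exactly as the application normally does ... */
    cublasShutdown();
    return 0;
}
```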

I must admit that I now have CUBLAS 3.0 on my development box, but I haven’t had time to benchmark it. Our production codes are still running on CUBLAS 2.3.

Maybe it would be an idea for NVIDIA to build a testing framework into which people can submit code they consider important. If the framework flagged performance that is lower than in the previous release, those tests could be run before a new version ships.
I know that there are other companies that offer this kind of support.
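
Even a simple throughput assertion around an existing benchmark would catch this kind of regression. A rough sketch (the baseline figure is the one quoted in this thread; the 20% tolerance is arbitrary):

```c
#include <stdio.h>

/* Hypothetical regression check: fail the unit test if measured cgemm
 * throughput falls well below the baseline recorded with the previous
 * toolkit. The 315 GFLOPS baseline is the figure quoted in this thread;
 * the 20% tolerance is arbitrary. */
int check_cgemm_regression(double measured_gflops)
{
    const double baseline_gflops = 315.0;  /* CUDA 2.2 result on a Tesla C1060 */
    const double tolerance       = 0.80;   /* allow at most a 20% slowdown */

    if (measured_gflops < baseline_gflops * tolerance) {
        fprintf(stderr, "cgemm regression: %.1f GFLOPS vs. baseline %.1f\n",
                measured_gflops, baseline_gflops);
        return 1;   /* test fails */
    }
    return 0;       /* test passes */
}
```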

Can you specify your matrix sizes (dimensions m, n, k) and the mode (transa, transb)?

I can’t remember exactly off the top of my head, but roughly m=32768, n=128, and k=512; transa and transb are both set to “N” (no transposes). As far as sizes go, I do know we made sure to keep each dimension a multiple of 64.
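
With the usual 8*m*n*k flop count, one such call is about 17.2 GFLOP of work, so roughly 55 ms at 315 GFLOPS versus roughly 215 ms at 80 GFLOPS. The call itself is essentially the following sketch (alpha/beta and the densely packed leading dimensions are illustrative, not necessarily what our application uses):

```c
#include <cuComplex.h>
#include <cublas.h>   /* legacy CUBLAS API */

/* The call with the dimensions quoted above: m=32768, n=128, k=512, no
 * transposes. dA, dB, dC are device buffers allocated elsewhere (e.g. with
 * cublasAlloc). */
void run_cgemm(cuComplex *dA, cuComplex *dB, cuComplex *dC)
{
    int m = 32768, n = 128, k = 512;
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    cublasCgemm('N', 'N', m, n, k,
                alpha, dA, m,    /* A is m x k, lda = m */
                       dB, k,    /* B is k x n, ldb = k */
                beta,  dC, m);   /* C is m x n, ldc = m */
}
```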

We were able to reproduce the issue. We are looking into it.

Thanks,

Phil.

Related: single-precision SYMM is suffering from the same issue on 64-bit platforms with CUBLAS 3.0 and G200-based GPUs.

Just thought I’d add more info, since I just hit this issue and didn’t find this forum post until after I had done some investigating. With 3.0, performance drops significantly when the dimensions are multiples of 16, whereas with 2.3 those are exactly the sizes where performance is best. 3.0 is also just slower on average. I’m attaching a graph comparing a system with 3.0 and a GTX 470 (blue) against a system with 2.3 and an 8800 (green). The matrices are (256 x m) * (m x 256); the x axis is m, the y axis is time in milliseconds.
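
The measurement is roughly equivalent to the following sketch (not my exact code; one call per size, no averaging, and alpha/beta plus the sweep range are illustrative):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas.h>        /* legacy CUBLAS API */

int main(void)
{
    const int n = 256;          /* fixed outer dimensions of the result */
    const int m_max = 2048;     /* upper end of the sweep (illustrative) */
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cuComplex *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * m_max, sizeof(cuComplex), (void **)&dA);
    cublasAlloc(m_max * n, sizeof(cuComplex), (void **)&dB);
    cublasAlloc(n * n,     sizeof(cuComplex), (void **)&dC);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int m = 1; m <= m_max; ++m) {
        cudaEventRecord(start, 0);
        /* (256 x m) * (m x 256): the swept value m is the inner dimension. */
        cublasCgemm('N', 'N', n, n, m, alpha, dA, n, dB, m, beta, dC, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d %.3f\n", m, ms);   /* x = m, y = time in milliseconds */
    }

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```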

The worst difference can be a factor of 4, maybe even more.

Any word on if this will be fixed in 3.1?