Poor cgemm performance with CUDA 3.0

Has anyone else experienced abysmal cgemm performance with CUDA 3.0? We’ve been using CUDA 2.2 for a while now in our application and recently upgraded to try 3.0. In one of our unit tests we run a cgemm benchmark on a large rectangular matrix. With CUDA 2.2 we usually saw ~315 GFLOPS on a Tesla C1060; with 3.0 we now see something like 80 GFLOPS. Obviously something is wrong, but right now it’s a head-scratcher. BTW, we are running on CentOS 5.4 x86_64 with the 195.36.15 drivers.
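
For reference, the benchmark boils down to something like the sketch below (legacy CUBLAS API from the 2.x/3.x toolkits; the dimensions, alpha/beta, and the single timed call are placeholders rather than our exact test code):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas.h>        /* legacy CUBLAS API shipped with CUDA 2.x/3.x */

int main(void)
{
    /* Placeholder sizes -- the real test uses a large rectangular matrix. */
    int m = 4096, n = 4096, k = 4096;
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cuComplex *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(m * k, sizeof(cuComplex), (void **)&dA);
    cublasAlloc(k * n, sizeof(cuComplex), (void **)&dB);
    cublasAlloc(m * n, sizeof(cuComplex), (void **)&dC);
    /* Contents are left uninitialized; that is fine for a timing-only run. */

    /* Warm-up call so one-time setup is not included in the timing. */
    cublasCgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasCgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* A complex GEMM performs roughly 8*m*n*k floating point operations. */
    double gflops = 8.0 * (double)m * n * k / (ms * 1.0e6);
    printf("cgemm: %.2f ms, %.1f GFLOPS\n", ms, gflops);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```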

In the CUDA 3.0 release thread it was mentioned that there is a performance bug affecting 64-bit platforms (suboptimal code generation), which is going to be fixed in CUDA 3.1. Not sure if this is causing your troubles, but it’s definitely a hot candidate.

I wonder how bugs like this make it into such important major releases. Maybe a second beta would have been in order?

A hotfix wouldn’t be a bad idea, either. I am sticking with CUDA 2.3 for a while, even backported an updated SDK sample recently.

I should have mentioned that we are using the CUBLAS cgemm, not a custom one, so if it’s a performance bug, it affected the CUBLAS library. Interestingly enough, CUBLAS sgemm performance is still good. Sometimes I think cgemm gets no love.

It might be interesting to profile your application under both CUDA 2.3 and CUDA 3.0 if you can. Just about all of the CUBLAS functions contain several kernel versions, which are selected on the basis of problem size and data pitch to give optimal performance. It might well be that changes in design or tuning factors made to accommodate compute 2.0 cards mean that your cgemm calls now run on a different code path than before, one which is a lot slower on your hardware. The profiler can show you the kernels and execution parameters being used, and you might be able to spot the regression there.
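
If it helps, the command-line profiler can be enabled without rebuilding anything. A minimal sketch, assuming the CUDA_PROFILE / CUDA_PROFILE_LOG environment variables recognized by the 2.x/3.x toolkits (setting them in the shell before launching works just as well):

```c
#include <stdlib.h>
#include <cublas.h>   /* legacy CUBLAS API */

int main(void)
{
    /* Assumption: the CUDA 2.x/3.x command-line profiler picks these
     * variables up when the context is created, so they have to be set
     * before the first CUDA/CUBLAS call. Every kernel CUBLAS launches then
     * appears by name in the log together with its GPU time. */
    setenv("CUDA_PROFILE", "1", 1);
    setenv("CUDA_PROFILE_LOG", "cgemm_profile.log", 1);

    cublasInit();
    /* ... run the cgemm calls exactly as the application normally does ... */
    cublasShutdown();
    return 0;
}
```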

I must admit that I now have CUBLAS 3.0 on my development box, but I haven’t had time to benchmark it. Our production codes are still running on CUBLAS 2.3.

Maybe it would be an idea for NVIDIA to build a testing framework into which people can submit code they consider important. If the framework flagged performance that is lower than in the previous release, those tests could be run before a new version ships.
I know that there are other companies that offer this kind of support.
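
Even a simple throughput assertion around an existing benchmark would catch this kind of regression. A rough sketch (the baseline figure is the one quoted in this thread; the 20% tolerance is arbitrary):

```c
#include <stdio.h>

/* Hypothetical regression check: fail the unit test if measured cgemm
 * throughput falls well below the baseline recorded with the previous
 * toolkit. The 315 GFLOPS baseline is the figure quoted in this thread;
 * the 20% tolerance is arbitrary. */
int check_cgemm_regression(double measured_gflops)
{
    const double baseline_gflops = 315.0;  /* CUDA 2.2 result on a Tesla C1060 */
    const double tolerance       = 0.80;   /* allow at most a 20% slowdown */

    if (measured_gflops < baseline_gflops * tolerance) {
        fprintf(stderr, "cgemm regression: %.1f GFLOPS vs. baseline %.1f\n",
                measured_gflops, baseline_gflops);
        return 1;   /* test fails */
    }
    return 0;       /* test passes */
}
```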

Can you specify your matrix sizes (dimensions m, n, k) and the mode (transa, transb)?

I can’t remember exactly off the top of my head, but roughly m=32768, n=128, and k=512; transa and transb are both set to “N” (no transposes). As far as sizes go, I do know we made sure to keep each dimension a multiple of 64.
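
With the usual 8*m*n*k flop count, one such call is about 17.2 GFLOP of work, so roughly 55 ms at 315 GFLOPS versus roughly 215 ms at 80 GFLOPS. The call itself is essentially the following sketch (alpha/beta and the densely packed leading dimensions are illustrative, not necessarily what our application uses):

```c
#include <cuComplex.h>
#include <cublas.h>   /* legacy CUBLAS API */

/* The call with the dimensions quoted above: m=32768, n=128, k=512, no
 * transposes. dA, dB, dC are device buffers allocated elsewhere (e.g. with
 * cublasAlloc). */
void run_cgemm(cuComplex *dA, cuComplex *dB, cuComplex *dC)
{
    int m = 32768, n = 128, k = 512;
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    cublasCgemm('N', 'N', m, n, k,
                alpha, dA, m,    /* A is m x k, lda = m */
                       dB, k,    /* B is k x n, ldb = k */
                beta,  dC, m);   /* C is m x n, ldc = m */
}
```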

We were able to reproduce the issue. We are looking into it.

Thanks,

Phil.

Related: single-precision SYMM is suffering from the same issue on 64-bit platforms with CUBLAS 3.0 and G200-based GPUs.

Just thought I’d add more info, since I just hit this issue and didn’t find this forum post until after I had done some investigating. With 3.0, performance drops significantly when the dimensions are multiples of 16, whereas with 2.3 those are exactly the sizes where performance is best. 3.0 is also just slower on average. I’m attaching a graph comparing a system with 3.0 and a GTX 470 (blue) against a system with 2.3 and an 8800 (green). The matrices are (256 x m) * (m x 256); the x axis is m, the y axis is time in milliseconds.
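
The measurement is roughly equivalent to the following sketch (not my exact code; one call per size, no averaging, and alpha/beta plus the sweep range are illustrative):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas.h>        /* legacy CUBLAS API */

int main(void)
{
    const int n = 256;          /* fixed outer dimensions of the result */
    const int m_max = 2048;     /* upper end of the sweep (illustrative) */
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cuComplex *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * m_max, sizeof(cuComplex), (void **)&dA);
    cublasAlloc(m_max * n, sizeof(cuComplex), (void **)&dB);
    cublasAlloc(n * n,     sizeof(cuComplex), (void **)&dC);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int m = 1; m <= m_max; ++m) {
        cudaEventRecord(start, 0);
        /* (256 x m) * (m x 256): the swept value m is the inner dimension. */
        cublasCgemm('N', 'N', n, n, m, alpha, dA, n, dB, m, beta, dC, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d %.3f\n", m, ms);   /* x = m, y = time in milliseconds */
    }

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```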

The worst difference can be a factor of 4, maybe even more.

Any word on if this will be fixed in 3.1?