Dear all,
Now I can edit the assembly code for the rank-1 update myself. I rewrote the assembly of Volkov's code so that "reg ← [smem]" and "MAD dest, src1, src2, src3" are interleaved one-for-one, but the performance is worse than the original Volkov code.
However, the pattern where one "reg ← [smem]" is interleaved with two "MAD dest, src1, src2, src3" instructions is amazing. I used this pattern in CGEMM (single-precision complex matrix multiplication), and the performance is incredible.
On a Tesla C1060, compared to CUDA 2.3:
my method: 445.7 Gflop/s
CUDA 2.3: 227.7 Gflop/s
So far, I have no explanation for why this pattern works, but it indeed speeds up both CGEMM and SGEMM.
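For readers trying to picture the pattern: here is a minimal, hypothetical CUDA sketch of an inner loop arranged so that each shared-memory read feeds two MADs. All names, tile sizes, and the omitted load code are illustrative assumptions, not the actual posted code:

```cuda
// Hypothetical sketch of the instruction pattern only: each thread keeps two
// rows' worth of A in registers and accumulates two C elements, so every
// shared-memory load of B ("reg <- [smem]") is followed by two MADs.
__global__ void rank1_update_sketch(float *C)
{
    __shared__ float bs[16][17];       // tile of B, padded to avoid bank conflicts
    float a0[16], a1[16];              // two rows of A, prefetched into registers
    float c0 = 0.0f, c1 = 0.0f;
    // ... global-memory loads filling bs, a0[], a1[] omitted ...
    __syncthreads();
#pragma unroll
    for (int k = 0; k < 16; ++k) {
        float b = bs[k][threadIdx.x];  // one "reg <- [smem]" load ...
        c0 += a0[k] * b;               // ... amortized over two MAD
        c1 += a1[k] * b;               //     instructions per iteration
    }
    C[threadIdx.x]      = c0;
    C[threadIdx.x + 16] = c1;
}
```

After full unrolling, a0[] and a1[] stay in registers, so the loop body issues exactly one shared-memory read per two MADs, which is the ratio being discussed in this thread.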
From the copyright information in the CUDA CUBLAS Library 2.x, it seems that only SGEMM and DGEMM use Volkov's code, while CGEMM does not. I think the CUBLAS CGEMM code has perhaps not been heavily optimized yet. But after all, the performance you achieved is great.
If you can achieve over 66% of peak performance with your code, then obviously there is no explanation other than dual issue. As you mentioned, there is one shared-memory read for every two MADs. Without dual issue, you could not exceed 2/3 of peak performance (counting only MADs), and you did. I am curious too, as dual issue is somewhat of a myth. Maybe we need some professional answers.
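The arithmetic behind this argument can be checked directly. Assuming the commonly quoted MAD-only single-precision peak of the Tesla C1060 (240 SPs × 1.296 GHz × 2 flops per MAD, i.e. excluding the SFU path):

```python
# Check the dual-issue argument against the 445.7 Gflop/s figure reported above.
sps = 240                        # streaming processors on a Tesla C1060
clock_ghz = 1.296                # shader clock
mad_peak = sps * clock_ghz * 2   # MAD-only peak, ~622.1 Gflop/s
achieved = 445.7                 # CGEMM result reported in this thread

fraction = achieved / mad_peak
print(f"MAD-only peak: {mad_peak:.1f} Gflop/s")
print(f"fraction of peak: {fraction:.1%}")

# With one shared-memory read per two MADs and no dual issue, at most 2/3 of
# issue slots are MADs, so 2/3 of the MAD-only peak is the ceiling.
assert fraction > 2 / 3
```

Since 445.7 Gflop/s is about 71.6% of the MAD-only peak, it exceeds the 2/3 single-issue ceiling, which is exactly why dual issue is the only remaining explanation.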
Thanks! But it still does not work. I am using Ubuntu, and I don't know whether it's OS-related. Anyway, I manually modified my code to an extent similar to your method1_variant, and for N=4096 I got 485 GFLOPS on a GTX 285, compared to 425 GFLOPS with CUBLAS.
I also tried one shared-memory read per 4 MADs, and it's a little slower than CUBLAS. According to the assembly generated by decuda, it is bound by the low percentage of MAD instructions. I wonder: besides dual issue or overlapping shared-memory reads with MADs, is Volkov's code or CUBLAS the best implementation?
2. I mean that, theoretically, having each thread compute 16x4 elements should deliver better performance without dual issue. If we had more registers per SM, we could achieve that. Fermi seems to have twice as many registers as GTX 200, but I don't have a card.
This looks great. A co-worker and I have an application that could really benefit from this speed. BUT we're both Linux guys and need a bit of assistance with how to use/compile this package. If anybody could throw up a makefile or jot down some instructions on how to call one of these routines from another .cu file, that'd be great.
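In case it helps: a minimal Makefile sketch for Linux. The file names are hypothetical (main.cu for your driver code, sgemm.cu for the package's kernel source); substitute whatever the package actually ships:

```makefile
# Minimal Linux build sketch -- file names are placeholders, not the
# package's actual layout.
NVCC := nvcc
ARCH := -arch=sm_13   # Tesla C1060 / GTX 285; change for other cards

all: test_sgemm

test_sgemm: main.cu sgemm.cu
	$(NVCC) $(ARCH) -O3 -o $@ main.cu sgemm.cu -lcublas

clean:
	rm -f test_sgemm
```

To call a routine from another .cu file, declare its prototype there (or include the package's header if one is provided) and let nvcc compile and link the two sources together as above.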
@LSChien: Thanks for the information. Upon further reading I see that you had already written that down somewhere; sorry I didn't read more carefully before. You are right about the cubin files, though. I'm using CUDA 3.1 and a GTX 480, and the cubin files won't load, which I guess isn't surprising. I will try it out on another machine with CUDA 2.3 and a C1060 card.
If at all possible, could you provide cubin files for compute capability 2.0? I had no trouble compiling with CUDA 3.1, so I'm curious whether compiling with 3.1 and specifying -arch=sm_20 would solve this problem. I'd be happy to post GTX 480 benchmarks if we could get that running.
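For what it's worth, regenerating the cubin for compute capability 2.0 with the CUDA 3.1 toolchain should just be the following (untested guess; sgemm.cu stands in for the package's actual source file name):

```shell
# Untested sketch: build a compute-capability-2.0 cubin with CUDA 3.1.
# "sgemm.cu" is a placeholder for the package's real source file.
nvcc -cubin -arch=sm_20 -O3 sgemm.cu -o sgemm_sm20.cubin
```

Whether the kernel then actually runs well on Fermi is a separate question, since the code was tuned for GT200 scheduling.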
Thanks for the info. I’ll give it a try. Even without Fermi capability the code is great. My production code has ~3X speed-up from using your cgemm. Thanks for providing it.
Hi,
I'm interested in these fast SGEMM implementations, but I'm mainly working on the iPhone, and its GPU (PowerVR SGX) supports neither CUDA nor OpenCL, though it does support shaders through OpenGL ES 2.0.
Does any of you know a way to easily convert this CUDA/OpenCL SGEMM code into GLSL so that I could use it on the iPhone?
Thanks !