Hand-Tuned SGEMM on GT200 GPU: 10% ~ 20% improvement of SGEMM

Dear all,
now I can edit the assembly code for the rank-1 update myself, so I wrote an assembly version of Volkov's code
in which "reg ← [smem]" and "MAD dest, src1, src2, src3" are interleaved, but the performance is worse
than the original Volkov code.

However, the pattern in which one "reg ← [smem]" is interleaved with two "MAD dest, src1, src2, src3" instructions is amazing.
I used this pattern in CGEMM (single-precision complex matrix multiplication), and the performance is incredible.

On a Tesla C1060, compared to CUBLAS in CUDA 2.3:
my method: 445.7 Gflop/s
CUBLAS 2.3: 227.7 Gflop/s

So far, I have no idea why this pattern works, but it indeed speeds up both CGEMM and SGEMM.
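To make the pattern concrete, here is a minimal CUDA-level sketch of a rank-1 update inner loop in which every shared-memory read of B feeds exactly two MADs. All names are hypothetical, and in the real code the "load, MAD, MAD" schedule is enforced at the assembly level with decuda/cudasm, not by nvcc:

// Minimal sketch (hypothetical names): each thread holds two values of A
// in registers and two accumulators, so one shared-memory read of B
// feeds exactly two MADs. nvcc does not guarantee this schedule; the
// real code enforces it by hand-editing the decuda output.
__global__ void rank1_pattern_sketch(const float *A, float *C, int lda)
{
    __shared__ float bs[16];            // one column of the B tile
    int tx = threadIdx.x;

    if (tx < 16) bs[tx] = 1.0f;         // stand-in for the real B tile load
    __syncthreads();

    float a0 = A[tx];                   // A operands stay in registers
    float a1 = A[tx + lda];
    float c0 = 0.0f, c1 = 0.0f;

    #pragma unroll
    for (int j = 0; j < 16; ++j) {
        float b = bs[j];                // reg <- [smem]  (one load ...)
        c0 += a0 * b;                   // MAD            (... two MADs)
        c1 += a1 * b;                   // MAD
    }
    C[tx]       = c0;
    C[tx + lda] = c1;
}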

That sounds very promising. Did you manage to get inline assembly working, or are you still hand-patching the compiler output?

So far, I do the following five steps:

step 1: use nvcc to obtain nvcc.cubin

step 2: use decuda to obtain a .asm file

step 3: use cudasm to reassemble the .asm into cudasm.cubin,

use the command "diff" to compare nvcc.cubin and cudasm.cubin, and then

check which parts of the difference could generate errors.

(errors occur at "[smem] ← reg" stores)

step 4: cut out the "c = a*b" part of the .asm (the whole .asm is divided into three parts)

and write a program to generate the desired assembly code (one can design different patterns here; see the sketch after step 5).

Then combine the two unchanged parts and the generated part into one .asm.

Remark: one should identify register IDs with the variables in the program in order to keep track of

the set of free registers.

step 5: use cudasm to assemble the modified .asm into a .cubin, and patch the differing parts found in step 3.
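As a rough illustration of step 4, the generator can be as simple as a loop that prints the interleaved, decuda-style instruction pattern. The register IDs, shared-memory offsets, and iteration count below are hypothetical placeholders; they must be chosen from the free registers identified in the remark above:

// Hypothetical sketch of a step-4 pattern generator: it emits a
// decuda-style "load, MAD, MAD" sequence for the rank-1 update.
// Register IDs and shared-memory offsets are placeholders only.
#include <cstdio>

int main()
{
    const int kIters = 16; // length of the rank-1 update loop
    for (int j = 0; j < kIters; ++j) {
        // one shared-memory read ...
        std::printf("mov.b32 $r30, s[0x%04x]\n", 0x40 + 4 * j);
        // ... feeding two MADs into two accumulators
        std::printf("mad.rn.f32 $r0, $r16, $r30, $r0\n");
        std::printf("mad.rn.f32 $r1, $r17, $r30, $r1\n");
    }
    return 0;
}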

So far, inline assembly code in a PTX file would be destroyed by nvcc, because the compiler does its own optimization.

I think that if NVIDIA delivered an official decuda/cudasm, then we could use inline assembly to achieve what I do here.

From the copyright information of the CUDA CUBLAS Library 2.x, it seems that only SGEMM and DGEMM use Volkov's code, while CGEMM does not. I think the CUBLAS CGEMM code has perhaps not been heavily optimized yet. But after all, the performance you achieved is great.

I agree with you.

In fact, I modified Volkov's code for CGEMM so that the compiler (nvcc) avoids using local memory.

However, I kept the sub-block configuration of Volkov's code.

What I did is just edit a different pattern in the rank-1 update.
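For reference, here is what a complex rank-1 update looks like at the CUDA level: each complex value of B read from shared memory (two 32-bit words) feeds four real MADs per accumulator, so the MAD density is naturally high in CGEMM. This is only a minimal sketch with hypothetical names, not the actual modified kernel:

// Minimal sketch (hypothetical names): a complex multiply-add expands
// into four real MADs, so each shared-memory read of B is followed by
// several MADs in the CGEMM rank-1 update.
__global__ void cgemm_rank1_sketch(const float2 *A, float2 *C)
{
    __shared__ float2 bs[16];
    int tx = threadIdx.x;

    if (tx < 16) bs[tx] = make_float2(1.0f, 0.0f); // stand-in for B tile
    __syncthreads();

    float2 a = A[tx];
    float2 c = make_float2(0.0f, 0.0f);

    #pragma unroll
    for (int j = 0; j < 16; ++j) {
        float2 b = bs[j];    // reg <- [smem], two 32-bit words
        c.x += a.x * b.x;    // four real MADs per complex update
        c.x -= a.y * b.y;
        c.y += a.x * b.y;
        c.y += a.y * b.x;
    }
    C[tx] = c;
}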

Volkov's code is better than CUBLAS 2.3:

CUBLAS 2.3: 277.7 Gflop/s

Volkov's code: 348.5 Gflop/s

I think Volkov's code has the potential to reach 500 Gflop/s, but I cannot do that so far.

If you can achieve over 66% of the peak performance with your code, then obviously there is no explanation other than dual issue. As you mentioned, there is one shared-memory read for every two MADs. Without dual issue, you cannot achieve over 2/3 of the peak performance (counting only MADs), and you did. I am curious too, as dual issue is somewhat of a myth. Maybe we need some professional answers.
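To make the arithmetic explicit (a back-of-the-envelope check, assuming the Tesla C1060's 1.296 GHz shader clock and counting only the SP units' MAD throughput):

$$
P_{\mathrm{MAD}} = 240 \times 2\,\tfrac{\mathrm{flop}}{\mathrm{MAD}} \times 1.296\,\mathrm{GHz} \approx 622\ \mathrm{Gflop/s},
\qquad
\tfrac{2}{3}\,P_{\mathrm{MAD}} \approx 415\ \mathrm{Gflop/s} < 445.7\ \mathrm{Gflop/s}
$$

so the reported 445.7 Gflop/s already exceeds what MAD issue alone could deliver at one shared-memory read per two MADs.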

I downloaded your code, but cannot get it running.

  1. check_sgemm_suqare.cpp is missing. Anyway, I copied it from check_general_sgemm_suqare.cpp and changed the function name in it, which seemed fine.
  2. The CUBLAS test is fine, but none of the other versions run; there is a launch failure.
    device 0, device name = GeForce GTX 285

profile C = A*B on square matrices

Succ: load module method1/method1.cubin
Succ: load kernel method1_sgemmNN
cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <profile_sgemm_suqare.cpp>, line 232 : unspecified launch failure.

Line 232 is cudaThreadSynchronize() right after kernel launch.
Did I do something wrong?

check_sgemm_suqare.cpp is of no use in the final version; I am sorry that a typing error occurs in figure 10 of ReadMe.pdf.

(check_sgemm_suqare.cpp and check_general_sgemm_suqare.cpp differ only in the device allocation sequence:

check_sgemm_suqare.cpp: allocate A first, then B, finally C.

check_general_sgemm_suqare.cpp: allocate C first, then A, finally B.)
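Why would the allocation order matter? An access a few elements past the end of one array may silently land inside the next allocation in one order, but hit unallocated memory (an "unspecified launch failure") in another, so the order can change which out-of-bound runs crash. A minimal sketch of the two orders (the sizes are made up for illustration):

#include <cuda_runtime.h>

// Hypothetical sketch: the same three buffers allocated in the two
// orders used by the check programs. Out-of-bound accesses past d_A
// may behave differently depending on what happens to follow it.
int main()
{
    size_t bytes = 5 * 5 * sizeof(float); // the n = 5 test case
    float *d_A, *d_B, *d_C;

    // check_sgemm_suqare.cpp order: A, then B, then C
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaFree(d_C); cudaFree(d_B); cudaFree(d_A);

    // check_general_sgemm_suqare.cpp order: C, then A, then B
    cudaMalloc((void**)&d_C, bytes);
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaFree(d_B); cudaFree(d_A); cudaFree(d_C);
    return 0;
}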

Please note that methods 1 ~ 4 do not consider out-of-array-bound accesses.

Suppose we use the following code to test method 1:

int main(int argc, char** argv)
{
    int device_num = 0; // GTX 285
    initContext_Drv(device_num); // initialize the context
    profile_sgemm_square("../method1/method1.cubin", "method1_sgemmNN",
        &method1_DrvWrapper, cat(OUTPUT_DIR, "method1/threads320.txt"));
    return 0;
}

Then I re-tested on my machine (Vista 64 + GTX 285), and it works.

However, method 1 should not work at n = 5, since an out-of-array-bound access occurs.

It happens to work on my machine but may not work on yours.

Step 1: I want to make sure that your error comes from the out-of-array-bound access at n = 5.

Please modify the for-loop in function "profile_sgemm_square" of profile_sgemm_suqare.cpp from

for( n1 = 5; n1 <= 4096; n1++ ){
    profile_sgemm_square_unit(sgemmWrapper, sgemm,
        n1, n1, n1, alpha, beta,
        gpuTime, flop_rate, rel_max_err );

to

for( n1 = 64; n1 <= 4096; n1 += 64 ){
    profile_sgemm_square_unit(sgemmWrapper, sgemm,
        n1, n1, n1, alpha, beta,
        gpuTime, flop_rate, rel_max_err );

If stride 64 does not work, please use stride 128.

(Stride 128 must work if the error is purely an out-of-array-bound access.)

Step 2: suppose the error comes from the out-of-array-bound access and you want to test methods 1 ~ 4 on any dimension.

I suggest that you allocate bigger device memory d_A, d_B, d_C in function "profile_sgemm_square_unit"

of profile_sgemm_suqare.cpp, for example:

unsigned int lda_64 = 64 * ((lda + 63)/64); // lda_64 is a multiple of 64
unsigned int ldb_16 = 16 * ((ldb + 15)/16); // ldb_16 is a multiple of 16
unsigned int ldc_64 = 64 * ((ldc + 63)/64); // ldc_64 is a multiple of 64
unsigned int k_16   = 16 * ((k + 15)/16);   // k_16 is a multiple of 16
unsigned int n_16   = 16 * ((n + 15)/16);   // n_16 is a multiple of 16

unsigned int size_A = lda_64 * k_16;
unsigned int mem_size_A = sizeof(float) * size_A;
h_A = (float*) malloc(mem_size_A); assert( h_A );

unsigned int size_B = ldb_16 * n_16;
unsigned int mem_size_B = sizeof(float) * size_B;
h_B = (float*) malloc(mem_size_B); assert( h_B );

// allocate host memory for the result
unsigned int size_C = ldc_64 * n_16;
unsigned int mem_size_C = sizeof(float) * size_C;
h_C = (float*) malloc(mem_size_C); assert( h_C );
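The snippet above pads only the host buffers; the device buffers d_A, d_B, d_C need the same padded sizes so that the out-of-bound reads of methods 1 ~ 4 stay inside allocated memory. A minimal sketch continuing from the sizes computed above (shown with the driver API; substitute cudaMalloc if the function uses the runtime API):

// Device allocations padded the same way as the host buffers
// (mem_size_A/B/C are the padded sizes computed above).
CUdeviceptr d_A, d_B, d_C;
cuMemAlloc(&d_A, mem_size_A);
cuMemAlloc(&d_B, mem_size_B);
cuMemAlloc(&d_C, mem_size_C);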

Step 3: methods 5 ~ 8 take care of out-of-array-bound accesses; you should use

function "profile_general_sgemm_square" to test them.

If an error occurs in methods 5 ~ 8, please tell me.

Thanks.

Thanks! But it still does not work. I am using Ubuntu, and I don't know whether it's OS-related. Anyway, I manually modified my code to an extent similar to your method1_variant, and for N = 4096 I got 485 GFLOPS on a GTX 285, compared to 425 GFLOPS with CUBLAS.
I tried one shared-memory read for four MADs, too, and it is a little slower than CUBLAS. According to the assembly generated by decuda, it is bounded by a low percentage of MAD instructions. I wonder: besides dual issue and parallel shared-memory/MAD execution, is Volkov's code (or CUBLAS) the best possible implementation?

My code was developed on a 64-bit machine; do you use a 32-bit OS?

If so, my code does not work, since the host code is compiled as 32-bit and the size of a pointer is 32 bits,

but the CUDA binary file assumes that the size of a pointer is 64 bits.

I have tested my code on

(1) Fedora 10 x86_64, driver 190.18, CUDA 2.3

(2) Vista 64 and WinXP 64, driver 190.38, CUDA 2.3

If your system is 64-bit, would you please describe the version of the OS, driver, and CUDA library you used?

I may try to find a machine to install your configuration and then test the code.

Second, the best case in SGEMM would be pure "MAD dest, src1, src2, src3" instructions without any data movement from shared memory to registers.

However, matrix B must be shared by all threads, so the movement from shared memory to registers cannot be avoided.

I think Volkov's idea reaches the optimum.

1. I am using a 32-bit OS, and I think that's why.

2. I mean that, theoretically, having each thread compute 16x4 elements should deliver better performance even without dual issue. If we had more registers per SM, we could achieve that. Fermi seems to have twice the number of registers of GT200, but I don't have a card.
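A quick register count shows the pressure (assuming the documented 16384 32-bit registers per SM on GT200 and 32768 on Fermi): a 16x4 accumulator block alone needs

$$
16 \times 4 = 64\ \text{registers/thread},
\qquad
\frac{16384}{64} = 256\ \text{threads/SM on GT200},
\qquad
\frac{32768}{64} = 512\ \text{threads/SM on Fermi}
$$

before counting any temporaries, so doubling the register file roughly doubles the occupancy one can sustain at that blocking.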

I think one work-around may work; you can try it.

In method1_DrvWrapper.cpp, you can extend the 32-bit pointer to a 64-bit pointer by zero extension.

That is, for example, modify

// parameter 3: const float *A
ptr = (void*)(size_t)A;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hfunc, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);

to

// parameter 3: const float *A
// zero extension: on a little-endian host the low 32-bit word (the
// pointer itself) must come first in memory, followed by a zero high word
void* ptr_array[2];
ptr_array[0] = (void*)(size_t)A; // low 32 bits: the pointer
ptr_array[1] = (void*)NULL;      // high 32 bits: zero
ALIGN_UP(offset, __alignof(ptr_array));
cuParamSetv(hfunc, offset, ptr_array, sizeof(ptr_array));
offset += sizeof(ptr_array);

and you can do the same work-around for "float *B" and "float *C".

You can try this work-around.

Yes, you are right.

Could you please check the link provided? I'm unable to read the paper…

Hi guys,

This looks great. A co-worker and I have an application where we could really benefit from this speed. BUT we're both Linux guys and are in need of a bit of assistance in how to use/compile this package. If anybody could throw up a makefile or jot down some instructions on how to call one of these routines from another .cu file, that'd be great.

Thanks.

My CUDA code is compiled to a binary file (.cubin), so you don't need to use nvcc to compile a .cu file.

Just compile the .cpp files with gcc or icpc, and put the .cubin in the same directory so that

the program can load the .cubin at runtime.
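If you want to call one of the kernels from your own program, a minimal driver-API sketch looks like this (the module and kernel names are the ones from method 1 earlier in the thread; error checking is omitted for brevity):

#include <cuda.h>
#include <cstdio>

// Minimal sketch: load a precompiled .cubin at runtime and fetch one
// kernel with the CUDA driver API (CUDA 2.3 style). Parameter setup
// and the actual launch are omitted.
int main()
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "method1.cubin");
    cuModuleGetFunction(&kernel, mod, "method1_sgemmNN");
    std::printf("loaded kernel method1_sgemmNN\n");

    // ... set arguments with cuParamSetv, set the block shape with
    // cuFuncSetBlockShape, then launch with cuLaunchGrid ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}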

However, this code was built under CUDA 2.3.

CUDA 3.0 may change the binary format; I have no idea whether my .cubin works with CUDA 3.0 or not.

I am sorry that I have not tested it.

@LSChien: Thanks for the information. Upon further reading I see that you already had that written down somewhere. Sorry I didn't read more carefully before. You are right, though, about the cubin files. I'm using CUDA 3.1 and a GTX 480. The cubin files won't load, which I guess isn't surprising. I will try it out on another machine with CUDA 2.3 and a C1060 card.

If at all possible, could you provide cubin files for compute capability 2.0? I had no trouble compiling with CUDA 3.1, so I'm curious whether compiling with 3.1 and specifying -arch=sm_20 would solve this problem. I'd be happy to post GTX 480 benchmarks if we could get that running.

Thanks again. Really great work.

Incredible, so you changed the generated microcode using cudasm/decuda to speed up the SGEMM functions. Awesome work!

In fact, my approach depends on the power of decuda/cudasm from http://wiki.github.com/laanwj/decuda/.

Since decuda/cudasm does not support sm_20, I cannot write the assembly code myself.

I think you have two ways:

  1. MAGMA (http://icl.cs.utk.edu/magma/index.html) has released a new version that includes SGEMM.

According to its documentation,

the two main techniques distinguishing the MAGMA BLAS library are

(1) auto-tuners

(2) pointer redirecting

so MAGMA should provide good and uniform performance on SGEMM.

You can try it.

  2. You can also use my code, but for now it only works for C = alpha*A*B + beta*C.

Unless NVIDIA releases its own decuda, we cannot write assembly code for CUDA 3.0.

I would like to use nvcc to compile my code, and I hope the compiler can solve

the "volatile" issue, which is not solved in CUDA 2.3.

My suggestion: you can use method6_variant. In method6_variant.cu,

modify the declaration at line 156 from

float  b_reg;

to

volatile float  b_reg;

and the statement at line 188 from

b_reg = b_ptr[j] * 4.0f;

to

b_reg = b_ptr[j];

Then you can compile the .cu into a cubin file and use the driver API, or

just call the kernel function in your .cpp file.

The work I did was to make sure that "volatile" is indeed volatile.

If nvcc in CUDA 3.0 can do that, there is no need to write assembly code manually.

However, I have no idea whether nvcc can do that or not, because we have no decuda for CUDA 3.0.

If you decide to use my code and cannot make it work, tell me and I will make it work.

Thanks for the info. I'll give it a try. Even without Fermi capability the code is great. My production code gets a ~3X speed-up from using your cgemm. Thanks for providing it.

Hi,
I'm interested in these fast SGEMM implementations, but I'm mainly working on the iPhone, and its GPU (PowerVR's SGX) supports neither CUDA nor OpenCL, though it does support shaders via OpenGL ES 2.0.
Does anyone know a way to easily convert this CUDA/OpenCL SGEMM code into GLSL so that I could use it on the iPhone?
Thanks!