Hand-Tuned SGEMM on GT200 GPU: 10% ~ 20% improvement of SGEMM

Dear all,
now I can edit the assembly code for the rank-1 update myself, so I wrote an assembly version of Volkov's code
in which "reg ← [smem]" and "MAD dest, src1, src2, src3" are interleaved, but the performance is worse
than the original Volkov code.

However, the pattern in which one "reg ← [smem]" is interleaved with two "MAD dest, src1, src2, src3" instructions is amazing.
I used this pattern in CGEMM (single-precision complex matrix multiplication), and the performance is incredible.

On a Tesla C1060, compared to CUBLAS in CUDA 2.3:
my method: 445.7 Gflop/s
CUBLAS 2.3: 227.7 Gflop/s

So far, I have no idea why this pattern works, but it indeed speeds up both CGEMM and SGEMM.
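To make the pattern concrete, here is a minimal CUDA-level sketch of a rank-1 update inner loop in which every shared-memory read of B feeds exactly two MADs. All names are hypothetical, and in the real code the "load, MAD, MAD" schedule is enforced at the assembly level with decuda/cudasm, not by nvcc:

// Minimal sketch (hypothetical names): each thread holds two values of A
// in registers and two accumulators, so one shared-memory read of B
// feeds exactly two MADs. nvcc does not guarantee this schedule; the
// real code enforces it by hand-editing the decuda output.
__global__ void rank1_pattern_sketch(const float *A, float *C, int lda)
{
    __shared__ float bs[16];            // one column of the B tile
    int tx = threadIdx.x;

    if (tx < 16) bs[tx] = 1.0f;         // stand-in for the real B tile load
    __syncthreads();

    float a0 = A[tx];                   // A operands stay in registers
    float a1 = A[tx + lda];
    float c0 = 0.0f, c1 = 0.0f;

    #pragma unroll
    for (int j = 0; j < 16; ++j) {
        float b = bs[j];                // reg <- [smem]  (one load ...)
        c0 += a0 * b;                   // MAD            (... two MADs)
        c1 += a1 * b;                   // MAD
    }
    C[tx]       = c0;
    C[tx + lda] = c1;
}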

That sounds very promising. Did you manage to get inline assembly working, or are you still hand-patching the compiler output?

So far, I do the following five steps:

step 1: use nvcc to obtain nvcc.cubin

step 2: use decuda to obtain a .asm file

step 3: use cudasm to reassemble the .asm into cudasm.cubin,

use the command "diff" to compare nvcc.cubin and cudasm.cubin, and then

check which parts of the difference could generate errors.

(errors occur at "[smem] ← reg" stores)

step 4: cut out the "c = a*b" part of the .asm (the whole .asm is divided into three parts)

and write a program to generate the desired assembly code (one can design different patterns here; see the sketch after step 5).

Then combine the two unchanged parts and the generated part into one .asm.

Remark: one should identify register IDs with the variables in the program in order to keep track of

the set of free registers.

step 5: use cudasm to assemble the modified .asm into a .cubin, and patch the differing parts found in step 3.
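As a rough illustration of step 4, the generator can be as simple as a loop that prints the interleaved, decuda-style instruction pattern. The register IDs, shared-memory offsets, and iteration count below are hypothetical placeholders; they must be chosen from the free registers identified in the remark above:

// Hypothetical sketch of a step-4 pattern generator: it emits a
// decuda-style "load, MAD, MAD" sequence for the rank-1 update.
// Register IDs and shared-memory offsets are placeholders only.
#include <cstdio>

int main()
{
    const int kIters = 16; // length of the rank-1 update loop
    for (int j = 0; j < kIters; ++j) {
        // one shared-memory read ...
        std::printf("mov.b32 $r30, s[0x%04x]\n", 0x40 + 4 * j);
        // ... feeding two MADs into two accumulators
        std::printf("mad.rn.f32 $r0, $r16, $r30, $r0\n");
        std::printf("mad.rn.f32 $r1, $r17, $r30, $r1\n");
    }
    return 0;
}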

So far, inline assembly code in a PTX file would be destroyed by nvcc, because the compiler does its own optimization.

I think that if NVIDIA delivered an official decuda/cudasm, then we could use inline assembly to achieve what I do here.

From the copyright information of the CUDA CUBLAS Library 2.x, it seems that only SGEMM and DGEMM use Volkov's code, while CGEMM does not. I think the CUBLAS CGEMM code has perhaps not been heavily optimized yet. But after all, the performance you achieved is great.

I agree with you.

In fact, I modified Volkov's code for CGEMM so that the compiler (nvcc) avoids using local memory.

However, I kept the sub-block configuration of Volkov's code.

What I did is just edit a different pattern in the rank-1 update.
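For reference, here is what a complex rank-1 update looks like at the CUDA level: each complex value of B read from shared memory (two 32-bit words) feeds four real MADs per accumulator, so the MAD density is naturally high in CGEMM. This is only a minimal sketch with hypothetical names, not the actual modified kernel:

// Minimal sketch (hypothetical names): a complex multiply-add expands
// into four real MADs, so each shared-memory read of B is followed by
// several MADs in the CGEMM rank-1 update.
__global__ void cgemm_rank1_sketch(const float2 *A, float2 *C)
{
    __shared__ float2 bs[16];
    int tx = threadIdx.x;

    if (tx < 16) bs[tx] = make_float2(1.0f, 0.0f); // stand-in for B tile
    __syncthreads();

    float2 a = A[tx];
    float2 c = make_float2(0.0f, 0.0f);

    #pragma unroll
    for (int j = 0; j < 16; ++j) {
        float2 b = bs[j];    // reg <- [smem], two 32-bit words
        c.x += a.x * b.x;    // four real MADs per complex update
        c.x -= a.y * b.y;
        c.y += a.x * b.y;
        c.y += a.y * b.x;
    }
    C[tx] = c;
}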

Volkov's code is better than CUBLAS 2.3:

CUBLAS 2.3: 277.7 Gflop/s

Volkov's code: 348.5 Gflop/s

I think Volkov's code has the potential to reach 500 Gflop/s, but I cannot do that so far.

If you can achieve over 66% of the peak performance with your code, then obviously there is no explanation other than dual issue. As you mentioned, there is one shared-memory read for every two MADs. Without dual issue, you cannot achieve over 2/3 of the peak performance (counting only MADs), and you did. I am curious too, as dual issue is somewhat of a myth. Maybe we need some professional answers.
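To make the arithmetic explicit (a back-of-the-envelope check, assuming the Tesla C1060's 1.296 GHz shader clock and counting only the SP units' MAD throughput):

$$
P_{\mathrm{MAD}} = 240 \times 2\,\tfrac{\mathrm{flop}}{\mathrm{MAD}} \times 1.296\,\mathrm{GHz} \approx 622\ \mathrm{Gflop/s},
\qquad
\tfrac{2}{3}\,P_{\mathrm{MAD}} \approx 415\ \mathrm{Gflop/s} < 445.7\ \mathrm{Gflop/s}
$$

so the reported 445.7 Gflop/s already exceeds what MAD issue alone could deliver at one shared-memory read per two MADs.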

I downloaded your code, but cannot get it running.

  1. check_sgemm_suqare.cpp is missing. Anyway, I copied it from check_general_sgemm_suqare.cpp and changed the function name in it, which seemed fine.
  2. The CUBLAS test is fine, but none of the other versions run; there is a launch failure.
    device 0, device name = GeForce GTX 285

profile C = A*B on square matrices

Succ: load module method1/method1.cubin
Succ: load kernel method1_sgemmNN
cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <profile_sgemm_suqare.cpp>, line 232 : unspecified launch failure.

Line 232 is cudaThreadSynchronize() right after kernel launch.
Did I do something wrong?

check_sgemm_suqare.cpp is of no use in the final version; I am sorry that a typing error occurs in figure 10 of ReadMe.pdf.

(check_sgemm_suqare.cpp and check_general_sgemm_suqare.cpp differ only in the device allocation sequence:

check_sgemm_suqare.cpp: allocate A first, then B, finally C.

check_general_sgemm_suqare.cpp: allocate C first, then A, finally B.)
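Why would the allocation order matter? An access a few elements past the end of one array may silently land inside the next allocation in one order, but hit unallocated memory (an "unspecified launch failure") in another, so the order can change which out-of-bound runs crash. A minimal sketch of the two orders (the sizes are made up for illustration):

#include <cuda_runtime.h>

// Hypothetical sketch: the same three buffers allocated in the two
// orders used by the check programs. Out-of-bound accesses past d_A
// may behave differently depending on what happens to follow it.
int main()
{
    size_t bytes = 5 * 5 * sizeof(float); // the n = 5 test case
    float *d_A, *d_B, *d_C;

    // check_sgemm_suqare.cpp order: A, then B, then C
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaFree(d_C); cudaFree(d_B); cudaFree(d_A);

    // check_general_sgemm_suqare.cpp order: C, then A, then B
    cudaMalloc((void**)&d_C, bytes);
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaFree(d_B); cudaFree(d_A); cudaFree(d_C);
    return 0;
}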

Please note that methods 1 ~ 4 do not consider out-of-array-bound accesses.

Suppose we use the following code to test method 1:

int main(int argc, char** argv)
{
    int device_num = 0; // GTX 285
    initContext_Drv(device_num); // initialize the context
    profile_sgemm_square("../method1/method1.cubin", "method1_sgemmNN",
        &method1_DrvWrapper, cat(OUTPUT_DIR, "method1/threads320.txt"));
    return 0;
}

Then I re-tested on my machine (Vista 64 + GTX 285), and it works.

However, method 1 should not work at n = 5, since an out-of-array-bound access occurs.

It happens to work on my machine but may not work on yours.

Step 1: I want to make sure that your error comes from the out-of-array-bound access at n = 5.

Please modify the for-loop in function "profile_sgemm_square" of profile_sgemm_suqare.cpp from

for( n1 = 5; n1 <= 4096; n1++ ){
    profile_sgemm_square_unit(sgemmWrapper, sgemm,
        n1, n1, n1, alpha, beta,
        gpuTime, flop_rate, rel_max_err );

to

for( n1 = 64; n1 <= 4096; n1 += 64 ){
    profile_sgemm_square_unit(sgemmWrapper, sgemm,
        n1, n1, n1, alpha, beta,
        gpuTime, flop_rate, rel_max_err );

If stride 64 does not work, please use stride 128.

(Stride 128 must work if the error is purely an out-of-array-bound access.)

Step 2: suppose the error comes from the out-of-array-bound access and you want to test methods 1 ~ 4 on any dimension.

I suggest that you allocate bigger device memory d_A, d_B, d_C in function "profile_sgemm_square_unit"

of profile_sgemm_suqare.cpp, for example:

unsigned int lda_64 = 64 * ((lda + 63)/64); // lda_64 is a multiple of 64
unsigned int ldb_16 = 16 * ((ldb + 15)/16); // ldb_16 is a multiple of 16
unsigned int ldc_64 = 64 * ((ldc + 63)/64); // ldc_64 is a multiple of 64
unsigned int k_16   = 16 * ((k + 15)/16);   // k_16 is a multiple of 16
unsigned int n_16   = 16 * ((n + 15)/16);   // n_16 is a multiple of 16

unsigned int size_A = lda_64 * k_16;
unsigned int mem_size_A = sizeof(float) * size_A;
h_A = (float*) malloc(mem_size_A); assert( h_A );

unsigned int size_B = ldb_16 * n_16;
unsigned int mem_size_B = sizeof(float) * size_B;
h_B = (float*) malloc(mem_size_B); assert( h_B );

// allocate host memory for the result
unsigned int size_C = ldc_64 * n_16;
unsigned int mem_size_C = sizeof(float) * size_C;
h_C = (float*) malloc(mem_size_C); assert( h_C );
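The snippet above pads only the host buffers; the device buffers d_A, d_B, d_C need the same padded sizes so that the out-of-bound reads of methods 1 ~ 4 stay inside allocated memory. A minimal sketch continuing from the sizes computed above (shown with the driver API; substitute cudaMalloc if the function uses the runtime API):

// Device allocations padded the same way as the host buffers
// (mem_size_A/B/C are the padded sizes computed above).
CUdeviceptr d_A, d_B, d_C;
cuMemAlloc(&d_A, mem_size_A);
cuMemAlloc(&d_B, mem_size_B);
cuMemAlloc(&d_C, mem_size_C);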

Step 3: methods 5 ~ 8 take care of out-of-array-bound accesses; you should use

function "profile_general_sgemm_square" to test them.

If an error occurs in methods 5 ~ 8, please tell me.

Thanks.

Thanks! But it still does not work. I am using Ubuntu, and I don't know whether it's OS-related. Anyway, I manually modified my code to an extent similar to your method1_variant, and for N = 4096 I got 485 GFLOPS on a GTX 285, compared to 425 GFLOPS with CUBLAS.
I tried one shared-memory read for four MADs, too, and it is a little slower than CUBLAS. According to the assembly generated by decuda, it is bounded by a low percentage of MAD instructions. I wonder: besides dual issue and parallel shared-memory/MAD execution, is Volkov's code (or CUBLAS) the best possible implementation?

My code was developed on a 64-bit machine; do you use a 32-bit OS?

If so, my code does not work, since the host code is compiled as 32-bit and the size of a pointer is 32 bits,

but the CUDA binary file assumes that the size of a pointer is 64 bits.

I have tested my code on

(1) Fedora 10 x86_64, driver 190.18, CUDA 2.3

(2) Vista 64 and WinXP 64, driver 190.38, CUDA 2.3

If your system is 64-bit, would you please describe the version of the OS, driver, and CUDA library you used?

I may try to find a machine to install your configuration and then test the code.

Second, the best case in SGEMM would be pure "MAD dest, src1, src2, src3" instructions without any data movement from shared memory to registers.

However, matrix B must be shared by all threads, so the movement from shared memory to registers cannot be avoided.

I think Volkov's idea reaches the optimum.

1. I am using a 32-bit OS, and I think that's why.

2. I mean that, theoretically, having each thread compute 16x4 elements should deliver better performance even without dual issue. If we had more registers per SM, we could achieve that. Fermi seems to have twice the number of registers of GT200, but I don't have a card.
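A quick register count shows the pressure (assuming the documented 16384 32-bit registers per SM on GT200 and 32768 on Fermi): a 16x4 accumulator block alone needs

$$
16 \times 4 = 64\ \text{registers/thread},
\qquad
\frac{16384}{64} = 256\ \text{threads/SM on GT200},
\qquad
\frac{32768}{64} = 512\ \text{threads/SM on Fermi}
$$

before counting any temporaries, so doubling the register file roughly doubles the occupancy one can sustain at that blocking.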

I think one work-around may work; you can try it.

In method1_DrvWrapper.cpp, you can extend the 32-bit pointer to a 64-bit pointer by zero extension.

That is, for example, modify

// parameter 3: const float *A
ptr = (void*)(size_t)A;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hfunc, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);

to

// parameter 3: const float *A
// zero extension: on a little-endian host the low 32-bit word (the
// pointer itself) must come first in memory, followed by a zero high word
void* ptr_array[2];
ptr_array[0] = (void*)(size_t)A; // low 32 bits: the pointer
ptr_array[1] = (void*)NULL;      // high 32 bits: zero
ALIGN_UP(offset, __alignof(ptr_array));
cuParamSetv(hfunc, offset, ptr_array, sizeof(ptr_array));
offset += sizeof(ptr_array);

and you can do the same work-around for "float *B" and "float *C".

You can try this work-around.

Yes, you are right.

Could you please check the link provided? I'm unable to read the paper…

Hi guys,

This looks great. A co-worker and I have an application where we could really benefit from this speed. BUT we're both Linux guys and are in need of a bit of assistance in how to use/compile this package. If anybody could throw up a makefile or jot down some instructions on how to call one of these routines from another .cu file, that'd be great.

Thanks.

My CUDA code is compiled to a binary file (.cubin), so you don't need to use nvcc to compile a .cu file.

Just compile the .cpp files with gcc or icpc, and put the .cubin in the same directory so that

the program can load the .cubin at runtime.
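If you want to call one of the kernels from your own program, a minimal driver-API sketch looks like this (the module and kernel names are the ones from method 1 earlier in the thread; error checking is omitted for brevity):

#include <cuda.h>
#include <cstdio>

// Minimal sketch: load a precompiled .cubin at runtime and fetch one
// kernel with the CUDA driver API (CUDA 2.3 style). Parameter setup
// and the actual launch are omitted.
int main()
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "method1.cubin");
    cuModuleGetFunction(&kernel, mod, "method1_sgemmNN");
    std::printf("loaded kernel method1_sgemmNN\n");

    // ... set arguments with cuParamSetv, set the block shape with
    // cuFuncSetBlockShape, then launch with cuLaunchGrid ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}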

However, this code was built under CUDA 2.3.

CUDA 3.0 may change the binary format; I have no idea whether my .cubin works with CUDA 3.0 or not.

I am sorry that I have not tested it.

@LSChien: Thanks for the information. Upon further reading I see that you already had that written down somewhere. Sorry I didn't read more carefully before. You are right, though, about the cubin files. I'm using CUDA 3.1 and a GTX 480. The cubin files won't load, which I guess isn't surprising. I will try it out on another machine with CUDA 2.3 and a C1060 card.

If at all possible, could you provide cubin files for compute capability 2.0? I had no trouble compiling with CUDA 3.1, so I'm curious whether compiling with 3.1 and specifying -arch=sm_20 would solve this problem. I'd be happy to post GTX 480 benchmarks if we could get that running.

Thanks again. Really great work.

Incredible, so you changed the generated microcode using cudasm/decuda to speed up the SGEMM functions. Awesome work!

In fact, my approach depends on the power of decuda/cudasm from http://wiki.github.com/laanwj/decuda/.

Since decuda/cudasm does not support sm_20, I cannot write the assembly code myself.

I think you have two ways:

  1. MAGMA (http://icl.cs.utk.edu/magma/index.html) has released a new version that includes SGEMM.

According to its documentation,

the two main techniques distinguishing the MAGMA BLAS library are

(1) auto-tuners

(2) pointer redirecting

so MAGMA should provide good and uniform performance on SGEMM.

You can try it.

  2. You can also use my code, but for now it only works for C = alpha*A*B + beta*C.

Unless NVIDIA releases its own decuda, we cannot write assembly code for CUDA 3.0.

I would like to use nvcc to compile my code, and I hope the compiler can solve

the "volatile" issue, which is not solved in CUDA 2.3.

My suggestion: you can use method6_variant. In method6_variant.cu,

modify the declaration at line 156 from

float  b_reg;

to

volatile float  b_reg;

and the statement at line 188 from

b_reg = b_ptr[j] * 4.0f;

to

b_reg = b_ptr[j];

Then you can compile the .cu into a cubin file and use the driver API, or

just call the kernel function in your .cpp file.

The work I did was to make sure that "volatile" is indeed volatile.

If nvcc in CUDA 3.0 can do that, there is no need to write assembly code manually.

However, I have no idea whether nvcc can do that or not, because we have no decuda for CUDA 3.0.

If you decide to use my code and cannot make it work, tell me and I will make it work.

Thanks for the info. I'll give it a try. Even without Fermi capability the code is great. My production code gets a ~3X speed-up from using your cgemm. Thanks for providing it.

Hi,
I'm interested in these fast SGEMM implementations, but I'm mainly working on the iPhone, and its GPU (PowerVR's SGX) supports neither CUDA nor OpenCL, though it does support shaders via OpenGL ES 2.0.
Does anyone know a way to easily convert this CUDA/OpenCL SGEMM code into GLSL so that I could use it on the iPhone?
Thanks!