Hand-Tuned SGEMM on GT200 GPU: 10%~20% improvement over SGEMM

I would like to report our SGEMM routine on the GT200 GPU. This work builds on Volkov’s code, which can be downloaded from
http://forums.nvidia.com/index.php?showtopic=89084

Figure 1 shows the performance (Gflop/s) of our method on Tesla C1060, GTX285 and GTX295. The baseline is Volkov’s code on Tesla C1060 (black dashed line).
The core frequency of the GTX285 is 1.135x that of the Tesla C1060, so it is reasonable that the performance of the GTX285 is 1.165x that of the Tesla C1060.
The maximum performance of our method on Tesla C1060 in Figure 1 is 439.467 Gflop/s, which achieves 70% of peak performance without dual issue
(the single-precision peak without dual issue on Tesla C1060 is 624 Gflop/s), whereas Volkov’s code reaches 346 Gflop/s, which is 55.45% of peak.


Figure 1: comparison between our method and Volkov’s code

The technical report and source code can be downloaded from:
(1) technical report: http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEM…m_2010_v1.1.pdf
(2) source code: http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/lsc_sgemm.zip

I hope you can give me some comments and criticism.

Lung Sheng

Lung-Sheng,

I always love your detailed experiments with GPU performance. It takes patience and persistence to understand and exploit CUDA’s subtleties like this. Thanks for sharing the results (and code!)

From my first skim of your report, I see you talk about using decuda, splicing in new PTX instructions, and recompiling.

Have you tried using the inline assembly features of CUDA? It isn’t officially supported, but Kaiyong Zhao (kyzhao here on the CUDA forum) mentioned it to me at GTC09. It uses the fact that nvcc uses a flavor of Open64.

For example, you can write code in CUDA toolkit 2.3 like:

__device__ uint float_to_u8(float value)
{
  uint result;
  asm("cvt.sat.rni.u8.f32 %0, %1;" : "=r" (result) : "f" (value));
  return result;
}

I’m sure the NV boys here will smile uneasily and say “Don’t depend on this ASM feature in CUDA” but it’s easier than splicing the cubin files!

SPWorley, thanks for this trick.

In fact, I don’t modify PTX instructions; I modify the assembly code generated by decuda.

The PTX code realizes my idea, but PTX is not what finally runs: after optimization by nvcc,
the binary code no longer realizes my idea. That is why I modify the assembly code generated by decuda.

However, this trick would be very helpful to me if it works, because modifying the assembly code
generated by decuda and splicing the binary file is time-consuming. On average, I need
nearly one hour to do this.

I tried this trick on Volkov’s code but ran into a serious problem.

I replaced “b_reg = b_ptr[j];” in the following code

float *b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    float A_reg = A[0]; A += lda;
#pragma unroll
    for( int j = 0; j < 16; j++){
        b_reg = b_ptr[j];
        c[j] += A_reg * b_reg;
    }
    b_ptr += 17;    // b_ptr = &b[i][0]
}// for each column index of sub-matrix of A

with asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr[j]) ), giving

float *b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    float A_reg = A[0]; A += lda;
#pragma unroll
    for( int j = 0; j < 16; j++){
        asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr[j]) );
        c[j] += A_reg * b_reg;
    }
    b_ptr += 17;    // b_ptr = &b[i][0]
}// for each column index of sub-matrix of A

but a compilation error occurs:

./volkov_v2.cu(187): Advisory: Loop was not unrolled, inline assembly
./volkov_v2.cu(183): Advisory: Loop was not unrolled, not an innermost loop
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx, line 1; error   : Missing .version directive at start of file 'C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx'
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx, line 1; error   : Missing .target directive at start of file 'C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_0000054c_00000000-4_volkov_v2.ptx'
ptxas fatal   : Ptx assembly aborted due to errors

I think this is because of the memory operand constraint (m); the instruction should
look like “ld.shared.f32 %f277, [__cuda_b64 + offset];”.

I cannot find useful material (an example) for doing “shared memory —> register”.

Any ideas?

I think the loop unrolling warnings are a general problem with nvcc - if you try unrolling a loop with a texture read in it (which I presume also generates inline assembly), nvcc reports the same thing. For the inline assembly itself, try

__asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+j) );

and see what happens.

I admit I am nowhere near as skilled as you with this kind of low level hacking, but a promising idea might be to expand the loop yourself manually… I’m not sure you can use the [j] indirection in the asm string. So you could try:

asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+1) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+2) );
c[2] += A_reg * b_reg;
...
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+15) );
c[15] += A_reg * b_reg;

(maybe those pointer increments go 0, 4, 8, 12… 60 ?)

Hi LSChien,

That is a very clever idea! :)

I don’t know if this is any help, but in my very simple example I’ve been able to use NVCC by itself to transfer smem data to regs and then do MAD instructions with all operands in regs.
http://forums.nvidia.com/index.php?showtop…mp;#entry993085

Case 2 has the desired effect.

The trick is to put a __syncthreads() after the smem-to-reg transfer. But I’ve found that putting in more than one __syncthreads() (case 1 in my example) leads to using operands from smem.

Thank you, Nikolai.

I tried your suggestion (case 2) in my algorithm, method 1:

method 1 + __syncthreads() after “b_reg = b_ptr[j];”

b_ptr = (float*)b;
#pragma unroll
for( int i = 0; i < BLOCK_SIZE_X; i++ ){
    float A0_reg = A[0]; A += lda;
    float A1_reg = A1[0]; A1 += lda;
#pragma unroll
    for( int j = 0; j < BLOCK_SIZE_Y; j++){
        b_reg = b_ptr[j];
        __syncthreads();
        c0[j] += A0_reg * b_reg;
        c1[j] += A1_reg * b_reg;
    }
}

However, nvcc reports an error message:

tmpxft_00000bb8_00000000-3_method1_thread.cudafe1.gpu
tmpxft_00000bb8_00000000-8_method1_thread.cudafe2.gpu
nvopencc ERROR: C:\CUDA\bin64/../open64/lib//be.exe returned non-zero status -1073741819

So I disabled full loop-unrolling of the outer loop and changed it to #pragma unroll 8
(BLOCK_SIZE_X = 16, BLOCK_SIZE_Y = 16). I also use the compiler option “-maxrregcount 48”,
so that the number of active threads = 320 (register count per thread = 46),
the same as method1_variant in the report.

This code is called method1(synch):

b_ptr = (float*)b;
#pragma unroll 8
for( int i = 0; i < BLOCK_SIZE_X; i++ ){
    float A0_reg = A[0]; A += lda;
    float A1_reg = A1[0]; A1 += lda;
#pragma unroll
    for( int j = 0; j < BLOCK_SIZE_Y; j++){
        b_reg = b_ptr[j];
        __syncthreads();
        c0[j] += A0_reg * b_reg;
        c1[j] += A1_reg * b_reg;
    }
}

I indeed obtain “shared memory —> register” in the decuda output.

However, the performance in Figure 2 is not good, only 250 Gflop/s.

I think this is because of too many __syncthreads().

Figure 2: performance comparison between method1(variant) and method1(synch)

method1(synch) only reaches 250 Gflop/s, 30% slower than method1(variant).

To achieve high performance, we need to remove the extra __syncthreads() operations.

However, we must then reset the offsets of the jump instructions after removing them.

cudasm can configure the offsets of jump instructions, but we need to check the other binary code
generated by cudasm.

Besides, there is one strange thing: if I unroll the inner loop explicitly, then
nvcc removes the extra __syncthreads() automatically and uses “MAD dest, [smem], src2, src3”;
I don’t know why.

By the way, could you share your example?

Thanks, avidday and SPWorley, for your opinions.

__asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+j) );

and

asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+1) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+2) );
c[2] += A_reg * b_reg;
...
asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+15) );
c[15] += A_reg * b_reg;

do not work when compiling to a binary file, because nvcc translates

asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b_ptr+0) );

to

ld.shared.f32 %f7, b_ptr ;

but ptxas does not recognize the variable “b_ptr”. The error message is

./volkov_v2.cu(183): Advisory: Loop was not unrolled, inline assembly
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_00001c9c_00000000-4_volkov_v2.ptx, line 149; error   : Address expected for argument 1 of instruction 'ld'
ptxas C:\DOCUME~1\lschien\LOCALS~1\Temp/tmpxft_00001c9c_00000000-4_volkov_v2.ptx, line 149; error   : Unknown symbol 'b_ptr'
ptxas fatal   : Ptx assembly aborted due to errors

If one uses

asm("ld.shared.f32 %0, %1;" : "=f" (b_reg) : "m" (b+0) ); // b is shared memory

then nvcc translates it to

ld.shared.f32 %f7, asm.by.address.temp_0 ;

so I think something is wrong in the translation of shared-memory operands.

Finally, I designed a workaround:

float A_reg;
float *b64 = (float*)b;
A_reg = A[0]; A += lda;
asm("ld.shared.f32 %0, [__cuda_%1+0];" : "=f" (b_reg) : "m"(b64) );
c[0] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+4];" : "=f" (b_reg) : "m"(b64) );
c[1] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+8];" : "=f" (b_reg) : "m"(b64) );
c[2] += A_reg * b_reg;
....
c[12] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1072];" : "=f" (b_reg) : "m"(b64) );
c[13] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1076];" : "=f" (b_reg) : "m"(b64) );
c[14] += A_reg * b_reg;
asm("ld.shared.f32 %0, [__cuda_%1+1080];" : "=f" (b_reg) : "m"(b64) );
c[15] += A_reg * b_reg;

which is the explicit loop-unrolling of:

float A_reg;
float *b64 = (float*)b;
#pragma unroll
for( int i = 0; i < 16; i++ ){
    A_reg = A[0]; A += lda;
#pragma unroll
    for( int j = 0; j < 16; j++){
        int offset = 4*(j+17*i);
        asm("ld.shared.f32 %0, [__cuda_%1];" : "=f" (b_reg) : "m" (b64 + offset) );
        c[j] += A_reg * b_reg;
    }
}

The PTX code shows the correct translation:

[code]
ld.shared.f32 %f22, [__cuda_b64+0];
ld.shared.f32 %f24, [__cuda_b64+4];
...
ld.shared.f32 %f278, [__cuda_b64+1080];
[/code]

However, the binary code still uses “MAD dest, [smem], src2, src3”, not “MAD dest, src1, src2, src3”
(one can verify this via decuda). Even though we can use inline assembly to write PTX code,
nvcc does its own optimization and may not match your design.

30% slower?! That’s unfortunate…

I think the syncthreads() is causing some kind of latency… I’m guessing you’re running multiple blocks per MP; maybe the syncthreads is preventing the other blocks from hiding the smem latency (but at 320 threads, I wouldn’t think there would be much).

Because if you just count instructions: every time you issue a syncthreads instruction, you also issue 16 MAD instructions (right?) and 1 MOV instruction. Hence if you include the syncthreads it should run at 17/18 (or 94.4%) of your original method. Now that I think about it, even 5.6% is a lot considering it cuts half of your 10% improvement. So the only benefit is that you can use NVCC… but unfortunately syncthreads isn’t behaving as we would like.

Here is the kernel of my code:

[codebox]__global__ void min_ops( float *gA, float *gB )
{
    int i;
    const int tx = threadIdx.x;
    volatile float a[9], b[9];
    volatile float c[9] = {0,0,0,0,0,0,0,0,0};
    volatile __shared__ float sA[NUM_ELEMENTS], sB[NUM_ELEMENTS];

#pragma unroll 20
    for ( i = 0 ; i < TEST_ITERATIONS ; i++ )
    {
        smem2reg(a, b, sA, sB, tx);   // smem -> regs
        __syncthreads();
        mad_ops(a, b, c);             // MADs with all operands in regs
        reg2smem(c, sA, tx);          // regs -> smem
    }
    __syncthreads();
    reg2gmem(c, gA, tx);
}[/codebox]

I ran it using 1 block per MP with 192 threads.

smem2reg: moves 18 vectors
mad_ops: performs 27 MAD operations
reg2smem: moves 9 vectors

At 192 threads, I’ve found I only achieve 88% of the smem bandwidth.

Incorporating the 88% with clock-cycle counting, I can accurately account for every instruction issued in the loop, where the syncthreads() instructions accounted for 4 cc per warp (which of course is the expected throughput).

However, in my example I have 1 syncthreads against (27 MAD instructions + 27 MOV instructions), thus I only experience 1/55 = 1.8% slower code due to syncthreads.

Thanks for trying, and great job on the SGEMM :thumbup:

Nikolai, thanks for the explanation.

You are right, the cost of __syncthreads() can be estimated.

In my code there are 2 MADs + 1 MOV per __syncthreads(), which means I get 1/3 extra performance degradation;
this matches the experimental result.

So the key point is how many __syncthreads() we have to pay for.

Oh geez, you’re only doing 2 MADs? Because then using syncthreads is not viable (i.e. your 30% reduction in speed), sorry about that :">

Great performance!

Note that peak performance in multiply-and-add is 624 Gflop/s with or without dual issue. Comparing to the 50% higher peak, which implies another multiply done in parallel, would be misleading for matrix-matrix multiply performance analysis - you can’t use that multiply anyway.

Then I wonder, if your 70% is the limit. If you know how to find the way around the “multiply-and-add with operand in shared memory” bottleneck, then you are no more bound by its 66% throughput. Why not push all the way up to, say, 90% of peak?

I like that you did it in native asm. It’s too bad that GPU companies are not serious about asm support. Many HPC problems would be so much simpler if asm were available. It would also encourage independent compiler development. I secretly hope that competition with Intel will change that ;)

Thanks, you are right. In SGEMM we always count the number of MAD operations, so 624 Gflop/s is the peak performance.

I checked the assembly code of my method 1: it has a regular pattern, one “MOV reg, [smem]” and two “MAD dest, src1, src2, src3” interleaved with each other.

Such a pattern is good for dual issue; I wonder if I can improve it from 70% to 90%.

On the contrary, I also implemented volkov_variant, which modifies “MAD dest, [smem], src2, src3” in Volkov’s code into
“MOV reg, [smem]” and “MAD dest, src1, src2, src3”; however, no speedup is seen in volkov_variant.

The reason is clear now: the pattern of one “MOV reg, [smem]” and one “MAD dest, src1, src2, src3” is not regular;
several “MOV reg, [smem]” are grouped together (I think this is a side effect of loop-unrolling).

It would be necessary to write the assembly manually in volkov_variant so that
one “MOV reg, [smem]” and one “MAD dest, src1, src2, src3” alternate regularly.

Then we could check the effect of dual issue on Volkov’s code.

The original Volkov code is more competitive since it has 512 active threads per SM.

I will re-survey cudasm, try to write the assembly manually, and report the result later.

It may not be due to dual-issue then. One smem-MOV per two register-MADs means that you are using a kind of register blocking, which reduces smem traffic per flop. The problem may not be simply in MADs-with-smem-operand, but in accessing smem in parallel with MAD no matter how you do that - using one or multiple instructions. A kind of bandwidth bottleneck. In that case you might still get further speedup if you use larger register blocks, such as one smem-MOV per four register-MADs.

I also wonder why you get 55.45% on GT200 with my code. I get 60% on GTX280.

That was my original idea: reduce the number of “reg <-- shared memory” moves.

However, in the case of “one smem-MOV per four register-MADs”, there are only 192 active threads per SM,
and performance is not good.

I get about 55.4% on Tesla C1060, GTX295 and GTX285.

For dimension n > 2046, I execute SGEMM twice and average the results.

There are some ideas in this PAPER that may be of some relevance here. Although they do not achieve as fast an SGEMM (still faster than Volkov’s, though), there are some ideas that may be relevant to further accelerating your SGEMM. In particular, the experiments on how to obtain peak performance in MAD operations (registers over shared memory, as you have already observed, but the order of operands in adjacent MAD instructions also seems to matter).

Thanks for sharing this paper.

The speedup in the paper is due to
(1) loading 64-bit (float2) instead of 32-bit (float), and
(2) reducing the number of __syncthreads() by using a 16x32 sub-block.

However, the MAD operation is fixed as c = a*b + c, so “a = b*a + b; b = a*b + a” is impossible in SGEMM.

I think that “reduce the number of __syncthreads()” is a good point; I will try it.

Hi, LSChien.
I’m wondering what the performance of your tuned SGEMM is on G80 GPUs compared to Volkov’s SGEMM code.
If the improvement is not due to dual issue but to the reduced use of smem, shouldn’t it also improve performance on G80?

My method is not good on G80, since under compute capability 1.1 there are only 128 active threads per SM.

“128 active threads” cannot hide the pipeline latency.

Hence I expect that my method would be much slower than Volkov’s code on G80.

Thanks for your opinion. I have one 8800GT; I will report my result on the 8800GT later.

I tested Volkov’s code and “method 1” on the 8800GT
(single-precision peak performance without dual issue is 336 Gflop/s).

Volkov’s code reaches 186 Gflop/s, but
“method 1” only reaches 135 Gflop/s (38% slower than Volkov’s code).

So I cannot verify which factor, “dual issue” or “reduced use of smem”, matters on the 8800GT.