my speedy SGEMM

The matrices in the test were 750x750, which is typical for users of our application.

The 8 GFLOPS figure was actually for MKL on the dual-core CPU, and 5 GFLOPS on the 8500 GT.

This is fine, as we will advise our customers to get at least an 8800 board, but we wanted to try some of the low-end devices to see whether we should build in some calibration steps so that, for some supported NVIDIA devices, we continue to use the CPU even though the device is supported. Perhaps a better route would be to check the GPU device at program startup and not use it if it is below a certain level? That is just not as attractive: as the relative performance of the GPU vs. the CPU changes over time, the cases of when and when not to use the GPU may change.
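For what it's worth, a minimal sketch of such a startup check (the helper name and the cutoff are purely illustrative, and a timed calibration run of a small SGEMM on both paths would be more robust, for exactly the reason above):

#include <cuda_runtime.h>

// Decide at program startup whether the SGEMM path should use the GPU at all.
// Returns 1 to use the GPU, 0 to fall back to the CPU/MKL path.
int use_gpu_for_sgemm( void )
{
    int count = 0;
    if( cudaGetDeviceCount( &count ) != cudaSuccess || count == 0 )
        return 0;                       // no usable CUDA device

    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );

    // Crude heuristic: require a minimum number of multiprocessors
    // (an 8500 GT has 2, an 8800 GTX has 16); the cutoff of 8 is just an example.
    return prop.multiProcessorCount >= 8;
}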

On a Tesla card with CUDA 1.1, these are the results:

./sgemmtest_pinned -b -m750 -n750 -k750

args: ta=N tb=n m=750 n=750 k=750 alpha=-1 beta=2

^^^^ elapsed = 0.01417416 sec GFLOPS=59.5273
^^^^ elapsed (GPU only) = 0.01112596 sec GFLOPS=75.8361
^^^^ Effective BW = 2952.56 MB/s

The first number (59 GFLOPS) includes I/O from pinned memory; the second number is for the GPU only.

Using a size that is a multiple of 32 will increase the speed (you can do the padding on the host when you allocate the matrix, or you can write a simple kernel that will do it on the card).

./sgemmtest_pinned -b -m768 -n768 -k768

args: ta=N tb=n m=768 n=768 k=768 alpha=-1 beta=2

^^^^ elapsed = 0.01210994 sec GFLOPS=74.8121
^^^^ elapsed (GPU only) = 0.00894906 sec GFLOPS=101.236
^^^^ Effective BW = 2985.62 MB/s

Having some calibration at startup is a good idea.

What a coincidence that this thread gets zombified. I’ve just dusted off my matrix-mul code (from the OP) and tried it on the new SDK yesterday.

A quick recompile saw performance nearly halve, and a bit of changing this and flipping that recovered most of it, but not all. Clearly, the chaotic compiler hasn’t gained a lot of brains.

But what I discovered on the boards that really surprised me is that wumpus completed his toolchain, including a cubin assembler (!). I’m going to try it out to see if it gives me more control and more predictable performance.

EDIT: found out what’s wrong with the new SDK. ptxas is now using 37 registers. In 1.0, ptxas would use that many at -O1 through -O4, but only 32 with -O0 (optimizations disabled). Now it just thinks it’s smarter than me, I guess.
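(For anyone who wants to check register usage themselves, something like the following works; these are standard nvcc options rather than anything from this thread, and the file name is a placeholder.)

# pass -O0 through to ptxas and have it report per-kernel register usage
nvcc -Xptxas -O0 -Xptxas -v -cubin mykernel.cu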

On an 8600 GT @ 625/800, with ForceWare 169.13:

[attached benchmark results]

It should be interesting to analyze any possible performance impact of the processor (and HyperTransport, on my good old Athlon64 3200+ @ 2.5 GHz).

I was able to take a much closer look with wumpus’s disassembler.

a = Ashared[threadIdx.y*OUTPUTS_Y + 0][k];
out00 += b1.x * a;
out01 += b1.y * a;
a = Ashared[threadIdx.y*OUTPUTS_Y + 1][k];
out00 += b1.x * a;
out01 += b1.y * a;
… loop unrolled …

Using ptxas 1.0 -O0 resulted in this neat code:

op.12 // (unk0 04018078)// (unk1 0400c000)
mad.rn.f32 $r18, $r0, $r30, $r18
mad.rn.f32 $r17, $r30, $r1, $r17

(op.12 is a mov from shared memory that uses hard-coded offsets.)

Ptxas 1.1 -O0 gives:
add.b32 $ofs3, $ofs2, 0x00000300
op.12 // (unk0 0c000090)// (unk1 0400c000)
mad.rn.f32 $r23, $r8, $r36, $r23
mad.rn.f32 $r22, $r36, $r9, $r22

Instead of just using offsets, ptxas turns it into a dependent address calculation whose latency screws everything up.

Ptxas 1.1 -O4:
add.b32 $ofs3, $ofs2, 0x00000f00
mad.rn.f32 $r21, s[$ofs4+0x0000], $r11, $r21
mad.rn.f32 $r21, s[$ofs4+0x0000], $r11, $r21

Now the load has been duplicated (wtf for?) and shoved into the mad instructions themselves. An unnecessary address calculation is still being performed, but it’s been pipelined to hide the latency. This code is slower than ptxas 1.0 -O0, but is identical in speed to CUBLAS.

Thanks to wumpus, I am now able to see why the inner loop of my code is much slower with v1.1 of the SDK.

Folks,

Good idea to test gemm codes. Is anyone running matrices larger than 1024x1024 in their testing?
Some timings for larger matrices would be of great interest to me.

It is OK if they are blocked as multiples of 32 or 64, or powers of two.

Has anyone tried the Tesla for gemm work? If so, what are the numbers?

Cheers

waveone

Why does a multiple of 32 give better performance? If the matrix is not square, is it still beneficial for all the dimensions to be multiples of 32?

If the performance disparities are typical, I suggest that this be made an option or the default behaviour for SGEMM.

Also, I guess I’m not clear on ‘padding’. Do you mean just the size of the memory transferred? If the user requests 750x750, that size has to be returned, and ‘padding’ the size up obviously changes the result.

The idea of padding is the following:

  1. transform the original arrays from 750x750 to 768x768 (filling all the extra space with zeros)
  2. call sgemm on the 768x768 arrays.
  3. get the original 750x750 result back.

You can write a small kernel that does this on the device.
Even better, just allocate your arrays on the host with dimensions that are a multiple of 32, and you will only pay a small overhead in the transfer.

Even MKL will be faster with sizes that are a multiple of 32.
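A minimal sketch of the "small kernel on the device" route (names, sizes, and launch configuration are just illustrative; column-major storage is assumed, with the destination buffer pre-cleared to zero, e.g. with cudaMemset):

__global__ void pad_copy( const float *src, int n, float *dst, int npad )
{
    // Copy an n-by-n column-major matrix into the top-left corner of an
    // npad-by-npad buffer; the extra rows/columns stay zero.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if( row < n && col < n )
        dst[row + col * npad] = src[row + col * n];
}

// usage for 750 padded up to 768:
//   cudaMemset( d_Apad, 0, 768 * 768 * sizeof(float) );
//   dim3 threads(16, 16), grid((750 + 15) / 16, (750 + 15) / 16);
//   pad_copy<<< grid, threads >>>( d_A, 750, d_Apad, 768 );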

Thank you for the explanation. Indeed, doing this for the MKL SGEMM also made about a 10% difference. It was interesting, however, that this padding made very little difference for the MKL DGEMM. When CUBLAS is enhanced to support double precision, can we expect that the padding will also be less critical there, or will it likely remain as important as it is with the CUBLAS SGEMM?

Also, I am not clear on why this enables better performance. It would be much appreciated if you could provide, or point me to, a source that gives some explanation for this.

Beau and others,

If the data flow into a GPU board goes through a cache memory which has a “hash” table mod a power of two, then array sizes of powers of two can have horrible memory access patterns. I commonly use a(dim+1,dim) in Fortran and a(dim,dim+1) in C to create an access pattern that avoids the cache address conflict.
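As a concrete illustration of that trick (the size 1024 is just a placeholder), in C:

/* C, row-major: pad the row length by one so that consecutive rows do not all
 * map to the same cache set when the dimension is a power of two. */
enum { DIM = 1024 };
static float a[DIM][DIM + 1];

/* The Fortran, column-major equivalent pads the leading dimension instead:
 *     real a(DIM+1, DIM)                                                   */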

On the Sony Cell we have to pad out to 128 bytes (not sure that is the exact number); it is a big pad, but it aligns the starting address with the DMA transfer block size.

Hope this helps,

Cheers

wave one

CUBLAS prefers multiples of 32 for all matrix dimensions because the
tile size for single-precision BLAS-3 in CUBLAS is 32x32 (this matches
well with having warps of 32 threads).

CUBLAS also prefers storage dimensions that are a multiple of 32 to get
alignment needed for coalescing global memory accesses when processing
submatrices (e.g. as part of a solver).

Thus, for best performance, SGEMM arguments lda, ldb, ldc, m, n, k should
all be multiples of 32.
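As a hedged illustration only (using the CUBLAS 1.x C interface; the 768 figure comes from the padding example earlier in the thread, and the helper name is made up), a call where lda, ldb, ldc, m, n and k are all multiples of 32 looks like:

#include "cublas.h"

/* d_A, d_B, d_C are device pointers to 768x768 column-major matrices. */
void padded_sgemm( const float *d_A, const float *d_B, float *d_C,
                   float alpha, float beta )
{
    const int npad = 768;                     /* 750 rounded up to a multiple of 32 */
    cublasSgemm( 'N', 'N', npad, npad, npad,  /* m, n, k                            */
                 alpha, d_A, npad,            /* lda                                */
                        d_B, npad,            /* ldb                                */
                 beta,  d_C, npad );          /* ldc                                */
}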

In MKL, the increase in performance you are seeing is due to better utilization of the cache.

BTW, there is no cache in the G8x for global memory.

I managed to achieve 178 Gflop/s in sgemm on GeForce 8800 GTX, which is 40% faster than the best rate of 127 Gflop/s I’ve seen with CUBLAS 1.1. If you care, you can check the attached code.

I have a related question. What is the peak performance that can be achieved with the multiply-add instruction when one of its operands is in the shared memory? (This setting is typical for sgemm.) The best I could get when all operands are in registers is 338 Gflop/s, which is pretty close to the theoretical peak of 346 Gflop/s. But if one of the operands is in the shared memory, I get only 229 Gflop/s, which is curiously close to 346/1.5 = 231 Gflop/s, that is 6 cycles per FMAD for a warp on one multiprocessor versus the usual 4 cycles. Is it possible to do it faster than this number?

Accessing an smem operand is as fast as a register. However, additional instructions are needed to prepare the indexing into smem array. If I had to guess, I would say you lose perf to those computations, especially if you have many operations with smem operands which require different indices.

Paulius

I also get 178 GFlop/s on a Quadro FX 5600 (on a machine running Linux).
Very impressive! I had to update my drivers though.

So this example is for C = \alpha A B^T + \beta C. I wanted to see what would happen for C = \alpha A^T B^T + \beta C, so in sgemmNT I modified the lines

// A += ibx + inx + __mul24( iny, lda );
A += __mul24(ibx+inx, lda) + iny ;

and
// a[iny][inx] = A[i*lda];
// a[iny+2][inx] = A[(i+2)*lda];
a[iny][inx] = A[i];
a[iny+2][inx] = A[i+2];

As the code and the memory accesses to the shared memory are the same,
I expected to see the same speed, but instead, for n=2400 for example,
I get only 27.39 GFlop/s. What might be the problem here?

Thanks for trying it on a different GPU! It’s interesting that it runs at the same peak rate. Did it run slower with older drivers?

The problem is that you lose memory coalescing in accessing matrix A.

I tried different solutions to that problem when multiplying \alpha AB+\beta C, which is similar to \alpha A^TB^T+\beta C. The best solution in terms of peak performance was transposing one of the input matrices (B in my case, A in yours) and then running the code without changes. That can make sense for square matrices, since transposing is only O(n^2) work. To avoid extra memory requirements I transposed each 32x32 block of the input matrix in place, ran the slightly modified sgemm code, then transposed it back. That gave 176 Gflop/s peak. Another solution that was better in some other cases was accessing the input matrix using texture fetches. That gave 158 Gflop/s peak.
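For anyone curious, a minimal sketch of that in-place 32x32 tile transpose idea (this is not vvolkov's actual code; it assumes a column-major matrix whose size and leading dimension are multiples of 32, launched with dim3 grid(n/32, n/32) and dim3 threads(32, 8)):

__global__ void transpose_tiles_inplace( float *A, int lda )
{
    __shared__ float tile[32][33];            // 33, not 32, to avoid bank conflicts

    // Pointer to this block's 32x32 tile.
    float *blk = A + blockIdx.x * 32 + blockIdx.y * 32 * lda;

    int x = threadIdx.x;                      // 32x8 threads, each covers 4 rows
    for( int y = threadIdx.y; y < 32; y += 8 )
        tile[y][x] = blk[x + y * lda];        // coalesced read of the tile
    __syncthreads();
    for( int y = threadIdx.y; y < 32; y += 8 )
        blk[x + y * lda] = tile[x][y];        // coalesced write, transposed in place
}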

I’ve observed the same effect as vvolkov, and as I’ll show below, I’m pretty sure it’s not the pointer math. But first, a quick aside to briefly explain the pointer math a bit more: from what I can tell from wumpus’s disassembler (btw, a fantastic tool for optimization), there are four base pointer registers and a maximum offset of ~0x80 from these registers when accessing a shared memory operand in a mad. This can result in additional instructions to set up the base pointers if you are not careful with the span of shared memory accesses. For example, transposing the Bs sub-matrix in the SDK multiply to make it Bs[tx][k] instead of Bs[k][tx] (and making it a [16][17] array to eliminate the resulting bank conflicts) reduces the number of instructions in the inner loop due to fewer base pointer manipulations.
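To make that tweak concrete, here is a hedged sketch of it applied to an SDK-style 16x16 tiled multiply (row-major, n a multiple of 16; this is not the poster's actual kernel):

__global__ void matmul_tiled( const float *A, const float *B, float *C, int n )
{
    // Launch with dim3 grid(n/16, n/16) and dim3 threads(16, 16).
    __shared__ float As[16][16];
    __shared__ float Bs[16][17];              // 17, not 16: avoids bank conflicts below

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * 16 + ty;
    int col = blockIdx.x * 16 + tx;
    float Csub = 0.0f;

    for( int t = 0; t < n; t += 16 ) {
        As[ty][tx] = A[row * n + (t + tx)];
        Bs[tx][ty] = B[(t + ty) * n + col];   // B tile stored transposed
        __syncthreads();

        for( int k = 0; k < 16; ++k )
            Csub += As[ty][k] * Bs[tx][k];    // was As[ty][k] * Bs[k][tx]
        __syncthreads();
    }
    C[row * n + col] = Csub;
}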

Back to the problem at hand: the disassembly for sgemmNT (vvolkov’s kernel) shows a tight inner loop of 92 instructions, of which 64 are shared-memory MAD instructions. Assuming that all global read latency can be perfectly hidden (a big if, but one I will address), I would guess that this code should hit 64/92 = 69.5% of the theoretical peak GFLOP/s, assuming all instructions are 4-cycle. Everything in the inner loop is simple arithmetic (no 32-bit muls, no divs, etc.) and there is no branching or other divergent behaviour. Yet the observed performance is 30% slower than I would expect.

I get a very similar drop-off in performance for my own sgemm kernel, which uses a different approach than vvolkov’s, so I began to suspect this isn’t just unhidden memory latency.

To further explore this, I threw in a bunch of different new instructions into the inner loop of the kernel. I tried adding “add” instructions, and three flavours of mad: mad with a constant, mad with a register, and mad with a shared memory operand. In each case, I added 64 instructions, which I validated in the disassembler output. The results:

CODE               Inst    Obs. GFLOP/s    Obs/Expected    Overhead (insts)

original            92        114.18       71.2%            37.1 

+64 adds           156         77.08       81.5%            32.8

+64 const mads     156         74.33       78.6%            42.4

+64 reg mads       156         72.89       77.1%            46.3

+64 shared mads    160         62.09       67.3%            77.5

The “Overhead (insts)” column is how much longer the code takes to run, expressed in units of 4-cycle instructions. Now, as we add more instructions to the compute loop, if there was any unhidden global memory access latency in the original, we’d start hiding more of that latency. That’s what seems to be happening in the +64 adds case. The mads all seem to take a little longer than the adds; perhaps this is just instruction fetch/issue time due to the instruction complexity (there are only 2 warps per block in flight in this kernel, so maybe it can’t all be hidden).

The important observation is that the shared mads clearly increase the overhead by far the most. In fact, adding 64 shared mads adds roughly 32 instructions’ (about 128 cycles’) worth of extra overhead, which aligns with vvolkov’s feeling that shared mads are taking an (effective) 6 cycles to execute.

It would be great if someone from NVidia could shed some further light on the subject.

Thanks,

Paul

If we assume that those 64 MADs take 6 cycles per warp each and the remaining 92-64 = 28 instructions take 4 cycles per warp each, then with 16 multiprocessors running at 1.35 GHz we get 128*32*16 flops / (64*6 + 28*4 cycles) * 1.35 GHz = 178 Gflop/s, which is exactly the peak rate achieved in this code on GeForce 8800 GTX.

I could get only 114 Gflop/s in MAD with one operand in the shared memory when there are 2-way bank conflicts in each access and 57 Gflop/s when the conflicts are 4-way (229 Gflop/s if no conflicts). This could mean that it slows down in accessing the shared memory, not in preparing the indexing.

A few days ago I had a chance to talk to a guy in architecture at NVIDIA. As I understood him, there may indeed be an extra overhead when an operand of an arithmetic operation is in the shared memory. This might be fixed in future architectures.

Vasily

Hi vvolkov,

BTW, I’ve got a couple of tweaks (micro-optimizations) for your kernel. The first is to combine your a and b matrices. This should make no difference at all, except that the compiler doesn’t seem to like sharing base pointers for independent shared arrays. This increases performance from 113.5 → 117.5 GFLOPs on my 8800 GTS 640MB.

The second change is to perform the global load in the computation portion of the kernel instead of in a separate section. Sometimes the PTXAS tool pushes these loads to the bottom of the section, despite there being enough registers to do it as early as possible, but when it doesn’t, this gives another ~4.5 GFLOPs on my card. Basically, it ekes out the last bit of latency hiding.

Could someone try this kernel out on their GTX or Ultra?

__global__ void sgemmNT( const float *A, int lda, const float *B, int ldb, float* C, int ldc, int k, float alpha, float beta )
{
  int inx = threadIdx.x;
  int iny = threadIdx.y;
  int ibx = blockIdx.x * 32;
  int iby = blockIdx.y * 32;

  A += ibx + inx + __mul24( iny, lda );
  B += iby + inx + __mul24( iny, ldb );
  C += ibx + inx + __mul24( iby + iny, ldc );

  float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
  float preA0, preA1, preB0, preB1;

  // Fetch the first set of globals; if worried about a segfault, guard against the k < 4 case with an if statement.
  preA0 = A[0];
  preA1 = A[2*lda];
  preB0 = B[0];
  preB1 = B[2*ldb];

  for( int i = 0; i < k; i += 4 ) {
    __shared__ float ab[2][4][32];
    #define AS(x,y) ab[0][x][y]
    #define BS(x,y) ab[1][x][y]

    AS(iny,inx)   = preA0;
    AS(iny+2,inx) = preA1;
    BS(iny,inx)   = preB0;
    BS(iny+2,inx) = preB1;
    __syncthreads();

    /* Prefetch global memory for the next iteration.  This generally helps to improve performance,
     * but also seems to cause more noise in the runtime of the kernel.  We assume it's OK to read
     * beyond the array boundaries; technically this could cause the equivalent of a segfault in
     * the GPU, in which case we could solve the problem by stopping the for loop at k-4, and
     * then adding a last "fix-up" loop iteration without a prefetch.  Adding an if statement
     * here will kill performance and thus is off the table. */
    preA0 = A[(i+4)*lda];
    preA1 = A[(i+6)*lda];
    preB0 = B[(i+4)*ldb];
    preB1 = B[(i+6)*ldb];

    #pragma unroll
    for( int j = 0; j < 4; j++ ) {
      float _a = AS(j,inx);
      float *_b = &BS(j,0) + iny;
      c[0]  += _a*_b[0];
      ...
      c[15] += _a*_b[30];
    }
    __syncthreads();
  }

  for( int i = 0; i < 16; i++, C += 2*ldc )
    C[0] = alpha * c[i] + beta * C[0];
}