Hi vvolkov,
BTW, I’ve got a couple of tweaks (micro-optimizations) to your kernel. The first is to combine your a and b matrices into a single shared array. This should make no difference at all, except that the compiler doesn’t seem to like sharing base pointers between independent shared arrays. This change takes performance from 113.5 → 117.5 GFLOPs on my 8800 GTS 640MB.
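For anyone who doesn’t want to diff the kernels, this is all the first tweak amounts to (the separate as/bs arrays are just my sketch of the "before" state; the combined ab array is what the kernel below actually uses):

// Before: two independent shared arrays, i.e. two base pointers for ptxas to track.
__shared__ float as[4][32];
__shared__ float bs[4][32];

// After: one shared array, one base pointer, accessed through thin macros.
__shared__ float ab[2][4][32];
#define AS(x,y) ab[0][x][y]
#define BS(x,y) ab[1][x][y]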
The second change is to perform the global loads in the computation portion of the kernel instead of in a separate section. Sometimes the PTXAS tool still pushes these loads to the bottom of the section, despite there being enough registers to issue them as early as possible, but when it doesn’t, this gives another ~4.5 GFLOPs on my card. Basically, it ekes out the last bit of latency hiding.
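In case the wording is unclear, here’s the pattern on a toy kernel that has nothing to do with sgemm; it just shows where the load sits relative to the math (the toy_prefetch name and the assumption that n is a multiple of blockDim.x are mine):

// Toy illustration of issuing the prefetch inside the compute section, not the sgemm kernel itself.
// Assumes n is a multiple of blockDim.x so the prefetch never runs past the end of 'in'.
__global__ void toy_prefetch( const float *in, float *out, int n )
{
    int tid = threadIdx.x;
    float pre = in[tid];            // first element, fetched up front
    float acc = 0.0f;
    for( int i = blockDim.x; i < n; i += blockDim.x )
    {
        float cur = pre;
        pre = in[i + tid];          // issue the next load in the middle of the compute section,
                                    // so its latency hides behind the math below
        acc += cur * cur;
    }
    acc += pre * pre;               // consume the final prefetched value
    out[tid] = acc;
}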
Could someone try this kernel out on their GTX or Ultra?
__global__ void sgemmNT( const float *A, int lda, const float *B, int ldb, float* C, int ldc, int k, float alpha, float beta )
{
    int inx = threadIdx.x;
    int iny = threadIdx.y;
    int ibx = blockIdx.x * 32;
    int iby = blockIdx.y * 32;

    A += ibx + inx + __mul24( iny, lda );
    B += iby + inx + __mul24( iny, ldb );
    C += ibx + inx + __mul24( iby + iny, ldc );

    float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
    float preA0, preA1, preB0, preB1;

    // Fetch the first set of globals; if worried about a segfault, guard against the k < 4 case with an if statement.
    preA0 = A[0];
    preA1 = A[2*lda];
    preB0 = B[0];
    preB1 = B[2*ldb];
    for( int i = 0; i < k; i += 4 ) {
        __shared__ float ab[2][4][32];
        #define AS(x,y) ab[0][x][y]
        #define BS(x,y) ab[1][x][y]

        AS(iny,inx)   = preA0;
        AS(iny+2,inx) = preA1;
        BS(iny,inx)   = preB0;
        BS(iny+2,inx) = preB1;
        __syncthreads();

        /* Prefetch global memory for the next iteration. This generally helps performance,
         * but also seems to add more noise to the kernel's runtime. We assume it's OK to read
         * beyond the array boundaries; technically this could cause the equivalent of a segfault
         * on the GPU, in which case we could solve the problem by stopping the for loop at k-4
         * and then adding a last "fix-up" loop iteration without a prefetch. Adding an if statement
         * here would kill performance and thus is off the table. */
        preA0 = A[(i+4)*lda];
        preA1 = A[(i+6)*lda];
        preB0 = B[(i+4)*ldb];
        preB1 = B[(i+6)*ldb];
        #pragma unroll
        for( int j = 0; j < 4; j++ ) {
            float _a = AS(j,inx);
            float *_b = &BS(j,0) + iny;

            // Each thread accumulates 16 results, one every 2 columns: c[m] += _a * _b[2*m].
            c[0]  += _a*_b[0];
            c[1]  += _a*_b[2];
            c[2]  += _a*_b[4];
            c[3]  += _a*_b[6];
            c[4]  += _a*_b[8];
            c[5]  += _a*_b[10];
            c[6]  += _a*_b[12];
            c[7]  += _a*_b[14];
            c[8]  += _a*_b[16];
            c[9]  += _a*_b[18];
            c[10] += _a*_b[20];
            c[11] += _a*_b[22];
            c[12] += _a*_b[24];
            c[13] += _a*_b[26];
            c[14] += _a*_b[28];
            c[15] += _a*_b[30];
        }
        __syncthreads();
    }
    // Write back: each thread stores its 16 results, one every 2 columns of C.
    for( int i = 0; i < 16; i++, C += 2*ldc )
        C[0] = alpha * c[i] + beta * C[0];
}
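In case anyone wants to drop it in quickly, this is how I’m launching it. The 32x2 block shape is forced by the indexing above; the grid dimensions and the assumption that m and n are multiples of 32 (and k a multiple of 4) are just my test setup, so adjust as needed:

// Assumed launch configuration (my setup): 64 threads per block, one 32x32 tile of C per block.
dim3 grid( m/32, n/32 ), threads( 32, 2 );
sgemmNT<<<grid, threads>>>( A, lda, B, ldb, C, ldc, k, alpha, beta );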