Hello,

I wrote a simple matmul kernel for float and int and benchmarked it on a Quadro K1200 with matrix sizes 1024x1024x1024, 16384x128x128, and 128x16384x128 ( NxKxM : C[NxM] = A[NxK] * B[KxM] ). Here's the kernel:

```
__global__ void matmul(int N, int K, int M, Type *C, Type *A, Type *B)
{
    // One row of C per blockIdx.x; BLK_SIZE columns of C per blockIdx.y.
    int i_idx = blockIdx.x;
    int j_idx = blockIdx.y * BLK_SIZE + threadIdx.x;
    if (i_idx >= N || j_idx >= M)
        return;

    Type temp = C[i_idx * M + j_idx];
    const Type *A_ptr = A + i_idx * K;  // row i of A
    const Type *B_ptr = B + j_idx;      // column j of B

    for (int k = 0; k < K; k++)
    {
        temp += A_ptr[k] * B_ptr[0];
        B_ptr += M;                     // step down column j of B
    }
    C[i_idx * M + j_idx] = temp;
}
```

I found that the integer performance lagged slightly behind the float performance ( 1%-6% more time ). After compiling with -lineinfo, nvvp's source-level view showed that the integer and float kernels used XMAD and FFMA instructions respectively. The nvvp analysis also showed that the kernel spends most of its time memory bound, which suggests that optimizing the kernel's memory accesses would only widen the int/float performance gap.

One way to explain this would be if the GPU had more FP units than integer units, and/or if integer multiplies were somehow emulated through other instructions. But K1200's datasheet[1] doesn't mention these details.

I would appreciate any help or hints in understanding the integer kernel's performance drop. Also, please point me to literature that would help me understand and act on nvvp's analysis.

Thanks