int32 vs. float32 performance difference and analysis advice

Hello,

I wrote a simple matmul kernel for float and int and benchmarked it on a Quadro K1200 with matrix sizes 1024x1024x1024, 16384x128x128, and 128x16384x128 ( NxKxM : C[NxM] = A[NxK] * B[KxM] ). Here's the kernel:

__global__ void matmul(int N, int K, int M, Type *C, Type *A, Type *B)
{
	int i_idx = blockIdx.x,                        /* row of C    */
	    j_idx = blockIdx.y*BLK_SIZE + threadIdx.x; /* column of C */

	if( i_idx >= N || j_idx >= M )
		return;

	int k;
	Type temp = C[i_idx*M+j_idx];

	Type *A_ptr = A + i_idx*K + 0,  /* start of row i of A    */
	     *B_ptr = B +   0  *M + j_idx;  /* top of column j of B */

	for( k=0 ; k<K ; k++ )
	{
		temp += A_ptr[ k ] * B_ptr[ 0 ];
		B_ptr += M;  /* step down column j */
	}

	C[i_idx*M+j_idx] = temp;
}

I found that the integer kernel lagged slightly behind the float one ( 1%-6% more time ). After compiling with -lineinfo, nvvp showed that the kernels used XMAD and FFMA instructions respectively. The nvvp analysis also showed that the kernel spent most of its time in memory-bound operations, which suggests that optimizing the kernel further would only widen the int - float performance gap.

One way to explain this would be if the GPU had more FPUs than integer units, and/or if integer operations were somehow emulated through floats. But the K1200's datasheet[1] doesn't mention these details.

I would like some help and hints on understanding the performance drop of the integer kernel. Also, please point me to literature that would help me understand and act upon nvvp's analysis.

Thanks

[1] https://images.nvidia.com/content/quadro/product-literature/data-sheets/11306_DS_NV_Quadro_K1200_FEB15_NV_US_HR.pdf

XMAD is not a 32-bit integer multiply-add. XMAD is a 16-bit integer multiply followed by a 32-bit add to the full-width product. A 32-bit integer multiply-add operation therefore requires several XMAD instructions, making it more expensive than a single-precision floating-point multiply-add, which maps to a single FFMA instruction. You should be able to see the different dynamic instruction counts for the two versions in the profiler.

The programming guide lists instruction throughputs for various architectures:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Get your compute capability from deviceQuery, then use it to pick the matching column from the table referenced above.

The instruction set reference may also be useful:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref

The instructions are not well documented, but for what you are trying to do it may not matter. Further understanding can be gained by reading the PTX manual. You may also want to familiarize yourself with what is available at docs.nvidia.com.

Kepler should have a native 32-bit integer multiply-add instruction, which would be IMAD rather than XMAD. As you can see, that has relatively low throughput (32 vs. 192 operations per clock per SM) compared to the FFMA that would be used in a float matrix multiply. It may also be that the compiler forgoes it in favor of XMAD. Or maybe njuffa can explain.

Are you compiling with an -arch switch that matches your GPU's architecture (compute capability)?