int32 vs. float32 performance difference and analysis advice

Hello,

I wrote a simple matmul kernel for float and int and benchmarked it on a Quadro K1200 with matrix sizes 1024x1024x1024, 16384x128x128, and 128x16384x128 ( NxKxM : C[NxM] = A[NxK] * B[KxM] ). Here's the kernel:

__global__ void matmul(int N, int K, int M, Type *C, Type *A, Type *B)
{
	int i_idx = blockIdx.x,                        /* row of C    */
	    j_idx = blockIdx.y*BLK_SIZE + threadIdx.x; /* column of C */

	if( i_idx >= N || j_idx >= M )
		return;

	int k;
	Type temp = C[i_idx*M+j_idx];

	Type *A_ptr = A + i_idx*K + 0,  /* start of row i of A    */
	     *B_ptr = B +   0  *M + j_idx;  /* top of column j of B */

	for( k=0 ; k<K ; k++ )
	{
		temp += A_ptr[ k ] * B_ptr[ 0 ];
		B_ptr += M;  /* step down column j */
	}

	C[i_idx*M+j_idx] = temp;
}

I found that the integer kernel lagged slightly behind the float one ( 1%-6% more time ). After compiling with -lineinfo, nvvp showed that the kernels used XMAD and FFMA instructions respectively. The nvvp analysis also showed that the kernel spent most of its time in memory-bound operations, which suggests that optimizing the kernel further would only widen the int - float performance gap.

One way to explain this would be if the GPU had more FPUs than integer units, and/or if integer operations were somehow emulated through floats. But the K1200's datasheet[1] doesn't mention these details.

I would like some help and hints on understanding the performance drop of the integer kernel. Also, please point me to literature that would help me understand and act upon nvvp's analysis.

Thanks

[1] https://images.nvidia.com/content/quadro/product-literature/data-sheets/11306_DS_NV_Quadro_K1200_FEB15_NV_US_HR.pdf

XMAD is not a 32-bit integer multiply-add. XMAD is a 16-bit integer multiply followed by a 32-bit add to the full-width product. A 32-bit integer multiply-add operation therefore requires several XMAD instructions, making it more expensive than a single-precision floating-point multiply-add, which maps to a single FFMA instruction. You should be able to see the different dynamic instruction counts for the two versions in the profiler.

The programming guide lists instruction throughputs for various architectures:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Get your compute capability from deviceQuery, then use it to pick the matching column from the table referenced above.

The instruction set reference may also be useful:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref

The instructions are not well documented, but for what you are trying to do it may not matter. Further understanding can be gained by reading the PTX manual. You may also want to familiarize yourself with what is available at docs.nvidia.com.

Kepler should have a native 32-bit integer multiply-add instruction, which would be IMAD rather than XMAD. As you can see, that has relatively low throughput (32 vs. 192 operations per clock per SM) compared to the FFMA that would be used in a float matrix multiply. It may also be that the compiler forgoes it in favor of XMAD. Or maybe njuffa can explain.

Are you compiling with an -arch switch that matches your GPU's architecture (compute capability)?