CUDA Math API :: CUDA Toolkit Documentation (nvidia.com)
PTX ISA :: CUDA Toolkit Documentation (nvidia.com)
It gives me:
No vadd4 instruction.
These SIMD instructions were introduced with the sm_30 architecture. Most SIMD instructions were removed for architectures >= sm_50, with efficient emulation code supplied to achieve backward compatibility. Note that on sm_3x, the native SIMD instructions have only 1/4 the throughput of regular integer instructions.
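For illustration, here is a minimal sketch (untested) of using the __vadd4() intrinsic for byte-wise addition; the kernel name and sizes are placeholders. On sm_3x this maps to the native instruction, on later architectures the compiler emits the emulation sequence:

```
__global__ void bytewise_add(const unsigned int *a, const unsigned int *b,
                             unsigned int *c, int n_words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words) {
        // four independent 8-bit additions on bytes packed into a 32-bit word
        c[i] = __vadd4(a[i], b[i]);
    }
}
```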
With a register size of 32 bits on the GPU, SIMD floating-point instructions only make sense with a floating-point type smaller than 32 bits. No such floating-point instructions were supported for sm_30, sm_35, and sm_37.
Support for 16-bit floating-point operations was added in later architectures, along with support for the half2 type, a 2-vector of floating-point operands. See the relevant intrinsics for this type in the documentation:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html
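For example, here is a minimal sketch (untested) of element-wise half2 addition using the __hadd2() intrinsic from the page linked above; half-precision arithmetic requires sm_53 or later:

```
#include <cuda_fp16.h>

__global__ void half2_add(const __half2 *a, const __half2 *b,
                          __half2 *c, int n_pairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        // two fp16 additions on a pair packed into one 32-bit register
        c[i] = __hadd2(a[i], b[i]);
    }
}
```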
There is no need for wide explicit SIMD in CUDA since data parallelism wider than a register is implicit in the programming model. This is a much superior programming model. I am saying this as someone who participated in the design of a SIMD architecture (AMD’s 3DNow!) and programmed with explicit SIMD for a number of years.
The point of the limited number of SIMD intrinsics in CUDA (and the corresponding SIMD instructions in the GPU hardware) is to exploit SIMD for sub-register operand sizes. As the experience with the SIMD “video” instructions shows, this can be of questionable value.
No, you should spread the vectorized operation across the threads of a warp or block.
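For example, the canonical element-wise addition, where each thread handles one element and the warp provides the SIMD-style parallelism implicitly (a minimal sketch; names are placeholders):

```
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // one element per thread; a warp executes 32 of these at once
        c[i] = a[i] + b[i];
    }
}
```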
You might want to take a look at the matrix multiplication code in the programming guide, as a kind of “simple” example of how “vectorization” is typically done in CUDA.
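For reference, here is a condensed sketch in the spirit of that example (not the exact code from the guide); for brevity it assumes the matrix dimension n is a multiple of the tile size:

```
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // each thread loads one element of the A tile and one of the B tile
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Note that the data parallelism lives entirely in the thread grid; no explicit SIMD intrinsics are involved.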