How to use SIMD Video Instructions and why is there no 32/64 bit float version

CUDA Math API :: CUDA Toolkit Documentation (
PTX ISA :: CUDA Toolkit Documentation (


It gives me:

No vadd4 instruction.

These SIMD instructions were introduced with the sm_30 architecture. Most SIMD instructions were removed for architectures >= sm_50, with efficient emulation code supplied to achieve backward compatibility. Note that on sm_3x, the native SIMD instructions have only 1/4 the throughput of regular integer instructions.
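As a minimal sketch of how the SIMD intrinsics are typically used from CUDA C++, the kernel below calls `__vadd4()` from the CUDA Math API to add four packed unsigned bytes per 32-bit word. The kernel name `add_packed_bytes` and the host helper `add_packed_bytes_host` are made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Each thread adds four packed 8-bit lanes with a single intrinsic call.
__global__ void add_packed_bytes(const unsigned int *a, const unsigned int *b,
                                 unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Per-byte addition; each byte wraps around independently,
        // so no carry propagates into the neighboring byte.
        out[i] = __vadd4(a[i], b[i]);
    }
}

// Hypothetical host helper: runs the kernel on one pair of packed words.
unsigned int add_packed_bytes_host(unsigned int a, unsigned int b)
{
    unsigned int *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    unsigned int result = 0;

    cudaMalloc(&d_a, sizeof(unsigned int));
    cudaMalloc(&d_b, sizeof(unsigned int));
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMemcpy(d_a, &a, sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(unsigned int), cudaMemcpyHostToDevice);

    add_packed_bytes<<<1, 1>>>(d_a, d_b, d_out, 1);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}
```

Calling `add_packed_bytes_host(0x00FF0001u, 0x00010001u)` should illustrate the difference from a plain 32-bit add: the `0xFF + 0x01` lane wraps to `0x00` without carrying into the byte above it.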

With a register size of 32 bit on the GPU, SIMD floating-point instructions only make sense with a floating-point type smaller than 32 bits. No such floating-point instructions were supported for sm_30, sm_35, and sm_37.

Support for 16-bit floating-point operations was added in later architectures, along with support for the half2 type. This is a 2-vector of 16-bit floating-point operands packed into one 32-bit register. See the half-precision intrinsics for this type in the CUDA Math API documentation.
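A rough sketch of half2 arithmetic, assuming a build target that supports half-precision arithmetic (for example `nvcc -arch=sm_70`): the kernel converts two `float2` values to `__half2`, adds both lanes with `__hadd2()`, and converts back. The names `add_half2` and `add_half2_host` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Each thread adds two 16-bit floating-point lanes with one __hadd2 call.
__global__ void add_half2(const float2 *a, const float2 *b, float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half2 ha = __float22half2_rn(a[i]);  // round-to-nearest conversion
        __half2 hb = __float22half2_rn(b[i]);
        __half2 sum = __hadd2(ha, hb);         // lane-wise half-precision add
        out[i] = __half22float2(sum);
    }
}

// Hypothetical host helper: runs the kernel on one pair of float2 values.
float2 add_half2_host(float2 a, float2 b)
{
    float2 *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    float2 result = make_float2(0.0f, 0.0f);

    cudaMalloc(&d_a, sizeof(float2));
    cudaMalloc(&d_b, sizeof(float2));
    cudaMalloc(&d_out, sizeof(float2));
    cudaMemcpy(d_a, &a, sizeof(float2), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(float2), cudaMemcpyHostToDevice);

    add_half2<<<1, 1>>>(d_a, d_b, d_out, 1);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(float2), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}
```

In real code the data would normally be stored as `__half2` in memory rather than converted per element; the conversions here only keep the example easy to check from the host.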


There is no need for wide explicit SIMD in CUDA since data parallelism wider than a register is implicit in the programming model. This is a much superior programming model. I am saying this as someone who participated in the design of a SIMD architecture (AMD’s 3DNow!) and programmed with explicit SIMD for a number of years.

The point of a limited number of SIMD intrinsics in CUDA (and corresponding SIMD instructions in the GPU hardware) is to exploit SIMD for sub-register operand sizes. As the experience with the SIMD “video” instructions shows, this can be of questionable value.


No, you should spread the vectorized operation across threads in a warp or block.
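A minimal sketch of what that looks like for 32-bit floats: instead of one instruction operating on a wide vector, each thread handles one element, and the warp as a whole plays the role of the SIMD unit. The names `add_floats` and `add_floats_host` and the block size of 256 are assumptions for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// One thread per element; the "vector width" is the number of threads.
__global__ void add_floats(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: copies inputs, launches the kernel, copies back.
void add_floats_host(const float *a, const float *b, float *out, int n)
{
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    add_floats<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
}
```

Adjacent threads read adjacent elements, so the loads and stores of a warp coalesce into wide memory transactions without any explicit vector types.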

You might want to take a look at the matrix multiplication code in the programming guide, as a kind of “simple” example of how “vectorization” is typically done in CUDA.
