CUDA Math API :: CUDA Toolkit Documentation (nvidia.com)
PTX ISA :: CUDA Toolkit Documentation (nvidia.com)
It gives me:
No vadd4 instruction.
These SIMD instructions were introduced with the sm_30 architecture. Most SIMD instructions were removed for architectures >= sm_50, with efficient emulation code supplied to achieve backward compatibility. Note that on sm_3x, the native SIMD instructions have only 1/4 the throughput of regular integer instructions.
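For illustration, here is a minimal sketch (untested) of using the __vadd4() intrinsic for byte-wise addition; the kernel name and sizes are placeholders. On sm_3x this maps to the native instruction, on later architectures the compiler emits the emulation sequence:

```
__global__ void bytewise_add(const unsigned int *a, const unsigned int *b,
                             unsigned int *c, int n_words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words) {
        // four independent 8-bit additions on bytes packed into a 32-bit word
        c[i] = __vadd4(a[i], b[i]);
    }
}
```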
With a register size of 32 bits on the GPU, SIMD floating-point instructions only make sense with a floating-point type smaller than 32 bits. No such floating-point instructions were supported for sm_30, sm_35, and sm_37.
Support for 16-bit floating-point operations was added in later architectures, along with support for the half2 type, a 2-vector of floating-point operands. See the relevant intrinsics for this type in the documentation:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html
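For example, here is a minimal sketch (untested) of element-wise half2 addition using the __hadd2() intrinsic from the page linked above; half-precision arithmetic requires sm_53 or later:

```
#include <cuda_fp16.h>

__global__ void half2_add(const __half2 *a, const __half2 *b,
                          __half2 *c, int n_pairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        // two fp16 additions on a pair packed into one 32-bit register
        c[i] = __hadd2(a[i], b[i]);
    }
}
```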
There is no need for wide explicit SIMD in CUDA since data parallelism wider than a register is implicit in the programming model. This is a much superior programming model. I am saying this as someone who participated in the design of a SIMD architecture (AMD’s 3DNow!) and programmed with explicit SIMD for a number of years.
The point of the limited number of SIMD intrinsics in CUDA (and the corresponding SIMD instructions in the GPU hardware) is to exploit SIMD for sub-register operand sizes. As the experience with the SIMD “video” instructions shows, this can be of questionable value.
No, you should spread the vectorized operation across the threads of a warp or block.
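For example, the canonical element-wise addition, where each thread handles one element and the warp provides the SIMD-style parallelism implicitly (a minimal sketch; names are placeholders):

```
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // one element per thread; a warp executes 32 of these at once
        c[i] = a[i] + b[i];
    }
}
```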
You might want to take a look at the matrix multiplication code in the programming guide, as a kind of “simple” example of how “vectorization” is typically done in CUDA.
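For reference, here is a condensed sketch in the spirit of that example (not the exact code from the guide); for brevity it assumes the matrix dimension n is a multiple of the tile size:

```
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // each thread loads one element of the A tile and one of the B tile
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Note that the data parallelism lives entirely in the thread grid; no explicit SIMD intrinsics are involved.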