Future support/extension of CUDA SIMD intrinsics

I am curious what the future plans may be regarding SIMD intrinsic support. From my understanding, some of the video-processing intrinsics are now emulated in software, for example, but 8/16-bit SIMD lane support is probably here to stay for a while because of graphics and machine learning applications?

Across CPU/GPU it would be nice from a software perspective to have orthogonal 128-bit SIMD support with 64/32/16/8-bit lane configurations for integers and 64/32-bit for floats (well, up to 32-bit lanes would be great for now). Obviously even on the CPU side these things aren’t fully orthogonal due to practical matters.

I am not sure how far SIMD within SIMT makes sense. I believe the other hardware vendor has had, or still has, a more SIMD-oriented design in this area, but I am not sure of the realistic pros/cons here for the future. I can imagine it could make sense to go no further than 128-bit SIMD on the GPU, given typical data workloads/layouts, the fact that it sits within SIMT anyway, and the existing 128-bit loads/stores.

Also, it seems cross-compiling 128-bit SIMD CPU code to the GPU could potentially help ILP on the GPU to a degree anyway, so it could be a sensible default?

Are we likely to get 32-bit lane SIMD intrinsics with hardware support in CUDA in the future, or is this a dead end for some reason?

I don’t know what NVIDIA’s plans are. Historically NVIDIA’s policy has been to never discuss technical details of future hardware (note the tendency to even avoid discussions on the technical details of shipping parts!). So generally speaking, people outside NVIDIA are unlikely to know and those inside NVIDIA are not free to comment.

Classical explicit SIMD causes significant software engineering challenges when the SIMD width changes, as Intel’s approach amply demonstrates. I would claim the trend is towards implicit SIMD, which is basically the CUDA SIMT approach, but can also be implemented on top of classical SIMD architectures; see Intel’s SPMD program compiler project (ispc), https://ispc.github.io/. Implicit SIMD provides high performance in a way that is flexible and easy for programmers to use, leading to productivity gains.

That said, even though SIMT makes classical SIMD look redundant, at present explicit SIMD can still provide a performance benefit for data that requires less than a full register of storage. With the 32-bit register width on GPUs, this means 8-bit and 16-bit integer types, as well as the 16-bit floating-point type. Offering wider explicit SIMD operations seems fully redundant with SIMT and thus nonsensical.

However, the question then needs to be asked how many applications can benefit from such explicit sub-word SIMD operations. In the case of integer types, that is predominantly image processing and gene processing in the biosciences; in the case of half-precision floating-point, that is mostly deep learning. Providing such instructions has associated opportunity costs, i.e. transistors spent on supporting this special hardware are not available for other features.

The fact that NVIDIA removed hardware support for most of the SIMD video instructions post-Kepler, and did not re-introduce them in Pascal, suggests that the opportunity cost was too high. One should also note that the SIMD video instructions in Kepler were quarter throughput only, which means that emulation using existing 32-bit integer instructions can often provide similar performance benefits, in particular once required data movement in relevant applications is also taken into account.
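To make the emulation point concrete, here is a minimal sketch of a per-byte wrapping add built purely from ordinary 32-bit integer operations. It is written as plain C++ so it can be checked on the host; the function name `add_u8x4` is hypothetical and chosen for illustration. It computes the same kind of result as the CUDA intrinsic `__vadd4`, but it is not a statement about how NVIDIA's own emulation is implemented.

```cpp
#include <cstdint>

// Hypothetical helper: adds four packed unsigned 8-bit lanes with
// wrap-around, using only 32-bit AND/ADD/XOR.
constexpr std::uint32_t add_u8x4(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t low7 = 0x7f7f7f7fu;  // low 7 bits of every byte

    // Add the low 7 bits of each lane. Bit 7 of every lane is zero in
    // both operands, so a carry out of bit 6 lands in bit 7 of the same
    // lane and cannot spill into the neighbouring lane.
    const std::uint32_t partial = (a & low7) + (b & low7);

    // Fold in the top bit of each lane with XOR, which adds it without
    // producing a carry into the next lane.
    return partial ^ ((a ^ b) & ~low7);
}
```

That is three logical operations and one add for four lanes, which is why this kind of emulation can be competitive with a quarter-throughput hardware instruction. In device code the same body could be marked `__device__` unchanged.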

The case for FP16x2 support may be stronger, considering the still growing importance of deep learning. But as some deep-learning applications shift from GPUs to ASICs in the near future, the opportunity cost may ultimately prove too high there, too.

Thank you for the informative reply. I’m currently working on optionally compiling custom 128-bit CPU SIMD code across to the GPU. So, from what you said, I guess it still makes sense to use the Kepler intrinsics for now where possible, but expect to replace them one day if need be, as it shouldn’t generally ‘hurt’ right now.
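For readers wondering what such a cross-compilation layer might look like, here is a rough sketch under my own assumptions; it is not the poster's actual code. The type `Float4` and the functions `add` and `mul` are hypothetical names. The idea is that a 128-bit vector with four 32-bit lanes becomes four independent scalar operations per thread, which is where any ILP benefit on the GPU would come from.

```cpp
// Hypothetical 128-bit vector type: four 32-bit float lanes.
struct Float4 {
    float lane[4];
};

// Lane-wise add. The four additions have no dependencies on each other,
// so a compiler is free to schedule them back to back.
inline Float4 add(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

// Lane-wise multiply, same structure as add.
inline Float4 mul(const Float4& a, const Float4& b)
{
    Float4 r{};
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}
```

On the CPU side the same interface could be backed by SSE intrinsics instead of the scalar loop, while a GPU build would keep the scalar form and mark the functions `__host__ __device__`.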

One other forward-looking concern: as far as I know, aliasing/casting between floats and ints doesn’t really cause a performance hit on the GPU, but it can on the CPU in SSE registers. My understanding is that the 32-bit float/int registers on the GPU are effectively the same registers, despite there being different int/float ALUs.

So if I have code that masks floats using bitwise operations in a SIMD fashion, is that likely to cause problems on future GPUs for any foreseeable reason?

Correct, there is only one kind of register used on GPUs, and it can hold a ‘uint32_t’ or a ‘float’, and a pair of them (aligned to an even register number) can hold a ‘double’.

Therefore re-interpreting an ‘int32_t’ or ‘uint32_t’ into ‘float’ and vice versa has no cost associated with it, which makes diverse bit manipulations on floating-point data easy and convenient. This is exploited quite extensively in the CUDA standard math library, for example. Given that this basic architectural feature has been around for over a decade, I think it is highly unlikely to change.
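As an illustration of the kind of bit manipulation meant here, the following is a small sketch in plain C++, assuming `float` is IEEE-754 binary32. The helper names (`float_bits`, `bits_float`, `masked_select`, `abs_via_mask`) are hypothetical. On the host, `memcpy` is the portable way to re-interpret the bits; in CUDA device code the intrinsics `__float_as_uint` and `__uint_as_float` do the same job.

```cpp
#include <cstdint>
#include <cstring>

// Re-interpret the bits of a float as a 32-bit unsigned integer.
inline std::uint32_t float_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Re-interpret a 32-bit unsigned integer as a float.
inline float bits_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Branch-free select: picks bits of a where mask is 1, bits of b where
// mask is 0. With an all-ones or all-zeros mask this selects a or b.
inline float masked_select(std::uint32_t mask, float a, float b)
{
    return bits_float((float_bits(a) & mask) | (float_bits(b) & ~mask));
}

// Absolute value by clearing the sign bit (assumes IEEE-754 binary32).
inline float abs_via_mask(float x)
{
    return bits_float(float_bits(x) & 0x7fffffffu);
}
```

Since the re-interpretation itself has no cost on the GPU, as described above, a select like this would come down to just the logical operations.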

Thanks again, it’s good to confirm my understanding of that. And now I see CUDA 8 is released so I can move forward using it finally! :-)