Future support/extension of CUDA SIMD intrinsics

cybernoid · September 29, 2016, 5:30am

I am curious what the future plans may be regarding SIMD intrinsic support. From my understanding, some of the specific video processing intrinsics are now emulated in software, for example, but 8/16-bit SIMD lane support is probably here to stay for a while because of graphics and machine learning applications?

Across CPU/GPU it would be nice from a software perspective to have orthogonal 128-bit SIMD support with 64/32/16/8-bit lane configurations for integers and 64/32-bit for floats (well, up to 32-bit lanes would be great for now). Obviously even on the CPU side these things aren’t fully orthogonal due to practical matters.

I am not sure how far SIMD within SIMT makes sense. I believe the other hardware vendor has had, or still has, a more SIMD-oriented design in this area, and I am not sure of the realistic pros/cons here for the future. I can imagine it could make sense to go no further than 128-bit SIMD on the GPU, given typical data workloads/layouts, the fact that it sits within SIMT anyway, and the existing 128-bit loads/stores.

Also it seems cross compiling 128bit SIMD CPU code to the GPU could potentially help ILP on the GPU anyway to a degree - so it could be a sensible default?

Are we likely to get 32bit lane SIMD intrinsics with hardware support in the future from CUDA or is this a dead end for some reason?

njuffa · September 29, 2016, 7:03am

I don’t know what NVIDIA’s plans are. Historically NVIDIA’s policy has been to never discuss technical details of future hardware (note the tendency to even avoid discussions on the technical details of shipping parts!). So generally speaking, people outside NVIDIA are unlikely to know and those inside NVIDIA are not free to comment.

Classical explicit SIMD causes significant software engineering challenges when the SIMD width changes, as Intel’s approach amply demonstrates. I would claim the trend is towards implicit SIMD, which is basically the CUDA SIMT approach, but can also be implemented on top of classical SIMD architectures; see Intel’s SPMD compiler project, ispc (https://ispc.github.io/). Implicit SIMD provides high performance in a way that is flexible and easy for programmers to use, leading to productivity gains.

That said, even though SIMT makes classical SIMD look redundant, at present explicit SIMD can still provide a performance benefit for data that requires less than a full register of storage. With the 32-bit register width on GPUs, this means 8-bit and 16-bit integer types, as well as the 16-bit floating-point type. Offering wider explicit SIMD operations seems fully redundant with SIMT and thus nonsensical.

However, the question then needs to be asked how many applications can benefit from such explicit sub-word SIMD operations. In the case of integer types, that is predominantly image processing and gene processing in the biosciences; in the case of half-precision floating-point, that is mostly deep learning. Providing such instructions has associated opportunity costs, i.e. transistors spent on supporting this special hardware are not available for other features.

The fact that NVIDIA removed hardware support for most of the SIMD video instructions after Kepler, and did not reintroduce them in Pascal, suggests that the opportunity cost was too high. One should also note that the SIMD video instructions in Kepler were quarter throughput only, which means that emulation using existing 32-bit integer instructions can often provide similar performance benefits, in particular once required data movement in relevant applications is also taken into account.
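To illustrate what emulation with ordinary 32-bit integer instructions can look like, here is a minimal sketch of a per-byte wrapping add on four packed 8-bit lanes, written as portable C++ so it can be tried on a host compiler. The function name `add_u8x4` is made up for this example, and this is the classic "SIMD within a register" trick, not necessarily the sequence the CUDA toolchain emits for the corresponding intrinsic.

```cpp
#include <cstdint>

// Hypothetical helper (name chosen for this sketch): adds four unsigned
// 8-bit lanes packed into one 32-bit word, with wrap-around per lane.
inline std::uint32_t add_u8x4(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t kLow7 = 0x7f7f7f7fu;  // low 7 bits of every byte
    const std::uint32_t kMsb  = 0x80808080u;  // top bit of every byte

    // Add the low 7 bits of each lane. The carry out of bit 6 lands in
    // bit 7 of the same lane, so nothing spills into the neighbouring lane.
    const std::uint32_t low = (a & kLow7) + (b & kLow7);

    // Fold in the top bits with XOR: sum bit 7 = a7 ^ b7 ^ carry_in, and
    // the carry out of bit 7 is simply dropped (wrap-around).
    return low ^ ((a ^ b) & kMsb);
}
```

For example, `add_u8x4(0x01FF7F80u, 0x01018080u)` yields `0x0200FF00u`: the lanes 0xFF + 0x01 and 0x80 + 0x80 wrap to 0x00 without disturbing their neighbours. The whole operation costs a handful of 32-bit logic and add instructions.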

The case for FP16x2 support may be stronger, considering the still growing importance of deep learning. But as some deep-learning applications shift from GPUs to ASICs in the near future, the opportunity cost may ultimately prove too high there too.

cybernoid · September 29, 2016, 7:27am

Thank you for the informative reply. I’m currently working on optionally compiling custom 128-bit CPU SIMD code across to the GPU. So from what you said, I guess it still makes sense to use the Kepler intrinsics for now where possible, since they shouldn’t generally ‘hurt’ right now, but to expect to replace them one day if need be.

One other forward-looking concern: as far as I know, aliasing/casting between floats and ints doesn’t really cause a performance hit on the GPU, but it can on the CPU in SSE registers. My understanding is that the 32-bit float/int registers on the GPU are effectively the same registers, despite there being different int/float ALUs.

So if I have code that is masking floats using bitwise operations in a SIMD fashion, is that likely to cause problems on future GPUs for any foreseeable reason?

njuffa · September 29, 2016, 7:36am

Correct, there is only one kind of register used on GPUs, and it can hold a ‘uint32_t’ or a ‘float’, and a pair of them (aligned to an even register number) can hold a ‘double’.

Therefore re-interpreting an ‘int32_t’ or ‘uint32_t’ into ‘float’ and vice versa has no cost associated with it, which makes diverse bit manipulations on floating-point data easy and convenient. This is exploited quite extensively in the CUDA standard math library, for example. Given that this basic architectural feature has been around for over a decade, I think it is highly unlikely to change.
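As a small sketch of the kind of bit manipulation this makes convenient, the following portable C++ reinterprets a float as a 32-bit integer, masks it, and reinterprets it back. The helper names are made up for this example. In CUDA device code the intrinsics `__float_as_uint()` and `__uint_as_float()` play the role of the two `memcpy`-based helpers, and the code assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Host-side stand-ins for CUDA's __float_as_uint / __uint_as_float.
// memcpy is the well-defined way to reinterpret bits in standard C++.
inline std::uint32_t float_as_uint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float uint_as_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Hypothetical example 1: absolute value by clearing the sign bit.
inline float abs_via_mask(float x) {
    return uint_as_float(float_as_uint(x) & 0x7fffffffu);
}

// Hypothetical example 2: branch-free select. Where a mask bit is 1 the
// bit comes from a, where it is 0 the bit comes from b. With an all-ones
// or all-zeros mask this picks a or b as a whole.
inline float select_by_mask(std::uint32_t mask, float a, float b) {
    return uint_as_float((float_as_uint(a) & mask) |
                         (float_as_uint(b) & ~mask));
}
```

On the GPU the reinterpretation steps compile to nothing, since the value never leaves its register; only the AND/OR operations remain.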

cybernoid · September 29, 2016, 12:29pm

Thanks again, it’s good to confirm my understanding of that. And now I see CUDA 8 is released so I can move forward using it finally! :-)
