A question about calculation of integer (or short integer) and float data

As we know, CUDA supports integer data types like “short”. For large-scale arithmetic, compared with the “float” data type, is there any intrinsic speedup mechanism from using “short” on the GPU? Or is there any configurable setting that makes “integer/short” calculations perform better than “float”?

Potential benefits:
- Reduced amount of bytes to be loaded from global memory.
- Fewer registers.
- Special Function Units (SFUs) in the SMs that operate at this reduced precision.

Thanks for your reply, it helps me a lot.
Regarding the second item, does it mean a standard 32-bit register can hold two “short” values but only one “float” value?

Two “short” values can be stored in one 32-bit register if you use the “short2” or “ushort2” type. There are no arithmetic instructions that operate on data of “(unsigned) short” type. Note that C and C++ semantics require that any data of an integer type narrower than “int” is widened to “int” during expression evaluation. Therefore narrower integer operations would be used only as a consequence of compiler optimization under the “as-if” rule (meaning the generated code behaves as-if it were following the abstract model defined by the language standards).

If your application is performing floating-point calculations and you need to reduce storage requirements for source data, you may want to look into the use of textures whose elements are 16-bit floating-point numbers, i.e. basically “half” data type. A worked example can be found here:

https://devtalk.nvidia.com/default/topic/547080/cuda-programming-and-performance/-half-datatype-ieee-754-conformance/post/3831088/#3831088

Thx, njuffa, I will try this~

Well, there are exceptions to the requirement of widening narrower types (such as the components of a short2) to “int” for arithmetic.

Let’s say you have 4 unsigned chars packed into a uint32_t, and you know each of them is less than 0xff; then you can safely add 0x01010101 to this uint32_t to increment all 4 unsigned chars by one.

Alternatively superhippo may look into the SIMD within a word functions that use the “video instructions” that are present in Kepler GPUs (but not in Maxwell unfortunately). I think these handle over/underflow in a predictable way. It’s a download you can get from the registered developers site.

Excellent point about the ability to process short2 in SIMD fashion, although I would consider that orthogonal to the prescribed C/C++ behavior for scalar expressions that I mentioned. The SIMD-within-a-word functions are now available as device intrinsics in CUDA 6.0. The CUDA 6.0 release candidate is available for download by the general public at this time:

https://developer.nvidia.com/cuda-pre-production

Subword (byte or short) storage has an extra occasional benefit: you can arbitrarily swizzle (reorder, copy, or duplicate) the components in just one instruction using the __byte_perm() intrinsic.

As Mr. Juffa says, there are also the powerful SIMD video instructions in Kepler that work on integer subwords, but I no longer recommend them since they are emulated on all other architectures including the latest Maxwell.

Note that there are two different emulations that come into play. For architectures prior to Kepler, the emulation is done at source code level as there is no PTX support. I put a lot of work into making that emulation fast. I received feedback that at least for some byte-wise operations on Fermi, the use of the emulated SIMD-in-a-word functions is still faster than the solution previously employed.

Presumably this is partially due to the more efficient handling of memory accesses when SIMD-in-a-word processing is used, and partially due to avoiding the overhead of handling each byte lane separately using scalar code.

For Maxwell, some of the hardware support for SIMD-in-a-word operations that exists in Kepler has been removed, and since the SIMD instructions are exposed at PTX level, these instructions are now emulated at that level, i.e. during PTX to SASS translation. I think it was you who noticed at least one case of very suboptimal emulation.

I do not have a clear overview of the exact extent of the hardware changes and have not spent more than half a day with a Maxwell GPU. I would encourage customers who encounter very slow emulation of the SIMD instructions on Maxwell to file bugs.