A question about calculation of integer (or short integer) and float data

As we know, CUDA supports integer data types like “short”. For large-scale arithmetic, compared with the “float” data type, is there any intrinsic speedup mechanism from using “short” on the GPU? Or is there any configurable setting that makes “integer/short” calculations perform better than “float”?

Potential benefits:
- Reduced amount of bytes to be loaded from global memory.
- Fewer registers.
- Special Function Units (SFUs) in the SMs that operate at this reduced precision.

Thanks for your reply, it helps me a lot.
Regarding the second item, does it mean a standard 32-bit register can hold two “short” values but only one “float” value?

Two “short” values can be stored in one 32-bit register if you use the “short2” or “ushort2” type. There are no arithmetic instructions that operate on data of “(unsigned) short” type. Note that C and C++ semantics require that any data of an integer type narrower than “int” is widened to “int” during expression evaluation. Therefore narrower integer operations would be used only as a consequence of compiler optimization under the “as-if” rule (meaning the generated code behaves as-if it were following the abstract model defined by the language standards).

If your application is performing floating-point calculations and you need to reduce storage requirements for source data, you may want to look into the use of textures whose elements are 16-bit floating-point numbers, i.e. basically “half” data type. A worked example can be found here:

https://devtalk.nvidia.com/default/topic/547080/cuda-programming-and-performance/-half-datatype-ieee-754-conformance/post/3831088/#3831088

Thx, njuffa, I will try this~

Well, there are exceptions to the requirement of widening narrower types (such as the components of a short2) to “int” for arithmetic.

Let’s say you have 4 unsigned chars packed into a uint32_t, and you know each of them is less than 0xff; then you can safely add 0x01010101 to this uint32_t to increment all 4 unsigned chars by one.

Alternatively superhippo may look into the SIMD within a word functions that use the “video instructions” that are present in Kepler GPUs (but not in Maxwell unfortunately). I think these handle over/underflow in a predictable way. It’s a download you can get from the registered developers site.

Excellent point about the ability to process short2 in SIMD fashion, although I would consider that orthogonal to the prescribed C/C++ behavior for scalar expressions that I mentioned. The SIMD-within-a-word functions are now available as device intrinsics in CUDA 6.0. The CUDA 6.0 release candidate is available for download by the general public at this time:

https://developer.nvidia.com/cuda-pre-production

Subword (byte or short) storage has an extra occasional benefit: you can arbitrarily swizzle (reorder, copy, or duplicate) the components in just one instruction using the __byte_perm() intrinsic.

As Mr. Juffa says, there are also the powerful SIMD video instructions in Kepler that work on integer subwords, but I no longer recommend them since they are emulated on all other architectures including the latest Maxwell.

Note that there are two different emulations that come into play. For architectures prior to Kepler, the emulation is done at source code level as there is no PTX support. I put a lot of work into making that emulation fast. I received feedback that at least for some byte-wise operations on Fermi, the use of the emulated SIMD-in-a-word functions is still faster than the solution previously employed.

Presumably this is partially due to the more efficient handling of memory accesses when SIMD-in-a-word processing is used, and partially due to avoiding the overhead of handling each byte lane separately using scalar code.

For Maxwell, some of the hardware support for SIMD-in-a-word operations that exists in Kepler has been removed, and since the SIMD instructions are exposed at PTX level, these instructions are now emulated at that level, i.e. during PTX to SASS translation. I think it was you who noticed at least one case of very suboptimal emulation.

I do not have a clear overview of the exact extent of the hardware changes and have not spent more than half a day with a Maxwell GPU. I would encourage customers who encounter very slow emulation of the SIMD instructions on Maxwell to file bugs.