Does the use of 16-bit, __restrict__ const kernel arguments hurt performance?

This is surely the most noobish CUDA question ever so please bear with me.

I’ve been programming CUDA for about a year now. A few weeks ago I decided to pause development and read a handful of books on CUDA programming in order to get a handle on some stuff I’ve never quite understood. Three weeks on, I’m realizing that some of the stuff I’ve been doing may in fact be hurting performance. Take for example the following type of kernel you’d currently see in my code base:

__global__ void myKernel( const short * __restrict__ src , short * __restrict__ dst )
{
   dst[ threadIdx.x ] = src[ threadIdx.x ] + 42;
}

The problems with this kernel as I understand them are as follows:

  1. The use of shorts. From all I've read it seems that CUDA does not particularly "like" data types narrower than 32 bits, and nvcc will in this case generate additional instructions to convert short local variables to and from 32 bits, because the granularity of the register file is 32 bits. Question: is this still true of the latest CUDA architectures? I've come across some PTX instructions that work on 16-bit integers. Is there any "penalty" associated with such instructions?
  2. If I'm not mistaken, the combination of const with __restrict__ means memory reads can be serviced through the read-only cache, which has a granularity of 32 bytes, whereas an access serviced through both L1 and L2 uses a 128-byte memory transaction. My mistake here was to confuse __restrict__ with its C equivalent restrict; clearly they have different meanings. Question: is there a mechanism for informing nvcc that two pointers do not alias each other?
  3. I'm not certain about the use of a const pointer without __restrict__. Is there a penalty associated with the use of const? E.g. would the read be serviced through the constant cache?

“__restrict__” in CUDA is basically equivalent to C’s “restrict”. CUDA is a language in the C++ family, and ISO C++ has yet to standardize such a keyword, so it is a vendor extension, thus the two underscores. It is basically a promise by the programmer to the compiler that access to the data object is restricted to that particular pointer, i.e. there is no pointer aliasing. If the programmer violates that promise, functionally incorrect code can result.
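To make that promise concrete, here is a contrived sketch of mine (not from any particular code base) where it matters:

__global__ void runningSum( const float * __restrict__ src , float * __restrict__ dst , int n )
{
   // Because of __restrict__, the compiler may assume src and dst never
   // overlap, so it is free to keep values in registers and to hoist
   // loads ahead of stores. If a caller passes overlapping pointers
   // (e.g. dst == src), the promise is broken and the results are
   // undefined.
   float sum = 0.0f;
   for ( int i = threadIdx.x ; i < n ; i += blockDim.x )
   {
      sum += src[ i ];
      dst[ i ] = sum;
   }
}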

The kernel shown is so simple that the use of “__restrict__” is unlikely to have any impact on code generation, but in general the use of restricted pointers provides additional information to the compiler that allows it to optimize more aggressively, in particular with respect to the ordering of memory accesses.

On some GPU architectures the use of “const” with pointers in conjunction with “__restrict__” may make it easier for the compiler to identify opportunities for the use of loads through non-coherent caches, which may be beneficial to performance.

Depending on what publications you are reading, you may encounter incorrect descriptions that say that the use of “const __restrict__” pointers results in the use of loads through non-coherent caches. Such descriptions are in error. While there is a potential connection, as explained in the previous paragraph, there is no direct relationship.
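For completeness: on devices of compute capability 3.5 and up there is also an explicit mechanism, the __ldg() intrinsic, which requests a load through the read-only (non-coherent) cache directly instead of leaving the decision to the compiler. A minimal variant of the kernel from the question:

__global__ void myKernelLdg( const short * __restrict__ src , short * __restrict__ dst )
{
   // __ldg() routes this read through the read-only data cache,
   // regardless of whether the compiler would have chosen an LDG
   // instruction on its own.
   dst[ threadIdx.x ] = __ldg( &src[ threadIdx.x ] ) + 42;
}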

As a general rule of thumb, with C/C++ on any programming platform, integer data wants to be ‘int’, unless there is a darn good reason for it to be something else, such as an unsigned integer type or a narrower / wider type. In general, the use of 16-bit types may cause additional conversion instructions to be used. You can easily double check whether that applies to a particular case by disassembling the generated machine code. In the case of CUDA, you would use cuobjdump --dump-sass for this.
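As a concrete illustration (a toy example of mine; the architecture flag is arbitrary), one can compile and disassemble as follows, then look for extra conversion (CVT) instructions around the 16-bit arithmetic:

// shortmath.cu -- inspect the machine code with, e.g.:
//    nvcc -arch=sm_61 -cubin -o shortmath.cubin shortmath.cu
//    cuobjdump --dump-sass shortmath.cubin
__global__ void shortMath( const short * __restrict__ src , short * __restrict__ dst )
{
   // The multiply-add happens in 32-bit registers; any conversions to
   // and from 16 bits show up as separate instructions in the SASS.
   dst[ threadIdx.x ] = src[ threadIdx.x ] * 3 + 1;
}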

Modern CUDA supports some packed (also called vectorized) 16-bit types that are a good match for the 32-bit registers of the GPU.
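For example (a sketch of mine, assuming the data can be processed two elements per thread), a short2 load fills a full 32-bit register, and the __vadd2() SIMD intrinsic adds both halfwords at once:

__global__ void addPacked( const short2 * __restrict__ src , short2 * __restrict__ dst )
{
   short2 v = src[ threadIdx.x ];          // one 32-bit load
   unsigned int w;
   memcpy( &w , &v , sizeof( w ) );        // reinterpret as a 32-bit word
   w = __vadd2( w , 0x002A002AU );         // add 42 to both halfwords
   memcpy( &v , &w , sizeof( v ) );
   dst[ threadIdx.x ] = v;                 // one 32-bit store
}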

The specific example kernel is memory bound, and on the most recent GPU architectures (Pascal, Volta) I would not expect to see differences in memory throughput when using “short” vs “int” (in terms of GB/sec, not elements copied per second). It’s easy enough to try it both ways to see the impact on your particular GPU.
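A self-contained way to run that experiment, sketched with assumed buffer sizes and launch configuration, is to time a grid-stride copy with CUDA events and convert to GB/sec (swap short for int to get the other data point):

#include <cstdio>

__global__ void copyShort( const short * __restrict__ src , short * __restrict__ dst , size_t n )
{
   size_t stride = ( size_t ) gridDim.x * blockDim.x;
   for ( size_t i = ( size_t ) blockIdx.x * blockDim.x + threadIdx.x ; i < n ; i += stride )
      dst[ i ] = src[ i ];
}

int main( void )
{
   const size_t n = 1 << 26;   // 64M shorts = 128 MB per buffer
   short *d_src , *d_dst;
   cudaMalloc( &d_src , n * sizeof( short ) );
   cudaMalloc( &d_dst , n * sizeof( short ) );
   cudaEvent_t start , stop;
   cudaEventCreate( &start );
   cudaEventCreate( &stop );
   cudaEventRecord( start );
   copyShort<<< 1024 , 256 >>>( d_src , d_dst , n );
   cudaEventRecord( stop );
   cudaEventSynchronize( stop );
   float ms = 0.0f;
   cudaEventElapsedTime( &ms , start , stop );
   // bytes read + bytes written, divided by elapsed time
   printf( "%.1f GB/s\n" , 2.0 * n * sizeof( short ) / ( ms * 1.0e6 ) );
   cudaFree( d_src );
   cudaFree( d_dst );
   return 0;
}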

The only way typical data (e.g. not kernel parameters) will get loaded through the constant cache is if the data is in a static allocation marked with __constant__.

No other syntax or usages I know of will employ the constant cache (excepting the aforementioned usage for storage of kernel parameters themselves).

The expected usage of __constant__ data is via uniform access. Uniform access means that all (participating) threads in a warp are reading the same location, i.e. the same address.
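A minimal sketch of that pattern (identifiers are mine): every thread of the warp reads the identical coefficients[ k ], which is exactly the broadcast case the constant cache is built for:

__constant__ float coefficients[ 16 ];   // filled from the host via cudaMemcpyToSymbol()

__global__ void scaleBy( const float * __restrict__ src , float * __restrict__ dst , int k )
{
   // All (participating) threads in the warp read the same address,
   // so the access is uniform and is serviced by the constant cache.
   dst[ threadIdx.x ] = src[ threadIdx.x ] * coefficients[ k ];
}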

In my view, and as already pointed out in the excellent response by njuffa, I would encourage you to think of the use of const and __restrict__ as generally beneficial and “the right thing to do”. Giving additional (accurate) information to the compiler should always result in an equal or better situation compared with omitting such information, even if it is known a priori. Of course the world is not a perfect place, so it's possible that there might be a measurable degradation, but that should be considered a compiler defect, not a programming error.

In that view, the correct mechanism to inform nvcc that two pointers do not alias one another (or anything else) is to use __restrict__ decorations. This philosophy is consistent with C and C++ as well.

It should be possible to write kernels that work either on underlying int data or underlying short data and do a trivial copy operation like you are showing, with equivalent bandwidth/efficiency. Before I would actually make that statement about your code, I would like to see the exact int/short comparison case you have in mind. Your proposed code could only sensibly be used to transfer a single block’s worth of data, and such a tiny measurement will get lost in other overheads, even if there were a significant efficiency difference between the use of int and the use of short.

A straightforward use of short only loads (or stores) 64 bytes per warp, instead of 128 bytes per warp, so I would expect some additional overhead for the use of short vs. int when looking at things from the perspective of bytes/s: to load 128 bytes using one short per thread, two instructions would have to be scheduled warp-wide onto LD/ST units, whereas moving 128 bytes with int operations requires only one instruction warp-wide. Whether that is measurable or not, I can’t say without a test case. Any subtle difference there might get lost by the time the memory controllers are doing their work.
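If you want to set up that comparison, a pair of grid-stride copy kernels along the following lines (my sketch; launch configuration and sizes are placeholders) keeps the measurement away from launch-overhead noise:

template < typename T >
__global__ void copyKernel( const T * __restrict__ src , T * __restrict__ dst , size_t n )
{
   size_t stride = ( size_t ) gridDim.x * blockDim.x;
   for ( size_t i = ( size_t ) blockIdx.x * blockDim.x + threadIdx.x ; i < n ; i += stride )
      dst[ i ] = src[ i ];
}

// Compare equal byte counts, e.g. N ints vs. 2*N shorts, and compute
// GB/s from cudaEvent timings around each launch:
//    copyKernel<int>  <<< 1024 , 256 >>>( srcInt   , dstInt   , N     );
//    copyKernel<short><<< 1024 , 256 >>>( srcShort , dstShort , 2 * N );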

Thanks njuffa and txtbob for your answers.

The kernel in the question is nothing like my actual kernels, apart from the pointer types and qualifiers. And as you said, my example is so simple that __restrict__ has no impact on the PTX generated. The same however is true of the examples I found in the NVIDIA blog post on aliasing. Has nvcc gotten smarter?

I read about the granularity of the read-only cache in Professional CUDA C Programming by John Cheng, Max Grossman, and Ty McKercher:

Is this information still accurate/up to date?

Like all actively maintained compilers, the CUDA compiler undergoes continuous improvement. However, one would have to compare specific source code compiled with specific versions of CUDA to determine whether there has been improvement for that particular code. In general, at any given point in time, most improvements tend to be targeted at the two most recent GPU architectures (which makes economic sense).

In general, for code comparisons you would want to look at the generated machine code (SASS) rather than PTX. In the CUDA toolchain, PTX code (which serves as a virtual instruction set architecture AND a compiler intermediate format) is compiled to SASS by the ptxas component of the compiler, which is an optimizing compiler, contrary to what the name might suggest.

Since NVIDIA is overly secretive with specifics of the GPU microarchitectures, requiring programmers to scrape little snippets of information from available public documentation here and there and to augment that with the help of micro-benchmarks, I have given up on trying to keep track of GPU hardware characteristics. Other forum participants make more of an effort and are better qualified to answer questions about microarchitectural specifics.