How fast is a const array defined in a kernel?

This may be quite a basic question. Suppose I define a const array in a kernel function…

__global__ void my_kernel(int* pGlobal) {
    const int arr[5] = {1, 2, 3, 4, 5};
    /*
        each thread does something with arr
        then writes the result back to global memory
    */
}

In this example the array has size 5, but in general let’s just assume the array is not so large that it cannot fit in the L1 cache. I came across this post saying that if we define constants with the __constant__ specifier then they are available at register-level speed, so my question is: can the same be said for a const array defined like the one above?

  • Does the nvcc compiler optimize it such that this const array is available at register/shared-memory-level speed, or much worse than that? And just to make sure,
  • a const array defined this way will not be duplicated so that each thread has its own copy; instead, it will live somewhere that all threads in a block can access, correct? (I am 99% sure this is the case though…)

Thank you for taking your time answering this!

const is not __constant__. The ordinary C++ keyword const has the semantics “read only”. Its primary purpose is as a compile-time guard against writes to the data with this attribute. It does not imply a storage class.

What you are inquiring about is an implementation artifact. As such, neither the CUDA language specification nor the compiler manual will provide information on how arr will be stored. In fact, how it will be stored likely depends on the usage. The easiest way to find out what happens in the specific context of your code is to compile the source code and then disassemble the machine code with, for example, cuobjdump --dump-sass.

Thread-specific data, and that includes arr here, is stored in (thread-)local memory by default. In that case you should expect to see five STL instructions initializing this array in the disassembly, and an LDL for every access to it. As an optimization, and depending on heuristics and code context (e.g. re-use), thread-local data may be buffered in registers. In some cases, an array of compile-time literals like arr may even become a value in the immediate field of SASS instructions, meaning no data storage is assigned to it at all.
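For illustration, here is a minimal sketch (with hypothetical kernel names) of how usage can change where such an array ends up. Compiling it and inspecting the output of cuobjdump --dump-sass shows what the compiler actually did in each case:

__global__ void k_static_index(int* out)
{
    const int arr[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    #pragma unroll
    for (int i = 0; i < 5; ++i)  // all indices known after unrolling:
        sum += arr[i];           // arr may be folded into registers or
                                 // instruction immediates
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

__global__ void k_dynamic_index(const int* idx, int* out)
{
    const int arr[5] = {1, 2, 3, 4, 5};
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // runtime-dependent index: the compiler will typically keep arr in
    // thread-local memory, so expect STL for the initialization and LDL here
    out[tid] = arr[idx[tid] % 5];
}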

In terms of performance, this kind of thread-local const array is unlikely to provide the benefits you would get from a __constant__ array (except in the last case mentioned above).
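If the goal is to actually get constant-memory behavior, the array needs to be declared with __constant__ at file scope and initialized from the host. A minimal sketch (hypothetical names, not your actual code):

__constant__ int c_arr[5];   // lives in device memory, read through the constant cache

__global__ void my_kernel_constant(int* pGlobal)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int i = 0; i < 5; ++i)   // same index for every thread in the warp:
        sum += c_arr[i];          // the broadcast case the constant cache is built for
    pGlobal[tid] = sum;
}

// host side, before the launch:
//   int h_arr[5] = {1, 2, 3, 4, 5};
//   cudaMemcpyToSymbol(c_arr, h_arr, sizeof(h_arr));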

Aside:

Not everything written on the internet is true. I followed the link provided in that post and I don’t find that written in NVIDIA documentation.

__constant__ data is stored in GPU DRAM memory and is accessed through a per-SM constant cache resource. It’s true that the best access pattern is uniform (all threads in the warp reading the same location), but none of those statements implies “register-level speed”. A register is the highest-speed, lowest-latency memory resource available on a GPU. Pulling data from global memory through a constant cache should not be equated with that. I would suggest that even a cache hit should not be equated with register-level speed. Meaning, I’m not aware of any published data or specs that confirm that equality.
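To make the access-pattern point concrete, here is a small sketch (hypothetical kernel) contrasting a warp-uniform read with a divergent read of __constant__ data:

__constant__ float c_table[32];

__global__ void read_constant(const int* sel, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // every thread in the warp reads the same location: one broadcast
    // from the constant cache
    float uniform = c_table[sel[blockIdx.x]];

    // threads read different locations: the accesses are serialized,
    // which is the slow path for constant memory
    float divergent = c_table[threadIdx.x % 32];

    out[tid] = uniform + divergent;
}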

Thank you both for the answer. It is extremely helpful.

Thank you!

It is mentioned in the “Best Practices Guide”:

"If all threads of a warp access the same location, then constant memory can be as fast as a register access. "

For the reasons you outline, and given the measurements published in Table 3.1 of the “Dissecting Turing” paper, the real-world broadcast/no-conflict latency appears to be slightly worse than, or on a par with, that of shared memory, depending on the architecture.

thanks!

To add to @rs277’s point, as best I know, what is written in the Best Practices Guide used to be empirically true on earlier GPU architectures: warp-uniform access to constant memory was basically as fast as a register access. The power consumption, measured in pJ, was probably worse.

I noticed that on newer GPU architectures the immediate field of instructions has been widened so it can accommodate a full FP32 operand (previously, most instructions could only encode the leading twenty bits of an FP32 operand). This increased the number of FP32 compile-time constants stored as immediates and significantly reduced the use of constant memory by the compiler.

This leads me to suspect that warp-uniform speed of access to constant memory is no longer on par with register access. The architecture of the GPU memory subsystem (and possibly the hardware implementation details of on-chip storage, including registers, itself) has changed significantly across the different GPU architectures.
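For anyone who wants to check the immediate-encoding point on their own GPU, a trivial sketch with a full-precision FP32 literal; in the SASS (cuobjdump --dump-sass) the literal shows up either as an instruction immediate or as a constant-bank operand, depending on architecture and compiler version:

__global__ void scale(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] * 0.78539816f;  // pi/4 needs the full 32 bits;
                                           // check the SASS to see how it
                                           // was encoded
}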

There is a surprising amount of information in it. Thanks to all of you.
