How fast is a const array defined in a kernel?

This may be quite a basic question. Suppose I define a const array in a kernel function…

__global__ void my_kernel(int* pGlobal) {
    const int arr[5] = {1, 2, 3, 4, 5};
    /*
        each thread does something with arr
        then writes the result back to global memory
    */
}

In this example the array has size 5, but in general let’s just assume the array is not so large that it cannot fit in the L1 cache. I came across this post saying that if we define constants with the __constant__ specifier then they are available at register-level speed, so my question is: can the same be said for a const array defined like the one above?

  • Does the nvcc compiler optimize it such that this const array is available at register/shared-memory-level speed, or much worse than that? And just to make sure,
  • a const array defined this way will not be duplicated so that each thread has its own copy; instead, it will live somewhere that all threads in a block can access, correct? (I am 99% sure this is the case though…)

Thank you for taking your time answering this!

const is not __constant__. The ordinary C++ keyword const has the semantics “read only”. Its primary purpose is as a compile-time guard against writes to the data with this attribute. It does not imply a storage class.

What you are inquiring about is an implementation artifact. As such, neither the CUDA language specification nor the compiler manual will provide information on how arr will be stored. In fact, how it will be stored likely depends on the usage. The easiest way to find out what happens in the specific context of your code is to compile the source code and then disassemble the machine code with, for example, cuobjdump --dump-sass.

Thread-specific data, and that includes arr here, is stored in (thread-)local memory by default. In that case you should expect to see five STL instructions initializing this array in the disassembly, and an LDL for every access to it. As an optimization, and depending on heuristics and code context (e.g. re-use), thread-local data may be buffered in registers. In some cases, an array of compile-time literals like arr may even become a value in the immediate field of SASS instructions, meaning no data storage is assigned to it at all.
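For illustration, here is a minimal sketch (with hypothetical kernel names) of how usage can change where such an array ends up. Compiling it and inspecting the output of cuobjdump --dump-sass shows what the compiler actually did in each case:

__global__ void k_static_index(int* out)
{
    const int arr[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    #pragma unroll
    for (int i = 0; i < 5; ++i)  // all indices known after unrolling:
        sum += arr[i];           // arr may be folded into registers or
                                 // instruction immediates
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

__global__ void k_dynamic_index(const int* idx, int* out)
{
    const int arr[5] = {1, 2, 3, 4, 5};
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // runtime-dependent index: the compiler will typically keep arr in
    // thread-local memory, so expect STL for the initialization and LDL here
    out[tid] = arr[idx[tid] % 5];
}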

In terms of performance, this kind of thread-local const array is unlikely to provide the benefits you would get from a __constant__ array (except in the last case mentioned above).
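If the goal is to actually get constant-memory behavior, the array needs to be declared with __constant__ at file scope and initialized from the host. A minimal sketch (hypothetical names, not your actual code):

__constant__ int c_arr[5];   // lives in device memory, read through the constant cache

__global__ void my_kernel_constant(int* pGlobal)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int i = 0; i < 5; ++i)   // same index for every thread in the warp:
        sum += c_arr[i];          // the broadcast case the constant cache is built for
    pGlobal[tid] = sum;
}

// host side, before the launch:
//   int h_arr[5] = {1, 2, 3, 4, 5};
//   cudaMemcpyToSymbol(c_arr, h_arr, sizeof(h_arr));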

Aside:

Not everything written on the internet is true. I followed the link provided in that post and I don’t find that written in NVIDIA documentation.

__constant__ data is stored in GPU DRAM memory and is accessed through a per-SM constant cache resource. It’s true that the best access pattern is uniform (all threads in the warp reading the same location), but none of those statements implies “register-level speed”. A register is the highest-speed, lowest-latency memory resource available on a GPU. Pulling data from global memory through a constant cache should not be equated with that. I would suggest that even a cache hit should not be equated with register-level speed. Meaning, I’m not aware of any published data or specs that confirm that equality.
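To make the access-pattern point concrete, here is a small sketch (hypothetical kernel) contrasting a warp-uniform read with a divergent read of __constant__ data:

__constant__ float c_table[32];

__global__ void read_constant(const int* sel, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // every thread in the warp reads the same location: one broadcast
    // from the constant cache
    float uniform = c_table[sel[blockIdx.x]];

    // threads read different locations: the accesses are serialized,
    // which is the slow path for constant memory
    float divergent = c_table[threadIdx.x % 32];

    out[tid] = uniform + divergent;
}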

Thank you both for the answer. It is extremely helpful.

Thank you!

It is mentioned in the “Best Practices Guide”:

"If all threads of a warp access the same location, then constant memory can be as fast as a register access. "

For the reasons you outline, and given the measurements published in Table 3.1 of the “Dissecting Turing” paper, the real-world broadcast/no-conflict latency appears to be slightly worse than, or on a par with, that of shared memory, depending on the architecture.

thanks!

To add to @rs277’s point, as best I know, what is written in the Best Practices Guide used to be empirically true on earlier GPU architectures: warp-uniform access to constant memory was basically as fast as a register access. The power consumption, measured in pJ, was probably worse.

I noticed that on newer GPU architectures the immediate field of instructions has been widened so it can accommodate a full FP32 operand (previously, most instructions could only encode the leading twenty bits of an FP32 operand). This increased the number of FP32 compile-time constants stored as immediates and significantly reduced the use of constant memory by the compiler.

This leads me to suspect that warp-uniform speed of access to constant memory is no longer on par with register access. The architecture of the GPU memory subsystem (and possibly the hardware implementation details of on-chip storage, including registers, itself) has changed significantly across the different GPU architectures.
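For anyone who wants to check the immediate-encoding point on their own GPU, a trivial sketch with a full-precision FP32 literal; in the SASS (cuobjdump --dump-sass) the literal shows up either as an instruction immediate or as a constant-bank operand, depending on architecture and compiler version:

__global__ void scale(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] * 0.78539816f;  // pi/4 needs the full 32 bits;
                                           // check the SASS to see how it
                                           // was encoded
}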

There is a surprising amount of information in it. Thanks to all of you.
