Where do const parameters of kernel functions reside?

As far as I know, to keep a variable in the constant memory, we must use the constant qualifier. However, I often see people declare const parameters in their kernels, something like

void myKernel(float *A, const int B){
}

I wonder where do these const parameters reside? In the global memory or the constant memory?

Thank you!

All kernel parameters (const-qualified or not) are initially loaded into constant memory.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters

Thank for the response.

However, what I am concerned with is where these parameters eventually end up. Say if I were to refer to a const parameter in the body of the kernel function, would it be fetched from the constant memory or the global memory?

Also, what does this imply as to the performance? It seems to me that if all threads refer to the parameter at the same time, it will be faster if it is in the constant memory. Am I correct about this?

What exactly are you concerned about, and why? Data in constant memory resides in exactly the same physical DRAM as global memory. It is just that caching behavior and some usage semantics differ (see programming guide). Similarly for local memory: it used the same physical DRAM as global memory, but provides a different, per-thread, address mapping.

If constant memory data is cached, the speed of cache access is effectively the same performance as a register access. Constant data is useful and intended for data for which access is uniform across the warp (i.e. all threads in the warp present the same address): with one physical access the data can then be broadcasted to each thread in the warp,

The const qualifier has nothing to do with CUDA’s constant memory. It is inherited from standard C and C++ and simply means “read-only”. The compiler will flag any write access to a const-qualified data object as an error.

Thanks for the response.

It is exactly the cache and broadcast behavior of constant memory that I am concerned about, since in my applications, usually the parameters with the const qualifier are accessed uniformly, so it would be great if they could be cached and broadcast. So according to your answer, the const parameters will eventually end up in the global memory like other parameters, and to force them to be in the constant memory I will have to use the constant qualifier. Have I understood correctly?

Again, “const” and “constant” are two different things. If they were the same thing, there would be no need for “constant” to exist.

As Robert Crovella already stated, at kernel launch time, all kernel arguments are stored in constant memory (more precisely, a specific constant memory bank), regardless of whether they are “const” qualified in the source code. Take your example kernel:

void myKernel(float *A, const int B)

As kernel execution commences, we find the pointer A stored in constant memory, and the integer B stored in constant memory. The compile may elect to move some of the kernel arguments into registers later, but from what I can see in disassembled code that rarely happens.

So the kernel parameters are initially stored in the constant memory, may end up in the registers, but in no case will they be stored in the global memory? And I do not need to explicitly use the constant qualifier to make the kernel parameters benefit from the cache and broadcast mechanisms?

The kernel arguments go into a specific constant memory bank automatically, you don’t have to do anything to make that happen. They don’t go into global memory.

Just keep in mind that constant memory, global memory, local memory all refer to the same physical DRAM connected to the GPU. In the end these are just different “viewports” on the same memory. Data items from all three memory types are cacheable, and while there were physically separate caches in early GPUs I am reasonably sure they all go through one big unified cache on current hardware.

Also keep in mind that when you pass a pointer to a kernel, while the pointer is stored in constant memory when the kernel starts, the data it points to sits in global memory.

If you are concerned about performance, I would strongly suggest acquainting yourself with the CUDA profiler. It will quickly steer you to the bottlenecks in your code, and you can avoid worries about minor implementation details such as how kernel arguments are passed.

I think one needs to tread carefully here with terminology. GPUs do not use a strict load/store architecture. Various GPU instructions allow one of their source operands to be a constant bank memory location, making these load-execute type of instructions. This deviation from a strict load/store architecture allows GPUS to be more efficient than RISC processor architectures in various practically relevant use cases.

Below is the disassembly of a kernel. The references to c[0x0] represent constant memory locations in .param space, i.e. kernel arguments. The references to c[0x2] are references to the constant memory bank used by the compiler (these come from literal constants in the source code). From the disassembly we can also see that various instructions allow the storage of data inside the instruction itself (immediate data).

/* 0x001c7c00e22007f6 */
       /*0008*/                   MOV R1, c[0x0][0x20];                                           /* 0x4c98078000870001 */
       /*0010*/                   S2R R0, SR_CTAID.X;                                             /* 0xf0c8000002570000 */
       /*0018*/                   S2R R2, SR_TID.X;                                               /* 0xf0c8000002170002 */
                                                                                                  /* 0x001fd840fec20ff1 */
       /*0028*/                   XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ;                     /* 0x4f107f8000270003 */
       /*0030*/                   XMAD R2, R0.reuse, c[0x0] [0x8], R2;                            /* 0x4e00010000270002 */
       /*0038*/                   XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2;                             /* 0x5b30011800370000 */
                                                                                                  /* 0x081fc400ffa007ed */
       /*0048*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x150], PT;                     /* 0x4b6d038005470007 */
       /*0050*/               @P0 EXIT;                                                           /* 0xe30000000000000f */
       /*0058*/                   SHL R2, R0.reuse, 0x2;                                          /* 0x3848000000270002 */
                                                                                                  /* 0x001fc800fec007f5 */
       /*0068*/                   SHR R3, R0, 0x1e;                                               /* 0x3829000001e70003 */
       /*0070*/                   IADD R2.CC, R2, c[0x0][0x140];                                  /* 0x4c10800005070202 */
       /*0078*/                   IADD.X R3, R3, c[0x0][0x144];                                   /* 0x4c10080005170303 */
                                                                                                  /* 0x001fd800fc200071 */
       /*0088*/                   LDG.E.CI R4, [R2];                                              /* 0xeed4a00000070204 */
       /*0090*/                   MOV32I R7, 0x3e2c7e60;                                          /* 0x0103e2c7e607f007 */
       /*0098*/                   MOV R9, c[0x0][0x8];                                            /* 0x4c98078000270009 */
                                                                                                  /* 0x001fd808fda007f1 */
       /*00a8*/                   XMAD.MRG R8, R9, c[0x0] [0x14].H1, RZ;                          /* 0x4f107f8000570908 */
       /*00b0*/                   FSETP.LT.AND P0, PT, R4, 1.1754943508222875e-038, PT;           /* 0x36b1038080070407 */
       /*00b8*/               @P0 FMUL R4, R4, 8388608;                                           /* 0x3868004b00000404 */
                                                                                                  /* 0x001f9800fec207f6 */
       /*00c8*/                   IADD32I R5, R4.reuse, -0x3f3504f3;                              /* 0x1c0c0cafb0d70405 */
       /*00d0*/                   LOP32I.AND R5, R5, 0xff800000;                                  /* 0x040ff80000070505 */
       /*00d8*/                   IADD R6, R4, -R5;                                               /* 0x5c11000000570406 */
                                                                                                  /* 0x081fd840fec007f6 */
       /*00e8*/                   FADD R6, R6, -1;                                                /* 0x3958003f80070606 */
       /*00f0*/                   FFMA R7, R6.reuse, 0.0970458984375, -R7;                        /* 0x328203bdc6c70607 */
       /*00f8*/                   FFMA R7, R6.reuse, R7, c[0x2][0x4];                             /* 0x5180038800170607 */
                                                                                                  /* 0x001f9840fec207f6 */
       /*0108*/                   FFMA R3, R6.reuse, R7, c[0x2][0x8];                             /* 0x5180038800270603 */
       /*0110*/                   FFMA R3, R6.reuse, R3, c[0x2][0xc];                             /* 0x5180018800370603 */
       /*0118*/                   FFMA R3, R6, R3, c[0x2][0x10];                                  /* 0x5180018800470603 */
                                                                                                  /* 0x081fc440fec007fd */
       /*0128*/                   FFMA R3, R6, R3, c[0x2][0x14];                                  /* 0x5180018800570603 */
       /*0130*/                   FFMA R3, R6.reuse, R3, c[0x2][0x18];                            /* 0x5180018800670603 */
       /*0138*/                   FFMA R3, R6.reuse, R3, c[0x2][0x1c];                            /* 0x5180018800770603 */
                                                                                                  /* 0x0000c440fe0007f5 */
       /*0148*/                   FSETP.GEU.AND P1, PT, R4, +INF , PT;                            /* 0x36be03ff8007040f */
       /*0150*/         {         FFMA R7, R6.reuse, R3, c[0x2][0x20];                            /* 0x5180018800870607 */
       /*0158*/                   I2F.F32.S32 R3, R5;        }                                    /* 0x5cb8000000572a03 */
                                                                                                  /* 0x001fc401fc4007f5 */
       /*0168*/                   SEL R2, RZ, c[0x2][0x0], !P0;                                   /* 0x4ca004080007ff02 */
       /*0170*/                   FMUL R5, R6, R7;                                                /* 0x5c68000000770605 */
       /*0178*/                   FSETP.LEU.OR P0, PT, R4, RZ, P1;                                /* 0x5bbb20800ff70407 */
                                                                                                  /* 0x005fd440fe2207f3 */
       /*0188*/                   XMAD R7, R9.reuse, c[0x0] [0x14], R0;                           /* 0x4e00000000570907 */
       /*0190*/                   FMUL R5, R6.reuse, R5;                                          /* 0x5c68000000570605 */
       /*0198*/                   FFMA R2, R3, 1.1920928955078125e-007, R2;                       /* 0x3280013400070302 */
                                                                                                  /* 0x001fc800fca007f1 */
       /*01a8*/                   FFMA R3, R6, c[0x2][0x24], R5;                                  /* 0x4980028800970603 */
       /*01b0*/                   XMAD.PSL.CBCC R6, R9.H1, R8.H1, R7;                             /* 0x5b30039800870906 */
       /*01b8*/                   FADD R5, R2, R3;                                                /* 0x5c58000000370205 */
                                                                                                  /* 0x081fc400fe200032 */
       /*01c8*/               @P0 MUFU.LG2 R5, R4;                                                /* 0x5080000000300405 */
       /*01d0*/                   ISETP.LT.AND P0, PT, R6, c[0x0][0x150], PT;                     /* 0x4b63038005470607 */
       /*01d8*/                   SHR R3, R0.reuse, 0x1f;                                         /* 0x3829000001f70003 */
                                                                                                  /* 0x001fc000fc4207f6 */
       /*01e8*/                   LEA R2.CC, R0.reuse, c[0x0][0x148], 0x2;                        /* 0x4bd7810005270002 */
       /*01f0*/                   LEA.HI.X R3, R0, c[0x0][0x14c], R3, 0x2;                        /* 0x1a17018005370003 */
       /*01f8*/         {         MOV R0, R6;                                                     /* 0x5c98078000670000 */
       /*0208*/                   STG.E [R2], R5;        }                                        /* 0x001ffc01ffa010fd */
                                                                                                  /* 0xeedc200000070205 */
       /*0210*/               @P0 BRA 0x58;                                                       /* 0xe2400fffe400000f */
       /*0218*/                   EXIT;                                                           /* 0xe30000000007000f */
                                                                                                  /* 0x001f8000fc0007ff */
       /*0228*/                   BRA 0x220;                                                      /* 0xe2400fffff07000f */
       /*0230*/                   NOP;                                                            /* 0x50b0000000070f00 */
       /*0238*/                   NOP;                                                            /* 0x50b0000000070f00 */

true
removed my comment

FWIW, I consider edits using strike-through superior to removal, as it preserves continuity.

Thank you both for answering! It has been really helpful.

Global kernel parameters are copied into global memory (system or device) as part of the launch sequence and referenced through the per stream constant bank (c[0]). The parameter are generally referenced as immediate constants in many opcode but can also be referenced as indexed constants through the LDC (load indexed constants) instruction. Both methods load the constant through the constant memory hierarchy which results in a global memory load at an address not known to the kernel.

The SASS code posted by njuffa is for Maxwell or Pascal architecture. The first kernel parameter is loaded from c[0][0x140] on line 15. This base address is shifted on Volta/Turing to 0x160. There are other immediate constants in the start of the code. For example,

c[0][0x020] initial address of the stack pointer
c[0][0x008] blockDim.x (blockDim is offset 8-19, gridDim is offset 20-31)
c[0][0x140] first kernel parameter

These offsets may be different on other architectures.

Let me quibble a bit about terminology.

(1) Immediate constants are data embedded in instructions (integer of floating-point). No indirection: what’s stored in the instruction is the data.

(2) Constant memory references are indexes into memory (first index selects constant bank, second index the location within the bank). One level of indirection: what’s stored in the instruction is the location of the data.

(3) I am no friend of the practice of referring to all DRAM as “global memory” as it seems confusing to some programmers and some automatically associate “global memory” with “slow access” (see the start of this thread for a worked example). I find the following paints a clearer picture: Portions of physical DRAM are mapped as either constant memory (accessed via LDC instruction or constant memory reference), local memory (accessed via LDL/STL instructions), or global memory (accessed via LDG/STG instructions or textures).