Why compiler prefer to use registers to cache hot data rather than constant memory?

I have a program like this:

__constant__ int data[256];
__constant__ uint8_t index[7][16];
__global__ void kernel(some params){
    //Read some input from global memory

    for(int i = 0; i < 4; i++) {
          for (int j = 0; j < 7; j++){
                  for(int k = 0; k < 8; k++) {
                          //Use data[i * 7 * 8 + j * 8 + k], index[j][2k] and index[j][2k+1] do some calculation
                          //The calculation is mainly consist of funnel shift, xor, 3-input addition and 2-input addition
                  }
           }
    }

    //Write to global memory
}

I found that compiler tries to cache the index array in register. In the beginning of SASS code, compiler use many LDC instruction to load data in the index array into register and cause register usage of kernel exploded to maximum(255 registers per thread).
I think the size of constant memory cache is 8KB(or 4KB). It’s enough to hold all the data in the index array.
Besides, if all threads in a warp access the same address in cache, it’s as fast as register.
So why compiler tries to hold hot data in register rather than constant memory ?

Besides, if all threads in a warp access the same address in cache, it’s as fast as register.

Do you have a source for this statement?

The data still needs to be loaded from the cache into a register before it can be used. This may stall a warp if the value is required for the next instruction. With registers, the memory stall should only occur for ì = 0.

I find the statement in Best Practices Guide :: CUDA Toolkit Documentation
If all threads of a warp access the same location, then constant memory can be as fast as a register access.

It still requires an extra access step in various cases.

The GPU is largely a load-store machine. That means, with a few exceptions, that operands for any instruction must be register operands.

The data for the instruction, if it exists in a register, will require (for many instructions) one fewer instruction and one fewer step in order to be consumed, as compared to data that only exists in constant memory.

So it’s not completely implausible that the compiler may choose to pull data into a register from a constant memory location.

Whether or not it is sensible in your case is difficult to say. In general, after correctness, I would say that the number one priority of the compiler is performance, and the compiler certainly “knows” a lot about how to unlock GPU performance. So without any other data, I would suggest that the compiler is making a decision about performance.

It’s possible the compiler is wrong, of course. In that case, if you can come up with a convincing comparative test case (that includes demonstrated/measured performance, not just an assumption based on register usage) then the usual suggestion at that point would be to file a bug.

There is some evidence to suggest this statement is not true. Table 3.1, on page 22 of Dissecting Turing shows that the broadcast latency of the constant cache (L1 & 2k in size), varies slightly around 27 cycles depending on SM arch.

(1) Using many registers is not a red flag per se. One would have to show that an alternative translation into code using fewer registers would in fact yield higher performance. If I recall correctly, the cases of high register usage discussed in these forums in the past few years have shown that the CUDA compiler generally makes the right decision, in that attempts to reduce the register usage resulted in longer execution times. Obviously a compiler cannot make perfect decisions all the time, but the CUDA toolchain is quite mature at this time and rarely errs in this question my experience.

(2) The sentence in the Best Practices Guide about uniform constant memory access being as fast as a register access (as long as there is a hit in the constant cache!) has been discusses in these forums before. It used to be true, say, in the Kepler time frame. One should keep in mind that huge register files such as those found in GPUs are not single-cycle access either, as access latency increases with increasing register file size.

Best I know, the physical implementation styles of both register file and constant cache have changed since Kepler times (possibly multiple times by now), so NVIDIA should review this particular sentence in the Best Practices Guide. Changes to the hardware ISA, in particular the widening of the floating-point immediate field in floating-point instructions, reduce the need for constant cache usage. This may be driven by a desire for energy efficiency, but could also be an indication of awareness that constant cache usage is not as performance competitive as it used to be. That said, it still seems that constant cache use is “the next best thing” to pulling data from a register.

Looking at the table referenced above, probably better to roughly equate it to conflict free shared memory access.

Agreed.

I’m confused. Do you mean a SASS instruction in code maybe translated into two or more instruction when executed if some operand is not located in a register?
For example, I see some SASS instruction like this:
IMAD R2, R2, c[0x0][0x0], R3
Do you mean it’s translated into two instructions: a LDC instruction which loads data from constant memory into register and a IMAD instruction which does calculation with all operands in registers? So two instructions are issued by dispatch unit instead of one? Or maybe still only one instruction is issued but two executed?
Or it’s still executed in the FMA pipeline as a single instruction, but the pipeline do some additional work to load the data from constant memory into register?

I doubt data in this paper. According to this paper, the dependent issue latency of IMAD is 5 on Turing architecture(table 4.1 on page 40). However, this can not explain what i got in my experiment.
I launched a program on 2080ti which run 4 warps per sm. So each smsp runs only one warp.
Each warp execute a lot of IMAD instructions like this:

        /*01a0*/                   IMAD R6, R5, R4, R7 ;                    /* 0x0000000405067224 */
                                                                            /* 0x000fc800078e0207 */
        /*01b0*/                   IMAD R5, R4, R7, R6 ;                    /* 0x0000000704057224 */
                                                                            /* 0x000fc800078e0206 */
        /*01c0*/                   IMAD R4, R7, R6, R5 ;                    /* 0x0000000607047224 */
                                                                            /* 0x000fc800078e0205 */
        /*01d0*/                   IMAD R7, R6, R5, R4 ;                    /* 0x0000000506077224 */
                                                                            /* 0x000fc800078e0204 */

One input of each IMAD is output of previous IMAD. From nsight compute, I see the utilization of FMA is nearly 50%. So the dependent issue latency should be 4 cyles, not 5.

These "dissecting [architecture]* papers must be taken with a grain of salt (or two). I have found questionable data in them before. But from creating micro-architectural benchmarks in a past life, I know it is hard to set them up such that they deliver conclusive and unambiguous results, and mis-interpretation of timing data is a distinct possibility.

To the best of my knowledge, NVIDIA has never revealed in any public document how their GPUs handle load-execute instructions likeIMAD R2, R2, c[0x0][0x0], R3 internally (e.g. is the instruction split into separate load and execution ops past instruction decode). Generally NVIDIA GPUs follow the load-store architecture paradigm; the exception to that are load-execute instruction flavors with constant-bank references. Accesses to global memory, local memory, and shared memory all require load/store. This is probably a legacy element left over from graphics shaders of old, but it does help to promote code density in compute kernels.

Agree :)

Sorry, I missed some important information when I abstracted my real code.
In fact, I didn’t use the value in the index array to do calculation directly. I used the value as index to access a local array. So the real code should be abstracted as below:

__constant__ int data[256];
__constant__ uint8_t index[7][16];
__global__ void kernel(some params){
    uint32_t state[16] = {0};
    for(int i = 0; i < 256; i++) {
          state[0] += data[i];
          for (int j = 0; j < 7; j++){
                  uint32_t prev = state[index[j][0]];
                  for(int k = 1; k < 16; k++) {
                            uint32_t t = index[j][k];
                            prev = state[t] = prev * state[t] + data[i];
                  }
           }
    }

    //Write to global memory
}

As the value of index array is not known at compile time, the state array is located in local memory. Each time accessing state[t] requires accessing index array first and then multiply it by 4 to calculate offset in state array. In this case, the compiler choose to load the index array, calculate address of state[t] and then cache these address in registers. That’s one reason why many registers are consumed. I need do more deeper analyzation.

Besides,
I wrote another simple program to test performance of constant memory when cache hit:

constexpr uint32_t CONST_NUM = (256) / sizeof(uint32_t);
__constant__ uint32_t c_data[CONST_NUM];

__global__
void cmem_perf_test(uint32_t *out){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint32_t result = c_data[threadIdx.x];
    for (uint32_t i = 0; i < 2000; i++) {
        for (uint32_t j = 0; j < CONST_NUM; j++) {
            result = result * result + c_data[j];
        }
    }
    out[tid] = result;
}

The SASS code consist of many IMAD instruction like blow:

     IMAD R5, R5, R5, c[0x3][0x34] 
     IMAD R5, R5, R5, c[0x3][0x38] 
     IMAD R5, R5, R5, c[0x3][0x3c] 
     IMAD R5, R5, R5, c[0x3][0x40] 
     IMAD R5, R5, R5, c[0x3][0x44] 

I also run this kernel on 2080Ti and make sure only one warp is scheduled in one smsp.
The performance is equal to IMAD instruction with all 3 operands coming from registers.
It seems maybe the statement in the best practice documentation is correct.

No, I mean that not every instruction that might use constant data can take a constant operand like the one you have shown. Obviously IMAD can take constant data (directly) as an operand.

But if an instruction cannot use constant data directly as an operand, then the compiler first needs to do something like LDC or MOV (or IMAD, etc.). That’s an “extra” instruction/step, compared to having the data already in a register.

In a later comment you mention using those values as indexing values. I’m fairly certain that such operations as operands require registers, not constant data. I’ve seen this, of course:

STL    [R0], R1;

I’ve never seen this:

STL    [c[0x3][0x34]], R1;

So perhaps its not surprising that the compiler is loading indexing values into registers. At any rate, I cannot conclusively explain what the compiler is doing in your case.

As long as the indices are the same for the whole warp (and it can be proven at compile time), the values can be loaded into uniform registers (not filling the normal registers). The uniform data path with execution units and registers is similar to a 33rd lane, executing instructions identical for the whole warp.

(But the uniform path seems to not have been used in your case?)

Compared to the other caches, the constant cache is not meant for buffering data with streaming use. It rather is intended for repeated use of the same coefficients in a loop.

That is, why (streaming) data is prefetched earlier and constant data is used directly at the relevant instruction. The instructions prefetching data from memory into registers can be executed, even if the data is not available yet. The dependency slot is marked in the instructions so that later instructions dependent on the to-be-filled register can wait, if the data has not arrived yet. This dependency marking is not possible for constant data used directly as input operand (instead of loading constant data with LDC).

As you declared both data and indices as __constant__ I am not sure, how the compiler distinguishes both uses in that case. Is any of the instructions dependent on the thread id?

You can also test the effects of using the __ldg intrinsic (Programming Guide :: CUDA Toolkit Documentation) for explicitly reading through the read-only data cache.

In this simple case, a lot of uniform registers(more than 50) are used to cache address of element in the state array. But there are also some normal registers are used to cache address. Maybe that’s because the uniform registers are not enough?
How many uniform registers are available per warp on ampere architecture?

It’s wired no uniform registers are used in my real code. I need do more analization.

Unless it has changed from Turing, this document states: “The upper limit of total registers is 256, including both regular and uniform registers”, p20.

I realise the authors of this are subject to doubt, (grains of salt etc. ;), but it seems odd to me that Nvidia would host their presentation, knowing it contained incorrect information.

The linked document is hosted by NVIDIA because it belongs to a session at GTC 2019 and thus included in the online repository for the conference. I do not think this can be interpreted as endorsement of its content by NVIDIA. There is also a difference between calling into question certain findings in a document vs calling into question the authors of said document, and I do not recall anyone in these forums engaging in the latter.

Not everything written up in a paper, even when peer reviewed in a prestigious journal, is the gospel truth. In fact, my personal experience with trying to implement algorithms from publications would seems to indicate that about half (!) of publications suffer from material errors or omissions. The world isn’t perfect, stuff happens.

The number of registers available is a data item one can glimpse from an analysis of instruction encodings, rather than basing it on (possibly flawed) measurements and subsequent (possibly flawed) interpretation. The likelihood of getting this information wrong seems exceedingly small. FWIW, I would assume that the number of registers available to programmers is 255, as one of the 256 encodings is presumably reserved for the zero register RZ.

Indeed, I should have said “the findings of this research”, instead of “authors of this”.

Yes, the Arxiv hosted paper states “Instructions on Turing still supports the 256 regular registers (including the general-purpose R0–R254 and the Zero Register RZ).” and “we were able to enumerate the 64 uniform registers supported by Turing (including a Uniform Zero Register URZ and 63 general-purpose uniform registers UR0–UR62) by systematically disassembling packed uniform instructions.”

This is still true in Ampere. IIRC some tools report or store the overall numbers of used registers instead of just the regular registers.

I just looked at the paper to the talk:

We found that the cuobjdump -dump-resource-usage command (that prints a kernel’s register usage) reports a count that includes both regular and uniform registers. The upper limit of total registers used in any CUDA kernel is 256, unchanged from Volta.

We confirmed this result by patching the register count in the section header of a CUDA kernel to values above 256, and determining that cuobjdump only recognizes 256 registers at most.

(https://arxiv.org/pdf/1903.07486.pdf)

If this is true (and not just a limitation of the tools), the reason could be (just guessing) that on the hardware level, the regular registers of each lane store a copy of all uniform registers for faster access for the instructions that use both - regular and uniform registers.

But I believe not, then there would be a lot more SASS instructions being able to use uniform registers and they would (when reading) use the same encoding as when accessing regular registers. As in new CUDA architectures the SM does not have to check for instruction latencies and dependencies, as this information is stored in the instruction words, the SM would not need to know, whether a register to read is a general or uniform register. So the shadow register theory is unlikely.

Another theory would be that the compilation has several intermediate representations and a conversion of some instructions to the uniform datapath was included very late in the process, when the distribution of the data to registers and thus the number of registers was already fixed.

To prevent the compiled program from using more than 255 general registers, the maximum total as input to the compiler would have been limited to 255. But in this case, -maxrregcount would also limit the total number of registers.

A third theory is that the number of registers put into the object file was limited to 256 (not 255?) for compatibility reasons and it has nothing to do with the capabilities of the compiler or the GPUs.

We could hack together a few SASS commands that fill and read back all 255+63 registers and try it out.

Does anybody know the maximum number of uniform registers per SM? Or per block? We know the limit of 64 per warp due to instruction encoding. What if several warps run in each partition? Then the uniform registers would have to be stored in a register file.

In the Hot Chips 31 presentation of the Turing architecture they showed a slide with a regular register file of 64 kB (i.e. 4 Bytes * 32 lanes * 512) per partition and a uniform register file of 2 kB (i.e. 4 Bytes * 1 lane * 512) per partition. So from this slide it should actually be possible to use as many uniform registers as regular registers (per thread, of course they are shared within the warp, so there are 32x less), but up to a maximum of 63 (due to instruction encoding).
(RTX on—The NVIDIA Turing GPU | Semantic Scholar you can find the presentation also on Youtube and there is a PDF version)

Having the same overall register size (/32) means that there is no additional criteria for occupancy calculation and no different logic for register file division to warps.

NVIDIA has at least one patent related to uniform registers which may provide some insights: US10866806B2 Uniform register file for improved resource utilization.

Standard caveats about reading patents apply. My take is that it does not matter what is happening under the hood, and what is or is not exposed through the tool chain. Let’s assume someone takes the time tio reverse engineer it. What would one do with that information? I don’t think anyone is going to build an alternative toolchain for NVIDIA GPUs that is usable by the general public in the longer term (see various one-off SASS assembler projects).