Do switch statements require gmem reads for the jump table?

I would like to have an array of fewer than 10 floats stored in registers in a CUDA kernel. I don’t know the indexes at compile time (so the compiler can’t keep a plain array in registers), though I know for certain that they will all range from 0 to 10.

Below I have written code that indexes a storage struct, using a switch statement to determine which variable to set/get.

I know a switch statement requires a jump table, so would that need to be stored in global memory?

Assuming each thread in a warp has the same index, I am wondering whether the code below would be efficient for retrieving an element from a register, or whether it would cost the time of a global memory read for the switch table, since I’m not sure whether jump tables are placed in the instruction cache or in gmem.

struct Storage {
    float r0, r1, r2, r3;

    __device__ float get(int index) {
        switch (index) {
        case 0:  return r0;
        case 1:  return r1;
        case 2:  return r2;
        case 3:  return r3;
        default: return 0.0f; // index is guaranteed to be in range; this just avoids falling off the end
        }
    }

    __device__ void set(int index, float value) {
        switch (index) {
        case 0: r0 = value; return;
        case 1: r1 = value; return;
        case 2: r2 = value; return;
        case 3: r3 = value; return;
        }
    }
};
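For reference, here is a rough sketch of how I intend to use it (the kernel and variable names are just placeholders); the index comes from per-block data, so it is uniform across each warp:

__global__ void demo(const int* index_per_block, float* out) {
    // Each accessor names a specific member, so nothing forces the struct into local memory.
    Storage s;
    s.set(0, 1.0f);
    s.set(1, 2.0f);
    s.set(2, 3.0f);
    s.set(3, 4.0f);

    int idx = index_per_block[blockIdx.x];   // warp-uniform runtime index
    out[blockIdx.x * blockDim.x + threadIdx.x] = s.get(idx);
}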

Thanks for any input.

Generally speaking, there are a variety of techniques for implementing switch statements. These could be if-then-else sequences or trees, jump tables, or computed branches. Which one(s) a compiler chooses may depend on the number of labels, whether the labels are dense, the code associated with each label, whether there is a default label, and of course the target processor architecture.
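For illustration, the if-then-else variant of the accessor from the question could be written like this at source level (as an extra member of the Storage struct; the function name is just illustrative, and it relies on the index being limited to 0-3):

__device__ float get_if_else(int index) {
    // If-then-else sequence: no table lookup, just compares and branches,
    // which the compiler may also turn into predicated selects.
    if (index == 0) return r0;
    if (index == 1) return r1;
    if (index == 2) return r2;
    return r3;
}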

It would be best if you looked at the generated code for the specific target architecture and in the specific code context of your use case. You can look at the generated machine code with cuobjdump --dump-sass
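For example, something along these lines (file name and target architecture are placeholders, adjust them to your build):

nvcc -arch=sm_75 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin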

A quick test with the simple example provided for a couple of GPU architectures shows a SASS sequence like the following:

        /*0038*/                   SHF.L R0, RZ, 0x2, R0;
        /*0048*/                   LDC R2, c[0x2][R0];
        /*0050*/                   BRX R2 -0x58;

This indicates that in this case, the switch is implemented via a table which is stored in constant memory, specifically constant bank 2. Accesses via LDC are generally backed by a small cache. The usage pattern of the BRX instruction suggests that the table does not store target addresses, but branch offsets, and that BRX adds this offset to the starting address for the switch (0x58 in the example above).


Thanks for the lengthy reply.

I still have to get around to checking the disassembly for my kernels, but I guess I didn’t expect the jump table to be stored in constant memory.

I think I heard that cached constant memory approaches the speed of shared memory, so if that is the case, I may as well just store the data in shared memory instead of attempting a register array. (I still have to check the disassembly for my final kernel to see how the compiler implements the jump table.)
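Something along these lines is what I have in mind for the shared-memory variant (just a sketch, assuming 4 values per thread and a launch with blockDim.x * 4 * sizeof(float) bytes of dynamic shared memory; names are placeholders):

__global__ void demo_smem(const int* index_per_block, float* out) {
    extern __shared__ float scratch[];

    // Interleaved layout: element i of thread t lives at scratch[i * blockDim.x + t],
    // so a warp reading the same logical index touches consecutive banks (no conflicts).
    for (int i = 0; i < 4; ++i)
        scratch[i * blockDim.x + threadIdx.x] = (float)i;

    // Each thread only touches its own slice, so no __syncthreads() is needed here.
    int idx = index_per_block[blockIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[idx * blockDim.x + threadIdx.x];
}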

As far as I recall, cached constant memory data approaches the speed of a register access, and the CUDA compiler frequently uses it in that capacity. For example, instead of loading an immediate into a register and then using that register, it will place the immediate data in constant memory and access it directly with an ALU instruction that can substitute a constant memory reference for a register operand (in x86 parlance this would be a “load-execute” instruction), e.g. FADD rd, rs, c[bank][offset] instead of FADD rd, rs, rt. Recent GPU architectures have increased the prevalence of immediate operands in ALU instructions, though, presumably to increase performance further.
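As a toy illustration (the exact instruction selection depends on the GPU architecture and compiler version), consider a kernel like this:

__global__ void add_const(float* data) {
    // Depending on the architecture, the literal 2.5f is either placed in a constant
    // bank and read directly by the FADD as a c[bank][offset] operand, or encoded as
    // an immediate operand of the instruction; it typically does not occupy a register.
    data[threadIdx.x] += 2.5f;
}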

The constant cache is designed for uniform access across a warp (by means of a broadcast feature), and access will be serialized if more than one address is presented to LDC across a warp.
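Here is a small sketch to make the distinction concrete (hypothetical kernel; the array and parameter names are placeholders):

__constant__ float table[32];

__global__ void const_access(float* out, int uniform_idx) {
    // Uniform access: every thread in the warp reads the same address,
    // so the constant cache can broadcast the value in a single transaction.
    float a = table[uniform_idx];

    // Divergent access: each lane presents a different address,
    // so the reads are serialized.
    float b = table[threadIdx.x % 32];

    out[threadIdx.x] = a + b;
}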

There are multiple constant memory banks (their size and numbering tend to differ between GPU architectures), and one of these is typically used for compile-time constant data the compiler extracts from code without it being present as literal constants. So it makes perfect sense to store a branch-offset table there. I have played a bit with other switch variants by now, and the compiler does not always use a table-based approach. There are probably a number of (architecture-dependent?) heuristics at work that drive the decision of which switch implementation style is used.

Constant memory is just an abstraction. Physically it is simply a separately mapped portion of regular off-die GPU memory, e.g. the GDDR6(X) or HBM2 chips connected to the GPU. So any access that does not hit the constant cache will incur a significant performance penalty.


Looking at the “Dissecting Turing” paper: [1903.07486v1] Dissecting the NVidia Turing T4 GPU via Microbenchmarking , comparing Tables 3.9 and 3.11 it would seem as though unserialised access to constant and shared memory is on a par, as long as the constant data is in the L1 cache. Admittedly, 3.9 refers to Turing, but Table 3.1 shows similar performance for a range of architectures, with shared offering the advantage of consistent latency vs the various constant cache levels.

The way I understand BRX usage, it would lend itself to the computed-branch approach in cases like the one in the question, as the labels here are dense and there is a small and similar-sized amount of code at each label. So the compiler could simply place the code for the labels at equally spaced address offsets, then multiply the label number by code size per label to compute the offset, instead of pulling the offset from constant memory. This would avoid memory access altogether.

So far I have not seen the CUDA compiler produce such code for a switch. Which doesn’t mean it cannot do that, just that I haven’t observed it in the less-than-methodically selected test cases I have thrown at it.