Do switch statements require gmem reads for the jump table?

I would like to have an array of fewer than 10 floats stored in registers in a CUDA kernel. I don’t know the indexes at compile time (so the compiler can’t keep a plain array in registers), though I know for certain that they will all range from 0 to 10.

Below I have written code that indexes a storage struct, using a switch statement to determine which variable to set/get.

I know a switch statement requires a jump table, so would that need to be stored in global memory?

Assuming each thread in a warp has the same index, I am wondering whether the code below would be efficient for retrieving an element from a register, or whether it would cost the time of a global memory read for the switch table, since I’m not sure whether jump tables are placed in the instruction cache or in gmem.

struct Storage {
    float r0, r1, r2, r3;

    __device__ float get(int index) {
        switch (index) {
        case 0:  return r0;
        case 1:  return r1;
        case 2:  return r2;
        case 3:  return r3;
        default: return 0.0f; // index is guaranteed to be in range; this just avoids falling off the end
        }
    }

    __device__ void set(int index, float value) {
        switch (index) {
        case 0: r0 = value; return;
        case 1: r1 = value; return;
        case 2: r2 = value; return;
        case 3: r3 = value; return;
        }
    }
};
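For reference, here is a rough sketch of how I intend to use it (the kernel and variable names are just placeholders); the index comes from per-block data, so it is uniform across each warp:

__global__ void demo(const int* index_per_block, float* out) {
    // Each accessor names a specific member, so nothing forces the struct into local memory.
    Storage s;
    s.set(0, 1.0f);
    s.set(1, 2.0f);
    s.set(2, 3.0f);
    s.set(3, 4.0f);

    int idx = index_per_block[blockIdx.x];   // warp-uniform runtime index
    out[blockIdx.x * blockDim.x + threadIdx.x] = s.get(idx);
}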

Thanks for any input.

Generally speaking, there are a variety of techniques for implementing switch statements. These could be if-then-else sequences or trees, jump tables, or computed branches. Which one(s) a compiler chooses may depend on the number of labels, whether the labels are dense, the code associated with each label, whether there is a default label, and of course the target processor architecture.
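For illustration, the if-then-else variant of the accessor from the question could be written like this at source level (as an extra member of the Storage struct; the function name is just illustrative, and it relies on the index being limited to 0-3):

__device__ float get_if_else(int index) {
    // If-then-else sequence: no table lookup, just compares and branches,
    // which the compiler may also turn into predicated selects.
    if (index == 0) return r0;
    if (index == 1) return r1;
    if (index == 2) return r2;
    return r3;
}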

It would be best if you looked at the generated code for the specific target architecture and in the specific code context of your use case. You can look at the generated machine code with cuobjdump --dump-sass
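For example, something along these lines (file name and target architecture are placeholders, adjust them to your build):

nvcc -arch=sm_75 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin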

A quick test with the simple example provided for a couple of GPU architectures shows a SASS sequence like the following:

        /*0038*/                   SHF.L R0, RZ, 0x2, R0;
        /*0048*/                   LDC R2, c[0x2][R0];
        /*0050*/                   BRX R2 -0x58;

This indicates that in this case, the switch is implemented via a table which is stored in constant memory, specifically constant bank 2. Accesses via LDC are generally backed by a small cache. The usage pattern of the BRX instruction suggests that the table does not store target addresses, but branch offsets, and that BRX adds this offset to the starting address for the switch (0x58 in the example above).


Thanks for the lengthy reply.

I still have to get around to checking the disassembly for my kernels, but I guess I didn’t expect the jump table to be stored in constant memory.

I think I heard that cached constant memory approaches the speed of shared memory, so if that is the case, I may as well just store the data in shared memory instead of attempting a register array. (I still have to check the disassembly for my final kernel to see how the compiler implements the jump table.)
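Something along these lines is what I have in mind for the shared-memory variant (just a sketch, assuming 4 values per thread and a launch with blockDim.x * 4 * sizeof(float) bytes of dynamic shared memory; names are placeholders):

__global__ void demo_smem(const int* index_per_block, float* out) {
    extern __shared__ float scratch[];

    // Interleaved layout: element i of thread t lives at scratch[i * blockDim.x + t],
    // so a warp reading the same logical index touches consecutive banks (no conflicts).
    for (int i = 0; i < 4; ++i)
        scratch[i * blockDim.x + threadIdx.x] = (float)i;

    // Each thread only touches its own slice, so no __syncthreads() is needed here.
    int idx = index_per_block[blockIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[idx * blockDim.x + threadIdx.x];
}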

As far as I recall, cached constant memory data approaches the speed of a register access, and the CUDA compiler frequently uses it in that capacity. For example, instead of loading an immediate into a register and then using that register, it will place the immediate data in constant memory and access it directly with an ALU instruction that can substitute a constant memory reference for a register operand (in x86 parlance this would be a “load-execute” instruction), e.g. FADD rd, rs, c[bank][offset] instead of FADD rd, rs, rt. Recent GPU architectures have increased the prevalence of immediate operands in ALU instructions, though, presumably to increase performance further.
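As a toy illustration (the exact instruction selection depends on the GPU architecture and compiler version), consider a kernel like this:

__global__ void add_const(float* data) {
    // Depending on the architecture, the literal 2.5f is either placed in a constant
    // bank and read directly by the FADD as a c[bank][offset] operand, or encoded as
    // an immediate operand of the instruction; it typically does not occupy a register.
    data[threadIdx.x] += 2.5f;
}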

The constant cache is designed for uniform access across a warp (by means of a broadcast feature), and access will be serialized if more than one address is presented to LDC across a warp.
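Here is a small sketch to make the distinction concrete (hypothetical kernel; the array and parameter names are placeholders):

__constant__ float table[32];

__global__ void const_access(float* out, int uniform_idx) {
    // Uniform access: every thread in the warp reads the same address,
    // so the constant cache can broadcast the value in a single transaction.
    float a = table[uniform_idx];

    // Divergent access: each lane presents a different address,
    // so the reads are serialized.
    float b = table[threadIdx.x % 32];

    out[threadIdx.x] = a + b;
}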

There are multiple constant memory banks (their size and numbering tend to differ between GPU architectures), and one of these is typically used for compile-time constant data the compiler extracts from code without it being present as literal constants. So it makes perfect sense to store a branch-offset table there. I have played a bit with other switch variants by now, and the compiler does not always use a table-based approach. There are probably a number of (architecture-dependent?) heuristics at work that drive the decision of which switch implementation style is used.

Constant memory is just an abstraction. Physically it is simply a separately mapped portion of regular off-die GPU memory, e.g. the GDDR6(X) or HBM2 chips connected to the GPU. So any access that does not hit the constant cache will incur a significant performance penalty.


Looking at the “Dissecting Turing” paper: [1903.07486v1] Dissecting the NVidia Turing T4 GPU via Microbenchmarking , comparing Tables 3.9 and 3.11 it would seem as though unserialised access to constant and shared memory is on a par, as long as the constant data is in the L1 cache. Admittedly, 3.9 refers to Turing, but Table 3.1 shows similar performance for a range of architectures, with shared offering the advantage of consistent latency vs the various constant cache levels.

The way I understand BRX usage, it would lend itself to the computed-branch approach in cases like the one in the question, as the labels here are dense and there is a small and similar-sized amount of code at each label. So the compiler could simply place the code for the labels at equally spaced address offsets, then multiply the label number by code size per label to compute the offset, instead of pulling the offset from constant memory. This would avoid memory access altogether.

So far I have not seen the CUDA compiler produce such code for a switch. Which doesn’t mean it cannot do that, just that I haven’t observed it in the less-than-methodically selected test cases I have thrown at it.