Forcing the compiler to place variables in registers and not in local memory

I’m having a problem with the CUDA compiler. It places some of my variables in local memory instead of in the registers. I have read some old posts about this problem, but none that were answered later than 2013.

Have there been any improvements in the way we can control where a thread’s variables are placed on the device? I think my performance is being severely hamstrung by the compiler’s need to place some variables in local memory instead of in registers.

typedef struct {
	short pos;
	short backtrack_it;
	float steps_from_start;
	float euclidean_dist;
} node;

node register_list[16];	//placed in local memory

//copy some data into register_list from global memory...
memcpy(register_list, &some_global_list[somewhere], 16 * sizeof(node));

float lowest_dist_found = 99999;	//placed in register
int lowest_iterator = -1;		//placed in register
node best_node;				//placed in local memory
for(int i = 0; i < 16; ++i){
	if(register_list[i].steps_from_start + register_list[i].euclidean_dist < lowest_dist_found){
		lowest_dist_found = register_list[i].steps_from_start + register_list[i].euclidean_dist;
		lowest_iterator = i;
		best_node = register_list[i];
	}
}

//do some more stuff with the data...

This operation is very slow, and if I could place the variables of type “node” into registers, it would probably improve my performance significantly. I have checked, and I have enough register space to accommodate the array. But even so, the compiler chooses to put the lone “node best_node” into local memory, which is really annoying.
To my knowledge, the fact that I’m accessing the node array dynamically in the loop should not force the array into local memory, since the loop will be unrolled anyway…

Is there a way to either explicitly place variables into registers or perhaps hint at the compiler to be more lenient?
Thanks for any help you can give me!

There is no way to explicitly place variables into registers. The only hint you can give the compiler in this regard is register limiting, either via a compiler switch (-maxrregcount) or a function attribute (__launch_bounds__).

However, both of these (vs. not using them) can only limit the compiler’s use of registers, not increase it.
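For illustration, here is a minimal sketch of the __launch_bounds__ attribute; the kernel name and the specific bound values are made up, not taken from the code above:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells ptxas the launch configuration it may assume, which caps the
// number of registers allocated per thread. It cannot force the
// compiler to use MORE registers than it otherwise would.
__global__ void __launch_bounds__(256, 4)
my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}
```

The command-line equivalent, nvcc -maxrregcount=32 (pick your own cap), applies a per-thread register limit to every kernel in the compilation unit, whereas __launch_bounds__ applies per kernel.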

The compiler generally attempts to use registers to achieve maximum performance and will, in my experience, use more registers if it thinks by doing so performance will go up.

There may be cases where the compiler doesn’t do its job well. I’m not sure such a case can be declared based on your description and code snippet; more analysis is needed. Nevertheless, you’re welcome to file a bug. The instructions are linked in a sticky post at the top of this sub-forum.

Certain situations prevent the use of registers, such as indexing into an array where the index cannot be computed at compile time (avoiding that may also require the ability to fully unroll loops). I think complex structures may also be “not easy” for the compiler to map into registers, but I haven’t studied that case too much, nor have I studied your code very much.
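If the dynamic indexing is the blocker, one thing worth trying (the compiler is free to ignore the hint, so no guarantees) is forcing full unrolling so every index into register_list becomes a compile-time constant:

```cuda
// After full unrolling, each register_list[i] access uses a constant
// index, which makes the array eligible for register allocation.
#pragma unroll 16
for (int i = 0; i < 16; ++i) {
    float d = register_list[i].steps_from_start
            + register_list[i].euclidean_dist;
    if (d < lowest_dist_found) {
        lowest_dist_found = d;
        lowest_iterator = i;
    }
}
```

Whether the array actually lands in registers can then be verified by inspecting the SASS, e.g. with cuobjdump --dump-sass.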

Thanks for the quick response!
It seems weird to me that the compiler does not actually count the available register space, since I should have plenty (20 B) left over even if none of my variables were placed in local memory.

So I guess my best course of action is to manually unroll the loop and split my node struct into smaller pieces. And if that fails, I will split my register_list array into individual variables (YAY!).
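The struct-splitting idea, sketched with made-up names: keep each field in its own fixed-size array (structure-of-arrays), so the compiler only has to map scalar arrays rather than a compound type into registers:

```cuda
// One scalar array per former "node" member.
short pos[16];
short backtrack_it[16];
float steps_from_start[16];
float euclidean_dist[16];

// The search loop then touches only scalar arrays; with full unrolling
// every index is a compile-time constant.
float lowest_dist_found = 99999.0f;
int   lowest_iterator   = -1;
#pragma unroll
for (int i = 0; i < 16; ++i) {
    float d = steps_from_start[i] + euclidean_dist[i];
    if (d < lowest_dist_found) {
        lowest_dist_found = d;
        lowest_iterator   = i;
    }
}
```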

Upon further tinkering with structs I discovered something interesting.

int normal_int = 1;	//placed in register
float normal_float = 3;	//placed in register

struct{ int value; } test_int;		//placed in local memory
test_int.value = 2;
struct{ float value; } test_float;	//placed in local memory
test_float.value = 4;

It does not matter how the structures are arranged. All variables of a struct type will be placed in local memory. Is this intended behaviour, or something strange my machine is doing?

Are you looking at the code generated for a release build? Placing (thread-)local variables into a register is an optimization, by default such variables live in (thread-)local memory. What you observe may also depend on code context.

In my simple test case below, a single-‘float’ struct is placed in a register (specifically, R0):

C:\Users\Norbert\My Programs>cat
__global__ void kernel (float x, float *res)
{
    struct { float value; } p;

    p.value = 0.33333333f;
    p.value = fmaf (p.value, x, 0.5f);
    p.value = fmaf (p.value, x, 1.0f);
    *res = p.value;
}

C:\Users\Norbert\My Programs>nvcc -c -arch=sm_61 -o struct_reg.obj
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
support for Microsoft Visual Studio 2010 has been deprecated!

C:\Users\Norbert\My Programs>cuobjdump --dump-sass struct_reg.obj

Fatbin elf code:
arch = sm_61
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

        code for sm_61
                Function : _Z6kernelfPf
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                    /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];            /* 0x4c98078000870001 */
        /*0010*/                   MOV R0, c[0x0][0x140];           /* 0x4c98078005070000 */
        /*0018*/                   MOV32I R2, 0x3f000000;           /* 0x0103f0000007f002 */
                                                                    /* 0x001fc400fe8007f1 */
        /*0028*/                   MOV32I R5, 0x3f800000;           /* 0x0103f8000007f005 */
        /*0030*/                   MOV R3, c[0x0][0x14c];           /* 0x4c98078005370003 */
        /*0038*/                   FFMA R0, R0, c[0x2][0x0], R2;    /* 0x4980010800070000 */
                                                                    /* 0x0003c400fe4007e5 */
        /*0048*/                   MOV R2, c[0x0][0x148];           /* 0x4c98078005270002 */
        /*0050*/                   FFMA R0, R0, c[0x0][0x140], R5;  /* 0x4980028005070000 */
        /*0058*/                   STG.E [R2], R0;                  /* 0xeedc200000070200 */
                                                                    /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                            /* 0xe30000000007000f */
        /*0070*/                   BRA 0x70;                        /* 0xe2400fffff87000f */
        /*0078*/                   NOP;                             /* 0x50b0000000070f00 */

Hello njuffa.
I get the same PTX as you when compiling your example (although I don’t understand the generated code). But when running the VS2017 Nsight legacy debugger, I can see that when I create a variable of a struct with a single float, the compiler will not place the variable in a register, and my performance tests also point to that being the case.
Do you know if there might be something else in a project that can have an effect on CUDA’s optimizations? Sorry if this is a noob question; I have never played around with CUDA compilation before.

When you are using the debugger, you are presumably using a debug build. For CUDA debug builds, all optimizations are turned off. Since keeping local variables in registers is an optimization, I would not expect to see any variable pretty much permanently stored in registers in that case. Based on observation, the compiler may even use “pessimizations”, in order to ensure that each variable can be tracked during debugging.
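For comparison, the two build modes side by side (the source file name kernel.cu is a placeholder):

```shell
# Debug build: -G disables device-code optimization so the debugger
# can track every variable; locals stay in local memory.
nvcc -G -arch=sm_61 -c kernel.cu -o kernel_debug.obj

# Release build (default optimization): locals become candidates
# for register allocation.
nvcc -arch=sm_61 -c kernel.cu -o kernel_release.obj
```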

In optimized code from a release build, variables from the source code may disappear completely, there may be new variables created by the compiler, and what register holds which variable can change multiple times within a fairly small amount of code.

When analyzing code, you almost never (exception: suspected compiler bugs) want to look at the generated PTX. PTX is an intermediate compiler representation and virtual ISA, and is compiled by an optimizing compiler, ptxas, into machine code (SASS). The thing you would want to look at is SASS.

Generally, you wouldn’t want to do the kind of low-level analysis we are discussing here when you are a noob. Trust the compiler, as it will “do the right thing” 95% of the time, and even the remaining 5% of the time it will do a somewhat reasonable job. Instead, adjust your brain to parallel programming paradigms, take a look at the Best Practices Guide, and acquaint yourself with the CUDA profiler to point you to bottlenecks in your code.