Why inline instead of register? using register memory


To my understanding, my kernel is supposed to store the result of a calculation in a register variable. But instead it seems that it repeats the same calculation each time I refer to this variable.

In reality my calculations are much more complex, but to better illustrate my issue I use a simplified example:

__global__ void myKernel(float arg1, float arg2, float arg3, float *globalArray)
{
    int idx = … ;

    float reg1 = __cosf(arg1) + arg2 * __sinf(arg1) / -arg3;
    float reg2 = __sinf(arg1) + arg3 * __cosf(arg1) / -arg2;

    float reg3 = reg1 * reg2 + 128.0f;
    float reg4 = reg1 / reg2 - 128.0f;

    globalArray[idx] = reg3 * reg4;
}


This kernel should store two intermediate results, in reg1 and reg2, and when calculating reg3 and reg4 I just want to reuse the precalculated values.

But for some reason there is no performance difference compared to the following kernel:

__global__ void myKernel(float arg1, float arg2, float arg3, float *globalArray)
{
    int idx = … ;

    // float reg1 = __cosf(arg1) + arg2 * __sinf(arg1) / -arg3;
    // float reg2 = __sinf(arg1) + arg3 * __cosf(arg1) / -arg2;

    float reg3 = (__cosf(arg1) + arg2 * __sinf(arg1) / -arg3) * (__sinf(arg1) + arg3 * __cosf(arg1) / -arg2) + 128.0f;
    float reg4 = (__cosf(arg1) + arg2 * __sinf(arg1) / -arg3) / (__sinf(arg1) + arg3 * __cosf(arg1) / -arg2) - 128.0f;

    globalArray[idx] = reg3 * reg4;
}


As mentioned, it seems that my kernel treats the variables reg1 and reg2 as some sort of inline methods.

Does anyone know how to avoid that?

Is it possible that this happens when I use too many register variables in my kernel?

Best regards, rob

Each of your threads will do the calculation (remember, the kernel is executed in parallel).

You can declare the variables as shared, but if reg1 and reg2 don't depend on the thread id (in your code they do not), I would calculate them on the host and pass their values as kernel parameters.
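The suggestion above can be sketched like this (a minimal sketch based on the simplified example; the `launch` wrapper and grid configuration are made up for illustration, and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// The thread-independent terms are now plain kernel parameters.
__global__ void myKernel(float reg1, float reg2, float *globalArray)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float reg3 = reg1 * reg2 + 128.0f;
    float reg4 = reg1 / reg2 - 128.0f;

    globalArray[idx] = reg3 * reg4;
}

void launch(float arg1, float arg2, float arg3, float *d_out, int n)
{
    // Evaluated once on the CPU instead of once per GPU thread.
    float reg1 = cosf(arg1) + arg2 * sinf(arg1) / -arg3;
    float reg2 = sinf(arg1) + arg3 * cosf(arg1) / -arg2;

    myKernel<<<n / 256, 256>>>(reg1, reg2, d_out);
}
```

Note this also trades the fast-but-approximate `__sinf`/`__cosf` intrinsics for the full-precision host `sinf`/`cosf`, which is usually an improvement when the values are computed only once.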

Like Noel said, if you have parameters that are independent of the block/thread indices, then either pass them as kernel parameters or place them in constant memory.
Now, regarding your question: you shouldn't be asking why the register path is as slow as the recalculation path; instead you should wonder why the recalculation path is as fast as the register path.
If you look at decuda's output, you'll see that in both cases the sin and cos functions are never evaluated more than once. Be glad the compiler is smart enough to optimize the 'inefficient' code by itself :)
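For the constant-memory route mentioned above, a minimal sketch (names like `c_reg` and `setup` are illustrative, error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Thread-independent values in constant memory: every thread reads the
// same cached values, so they cost no per-thread registers or FLOPs.
__constant__ float c_reg[2];

__global__ void myKernel(float *globalArray)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float reg3 = c_reg[0] * c_reg[1] + 128.0f;
    float reg4 = c_reg[0] / c_reg[1] - 128.0f;

    globalArray[idx] = reg3 * reg4;
}

void setup(float arg1, float arg2, float arg3)
{
    float h_reg[2];
    h_reg[0] = cosf(arg1) + arg2 * sinf(arg1) / -arg3;
    h_reg[1] = sinf(arg1) + arg3 * cosf(arg1) / -arg2;

    // Copy the host-computed values into the constant-memory symbol.
    cudaMemcpyToSymbol(c_reg, h_reg, sizeof(h_reg));
}
```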


Gee, look up the “volatile trick” in these forums.

There’s hardly any forum thread these days where recommending the “volatile” keyword would be inappropriate ;)
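For readers who haven't met the "volatile trick": qualifying a variable as `volatile` forbids the compiler from folding its defining expression back into every use, so the value must actually be materialized. A sketch applied to the simplified kernel above (whether it helps here is doubtful, since as noted the recombined code is just as fast):

```cuda
__global__ void myKernel(float arg1, float arg2, float arg3, float *globalArray)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // volatile forces the compiler to keep these as real values
    // instead of re-expanding ("inlining") the expressions at each use.
    volatile float reg1 = __cosf(arg1) + arg2 * __sinf(arg1) / -arg3;
    volatile float reg2 = __sinf(arg1) + arg3 * __cosf(arg1) / -arg2;

    float reg3 = reg1 * reg2 + 128.0f;
    float reg4 = reg1 / reg2 - 128.0f;

    globalArray[idx] = reg3 * reg4;
}
```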


How are you measuring the performance? What is the execution time? How many threads are in your grid?

Without looking at the PTX/decuda output it is hard to know, but one possibility is that the compiler is automatically recombining the calculation in the second case. The static single assignment form nvcc uses as an intermediate representation makes that type of compiler optimization trivial.

Another possibility is that even with all those FLOPS, you are memory bandwidth bound, and thus the performance will not change even if you double the floating point calculations.

Lastly, it is possible that you are measuring the time incorrectly (i.e., without a cudaThreadSynchronize you would just be measuring the asynchronous kernel launch overhead). Or if you are timing a very small grid, then you could also be measuring only the very small kernel launch overhead.
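A minimal timing sketch that avoids the launch-overhead pitfall, using CUDA events (the `timeKernel` wrapper, grid size, and argument values are illustrative; it assumes a kernel like the one in the original post):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

void timeKernel(float *d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<n / 256, 256>>>(1.0f, 2.0f, 3.0f, d_out);
    cudaEventRecord(stop, 0);

    // Without this synchronization (or a cudaThreadSynchronize),
    // a host-side timer would only capture the async launch overhead.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Events are recorded on the GPU itself, so they measure the kernel's actual execution time rather than whatever the host happened to be doing.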