Loop through register array without using local memory

I have a kernel where I would like to loop through a register array to calculate some values. I can of course do this without a loop, but I want the kernel to be as general as possible.

Current code

float beta1 = 0.0f, beta2 = 0.0f;
for (int t = 0; t < DATA_T; t++)
{
    float temp = data[idx];
    beta1 += temp * c_Values[t];
    beta2 += temp * c_Values[offset + t];
}

I would like to change this code to

float beta[number_of_parameters] = {0.0f};
for (int t = 0; t < DATA_T; t++)
{
    float temp = data[idx];
    for (int i = 0; i < number_of_parameters; i++)
    {
        beta[i] += temp * c_Values[offset*i + t];
    }
}

but if I remember correctly, beta will then be placed in slow local memory, since the compiler does not know the indices at compile time. Is it possible to avoid this, or does it not matter on modern GPUs with an L1 cache? Can it be solved with dynamic allocation?

I am not aware of any modern processors that provide for indexing into the register file. If the array is small, you can put a #pragma unroll in front of the loop, and provided the loop has a trip count known at compile time, the compiler will likely unroll the loop completely. So number_of_parameters would have to be something like a #define’d constant or a template parameter. If the compiler unrolls the loop completely, all indices become compile-time constants, allowing the data to be placed into registers. Since the compiler needs to keep register pressure in mind, this will only work for arrays that are “sufficiently small”, so there is no guarantee this optimization kicks in. In my experience it seems to work just fine for arrays of less than ten 32-bit words; your mileage may vary.
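A minimal sketch of that templated-unroll idea (the names accumulate, NUM_PARAMS, and out are mine, not from the original post). Making the parameter count a template argument gives the inner loop a trip count known at compile time, so #pragma unroll can unroll it completely and every beta[i] index becomes a literal, letting beta stay in registers. Shown here as a plain C++ function; in CUDA the same pattern would be a __global__ kernel instantiated per size, e.g. accumulate<4><<<grid, block>>>(...):

```cpp
// Sketch: parameter count as a compile-time template argument.
template <int NUM_PARAMS>
void accumulate(const float *data, const float *c_Values,
                int offset, int data_t, float *out)
{
    float beta[NUM_PARAMS] = {0.0f};  // register candidates after unrolling

    for (int t = 0; t < data_t; t++) {
        float temp = data[t];
#pragma unroll  // ignored by host compilers; honored by nvcc
        for (int i = 0; i < NUM_PARAMS; i++)
            beta[i] += temp * c_Values[offset * i + t];
    }
    for (int i = 0; i < NUM_PARAMS; i++)
        out[i] = beta[i];
}
```

Each value of NUM_PARAMS produces a separately compiled instantiation, so the compiler sees a fixed trip count in every one.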

Can I make a template for different values of number_of_parameters?

Make sure you inspect the generated PTX to verify that no local memory is being used.
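For example (assuming nvcc is on your PATH and the kernel lives in a file named kernel.cu; any .local declarations or ld.local/st.local instructions in the PTX mean the array was spilled to local memory):

```shell
# Emit PTX for the kernel (the file name kernel.cu is an assumption).
nvcc -arch=sm_35 -ptx kernel.cu -o kernel.ptx
# Flag local-memory declarations and loads/stores; no matches means
# the beta array stayed in registers.
grep -nE '\.local|ld\.local|st\.local' kernel.ptx
```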

If the compiler won’t cooperate, and your DATA and BETA sizes are known (even if they vary per iteration), and you’re sure everything can fit in registers, then you might consider generating the entire loop using X-Macros.

[ I’m sure I’ve mentioned before on this forum that I find X-Macros incredibly useful for generating CUDA code that isn’t easily expressed in loops or when the compiler won’t generate what I want. My sorting library really exploits X-Macros. :) ]
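For illustration, a minimal host-side sketch of the X-Macro technique (the BETA_LIST macro and the compute_betas name are hypothetical, not from any real library). Each accumulator becomes a separately named scalar, so no indexable array exists at all; in a kernel the beta0..beta3 scalars would simply live in registers:

```c
/* Hypothetical parameter list: one X(i) entry per accumulator. Expanding
 * BETA_LIST under different definitions of X generates the declarations
 * and the fully unrolled loop body with literal indices. */
#define BETA_LIST X(0) X(1) X(2) X(3)

static void compute_betas(const float *data, const float *c_Values,
                          int offset, int data_t, float *out)
{
    /* One named scalar per parameter -- register candidates. */
#define X(i) float beta##i = 0.0f;
    BETA_LIST
#undef X

    for (int t = 0; t < data_t; t++) {
        float temp = data[t];
        /* Unrolled body: every c_Values index is a compile-time constant. */
#define X(i) beta##i += temp * c_Values[offset * (i) + t];
        BETA_LIST
#undef X
    }

    /* Write the results back out. */
#define X(i) out[i] = beta##i;
    BETA_LIST
#undef X
}
```

Changing the number of parameters is then just a matter of editing (or generating) the BETA_LIST line.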

Since I use the K20c and GTX 680, it was interesting to see your HotSort (from your sorting library) results against thrust::sort(), in particular the performance ratio when n < 2^20. Looks like good work, as I have always been impressed by thrust::sort() (using device pointers, not device_vector<>).