Force data into register and not lmem?

I have a kernel where each thread does almost similar computation (using data stored in smem) on 8 or 16 floats.

These are stored in an array of floats which are looped over. Given my occupancy and block size I have plenty registers to spend to keep the entire array in registers at all times, however, even with --maxrregcount set properly the compiler insists that it wants to use local mem for this array, which more than halves the performance of the kernel.

The speed reduction can be easily verified by just accessing one element, but performing all the computations instead.
The profiler also indicate that a large number of local loads/stores is taking place when using the array.

I observe a speed benefit when storing the data in smem, but this reduce my block size and/or occupancy.

Johan Seland, PhD

I ran into the same issue (see this post )

My friend and colleauge Christopher Dyken was quick to point out that ptx spec says:

Hence arrays alwayes end up in local memory.

I have now sucessfully used Boost.preprocessor to generate variables of the type

float f_0;

float f_1;



and my loop was transormed into a macro indexing the right variable name. Works like a charm.

Johan Seland, PhD


Very interesting. I currently use a home-made template function to do the exact same thing. Could you point me to which part of the Preprocessor library you use, and a small code example perhaps? :)

No problem.

#include <boost/preprocessor/iteration/local.hpp>

#define bsIQ_MAX_SIZE 16

#define bsIQ_def( z, n, data ) float2 bsIQ_##n = make_float2( 0.0f, 0.0f );

#define bsIQ_loop( z, n, unused ) \

f = 1.0f*n/bsIQ_MAX_SIZE \

bs_IQ_##n += f*cos( f ); \

__global__ void foo() {

// Allocate macros

#define BOOST_PP_LOCAL_MACRO(n) bsIQ_def( ~, n, ~ )



// Allocate variable for loop

float f;

#define BOOST_PP_LOCAL_MACRO(n) bsIQ_loop( ~, n, ~ )




Thanks a lot. We were about to commit to the home-brewed template solution but this gives many advantages. Now I just have to come up with a clever way of

creating nested iterations (for (i = …) for (j=…) ) to operate on matrices.

I guess there is no possible way to create the variable names at runtime is there? I thought for a few minutes that this would be a perfect solution for one of the limitations in my code where I currently use shared mem, as the shared mem is too small for the final project. Unfortunately, I remembered that I have to access random points in the array based on results generated at runtime, where this doesn’t work too well.

Good to have a simple solution to use the registers for linear array accesses though, maybe I’ll be able to use it in a later project.

Instead of naming diff variables for diff val, is not it good to have device fn where all the arrays are not spilled to local mem.