Array of constants indexed by constants

If I declare a variable in a device function like this

const float myarray[3][3] = { {1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}, {7.0f, 8.0f, 9.0f} };

and I access it only with constant indices, will the compiler simply insert the appropriate floating-point literals into the generated code, or will it actually put this array in some memory space or other? Needless to say, I want it to do the former. I have some nested loops and am using #pragma unroll to force the compiler to unroll them.

I see no reason why the compiler wouldn’t resolve the load to a literal argument.

The only way to be sure is to inspect your code’s SASS output with cuobjdump or nvdisasm.
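For reference, here is a minimal sketch of the pattern being discussed, so there is something concrete to disassemble. The names `weighted_sum` and `weighted_sum_kernel` and the file names in the comments are made up for illustration; only `myarray` comes from the question. Whether the loads really collapse to literals depends on the compiler version and flags, so treat this as something to test rather than a guarantee.

```cuda
// Hypothetical helper: multiplies a 3x3 input by a table of constants.
// After full unrolling, every myarray[i][j] has a compile-time index,
// which is what gives the compiler the chance to fold it to an immediate.
__host__ __device__ inline float weighted_sum(const float* in)
{
    const float myarray[3][3] = { {1.0f, 2.0f, 3.0f},
                                  {4.0f, 5.0f, 6.0f},
                                  {7.0f, 8.0f, 9.0f} };
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 3; ++i) {
        #pragma unroll
        for (int j = 0; j < 3; ++j) {
            acc += myarray[i][j] * in[i * 3 + j];
        }
    }
    return acc;
}

// Hypothetical kernel: each thread processes its own 9-float slice.
__global__ void weighted_sum_kernel(const float* in, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = weighted_sum(in + 9 * tid);
}

// One way to inspect the result (file names are placeholders):
//   nvcc -arch=sm_30 -cubin -o kernel.cubin kernel.cu
//   cuobjdump -sass kernel.cubin
// Then look at whether the multiply-adds take immediate operands or
// whether there are loads from constant or local memory.
```

The helper is marked `__host__ __device__` only so that the arithmetic can also be checked on the CPU; the SASS is what answers the original question.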

I also have an array of 54 float variables (per thread) which I’m again accessing using constant indices. I really need to keep these in registers. I know that’s quite a lot of registers, but I have a single large thread block (per multiprocessor) and I’m targeting sm_30, so I can use up to 63 registers per thread. The verbose output from the compiler says it’s using 30 registers, so presumably it has decided to put my array in local memory. Is there any way I can influence this or determine why it has done so? If I manually unroll all of my loops (a daunting prospect) and use individual named variables, am I likely to have more luck, or will the result be exactly the same?

There’s no way you’re going to fit a 54-float vector into registers on sm_30. There are typically about 10 registers of overhead, which already puts you at 64 registers, and that’s ignoring all the other variables in your code. If you wish to use more than 30 registers (which would make sense), try experimenting with launch bounds. A quick Google search gives you this reference:

Using that example, if you set MAX_THREADS_PER_BLOCK to a low figure (64 or 128), you should be able to use 63 registers, but again, that won’t hold the whole array. I imagine you will still get a speedup.
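A rough sketch of what the launch-bounds suggestion looks like in code. `__launch_bounds__` is the real CUDA qualifier; the kernel `scale_kernel`, the helper `blocks_for`, and the value 128 are placeholders chosen for illustration and are not taken from the poster's code.

```cuda
// Upper bound on threads per block that this kernel promises to respect.
// 128 is just one of the "low figures" mentioned above.
#define MAX_THREADS_PER_BLOCK 128

// Telling the compiler the maximum block size lets it budget registers
// for that size instead of assuming the largest possible block.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK)
scale_kernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: number of blocks needed to cover n elements
// when each block has MAX_THREADS_PER_BLOCK threads (rounds up).
inline int blocks_for(int n)
{
    return (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
}

// Example launch (d_data is assumed to be a device pointer to n floats):
//   scale_kernel<<<blocks_for(n), MAX_THREADS_PER_BLOCK>>>(d_data, 2.0f, n);
```

Note that a launch with more threads per block than the declared bound will fail, so the block size at the launch site has to stay in step with the macro.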

Really? The overhead has got that bad? I remember when you were only allowed to use 12 registers per thread total if you wanted maximum occupancy and you could do a lot with those 12 registers.

I can possibly reduce my array to 27 registers but I would expect that to almost halve the efficiency of my kernel. :-(

Thanks for the tip about launch bounds - I’ll look into that.

The overheads may be lower, as you have observed, but I do tend to code under the assumption that I have a lot of registers to burn, so maybe I use them a little too freely :) All I can say is trial and error. Good luck :)

I think I found my problem - I was running a Debug build and I still had the -G flag turned on in the compiler. It’s now showing 63 registers used and 0 bytes of stack. And it even runs! I haven’t checked the correctness of the results, of course…
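For anyone landing here later, a small sketch of the per-thread array pattern from this thread, with the build check spelled out in comments. The names `accumulate54` and `accumulate54_kernel` are invented for the example and the arithmetic is a stand-in. It is simple enough that an optimizing compiler may fold most of it away, so it shows the coding pattern rather than reproducing the register pressure of a real kernel.

```cuda
#define N_VALUES 54  // per-thread array size mentioned in the thread

// Hypothetical helper: fills a per-thread array and sums it.
// Both loops have compile-time trip counts and are fully unrolled, so
// every v[k] is accessed with a constant index. That is the precondition
// for the compiler to keep the array in registers instead of local memory.
__host__ __device__ inline float accumulate54(float seed)
{
    float v[N_VALUES];
    #pragma unroll
    for (int k = 0; k < N_VALUES; ++k) {
        v[k] = seed * (float)(k + 1);
    }
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < N_VALUES; ++k) {
        acc += v[k];
    }
    return acc;
}

// Hypothetical kernel wrapper around the helper.
__global__ void accumulate54_kernel(const float* seeds, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = accumulate54(seeds[tid]);
}

// Checking register and stack usage (file name is a placeholder):
//   nvcc -arch=sm_30 -Xptxas -v -c kernel.cu
// Compare the reported registers and stack/local memory bytes with and
// without -G. -G turns off device code optimizations, which is what
// produced the misleading numbers earlier in this thread.
```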