I have a frustrating problem: the compiler keeps some intermediate results in registers while I am trying to lower register usage (because of occupancy issues).
At the beginning of the kernel I copy a 2-D tile from global memory to shared memory.
    #define g_ofs ((((bid.y << columndiml2) + bid.x) << bdiml2) + (tid.y<<columndiml2) + tid.x)
    #define s_ofs ((tid.y<<bdiml2) + tid.x)
    #define xstep (1<<(bdiml2-1))
    #define systep (xstep << bdiml2)
    #define gystep (xstep << columndiml2)

    shared[s_ofs] = data[g_ofs];
    shared[s_ofs+xstep] = data[g_ofs+xstep];
    shared[s_ofs+systep] = data[g_ofs+gystep];
    shared[s_ofs+systep+xstep] = data[g_ofs+gystep+xstep];
    __syncthreads();
Then comes the body of the kernel, which does not use g_ofs, s_ofs, etc. at all. At the end, after processing, I write the tile back:
    data[g_ofs] = shared[s_ofs];
    data[g_ofs+xstep] = shared[s_ofs+xstep];
    data[g_ofs+gystep] = shared[s_ofs+systep];
    data[g_ofs+gystep+xstep] = shared[s_ofs+systep+xstep];
I would have expected the macros to be re-evaluated at each use. Unfortunately, the compiler is “smart” enough to keep the results of the offset calculations around until the write-back, wasting 6 registers along the way.
Is there some compiler flag, pragma, or other way of hinting to the compiler that I don’t want to keep these (trivial) results in registers and would rather have them recalculated?
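In case the index arithmetic in the macros is hard to follow, here is a small host-side sketch of it (plain C++, no GPU needed). It is only an illustration and rests on assumptions that are not stated above: each block is launched with (B/2) x (B/2) threads where B = 1 << bdiml2, the global array is W x W with W = 1 << columndiml2, and the grid is (W/B) x (W/B) blocks. The names `Idx2`, `global_offset`, `shared_offset` and `tile_copy_is_exact` are hypothetical stand-ins, not part of the kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the bid / tid values used by the macros.
struct Idx2 { unsigned x, y; };

// Same arithmetic as the g_ofs macro, written as a function.
unsigned global_offset(Idx2 bid, Idx2 tid, unsigned bdiml2, unsigned columndiml2) {
    return (((bid.y << columndiml2) + bid.x) << bdiml2) + (tid.y << columndiml2) + tid.x;
}

// Same arithmetic as the s_ofs macro.
unsigned shared_offset(Idx2 tid, unsigned bdiml2) {
    return (tid.y << bdiml2) + tid.x;
}

// Checks, for every block, that the four copies per thread touch each element
// of the B x B shared tile exactly once and pair it with the matching element
// of the global array. Launch shape and array size are assumptions (see text).
bool tile_copy_is_exact(unsigned bdiml2, unsigned columndiml2) {
    assert(bdiml2 >= 1 && columndiml2 >= bdiml2);
    const unsigned B = 1u << bdiml2;          // tile side
    const unsigned W = 1u << columndiml2;     // row length of the global array
    const unsigned xstep  = 1u << (bdiml2 - 1);
    const unsigned systep = xstep << bdiml2;
    const unsigned gystep = xstep << columndiml2;
    const unsigned blocks = W / B;            // assumed grid: blocks x blocks

    for (unsigned by = 0; by < blocks; ++by) {
        for (unsigned bx = 0; bx < blocks; ++bx) {
            std::vector<int> hits(B * B, 0);
            const Idx2 bid = {bx, by};
            for (unsigned ty = 0; ty < xstep; ++ty) {
                for (unsigned tx = 0; tx < xstep; ++tx) {
                    const Idx2 tid = {tx, ty};
                    const unsigned s = shared_offset(tid, bdiml2);
                    const unsigned g = global_offset(bid, tid, bdiml2, columndiml2);
                    const unsigned so[4] = {s, s + xstep, s + systep, s + systep + xstep};
                    const unsigned go[4] = {g, g + xstep, g + gystep, g + gystep + xstep};
                    for (int k = 0; k < 4; ++k) {
                        ++hits[so[k]];
                        const unsigned row = so[k] >> bdiml2;   // row inside the tile
                        const unsigned col = so[k] & (B - 1);   // column inside the tile
                        const unsigned expected = (by * B + row) * W + bx * B + col;
                        if (go[k] != expected) return false;
                    }
                }
            }
            for (int h : hits) {
                if (h != 1) return false;     // element missed or copied twice
            }
        }
    }
    return true;
}
```

Calling `tile_copy_is_exact(3, 5)`, for example, walks an 8 x 8 tile over a 32 x 32 array. The point is just that each thread needs two base offsets plus three steps, which is the handful of values the compiler ends up holding in registers between the load and the write-back.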