I really don’t like it, since it uses too many registers and lowers my occupancy. Is there any way to prevent the compiler from doing so?
Also, the additional loads are never executed, because the array has a fixed size. I thought that making WIDTH and BLOCK_DIM compile-time constants would fix that, but somehow the compiler still adds loads that use registers and are never executed.
To keep a loop fully rolled, insert the following on the line just before the relevant for or while statement:
#pragma unroll 1
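As a minimal sketch of where the pragma goes, here is a small self-contained program. The kernel rowSum, the values chosen for WIDTH and BLOCK_DIM, and the host code are all made up for illustration; only the names WIDTH and BLOCK_DIM come from the question, and the actual kernel there will look different.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Names taken from the question; the values are assumptions for this sketch.
constexpr int WIDTH = 64;
constexpr int BLOCK_DIM = 128;

// Hypothetical kernel: each thread sums the WIDTH elements of one row.
__global__ void rowSum(const float* in, float* out, int rows)
{
    int row = blockIdx.x * BLOCK_DIM + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    // Request an unroll factor of 1, i.e. keep the loop rolled,
    // even though the trip count is a compile-time constant.
    #pragma unroll 1
    for (int i = 0; i < WIDTH; ++i) {
        acc += in[row * WIDTH + i];
    }
    out[row] = acc;
}

int main()
{
    const int rows = 256;
    float* in = nullptr;
    float* out = nullptr;
    cudaMallocManaged(&in, rows * WIDTH * sizeof(float));
    cudaMallocManaged(&out, rows * sizeof(float));
    for (int i = 0; i < rows * WIDTH; ++i) in[i] = 1.0f;

    int blocks = (rows + BLOCK_DIM - 1) / BLOCK_DIM;
    rowSum<<<blocks, BLOCK_DIM>>>(in, out, rows);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

To see whether the pragma actually changes register usage, one option is to build the kernel with and without it using nvcc -Xptxas -v, which prints the per-kernel register count, and compare the two numbers.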
Recent compiler versions (last couple of years?) unroll more aggressively than older ones. This has annoyed me at times. However, according to careful measurements I took in several of these cases, the compiler made the right decision: the unrolled versions were faster than the rolled loop, though often barely so (around 1-2%).