local thread memory & compiler

Hi

Is there any way to prevent the compiler from using local thread memory?
I have a large kernel, and compiling it for profile 1.1
gives me: 60 regs, 4 bytes of lmem,
BUT compiling it for profile 1.3 gives me 62 regs, 0 bytes of lmem,
and in the second case performance is about 20% better (on the same machine).
Examining the PTX shows that those 4 bytes of lmem are used for an (int) loop counter.
WTF?
The kernel looks like this:

__device__ void CastRay(float3 &rorg, float3 &rdir, float4 &rcol)
{
    // … modify rorg, rdir, rcol …
}

__global__ void kernel()
{
    float3 RayPosition = gCameraPosition;
    float3 RayDir = …;
    float4 RayColor = …;
    for (int i = 0; i < 8; i++) // ← here ‘i’ is in local memory for the 1.1 profile
    {
        CastRay(RayPosition, RayDir, RayColor);
        if (RayColor.w < EPSILON)
            break;
    }
    // … some other operations
}

By “profile” I assume you mean the -arch flag?

The difference is probably due to the increased number of registers available on 1.3 GPUs.

I don’t know how to force the compiler to use more registers (without working at a lower level). In this case getting rid of the loop counter entirely by unrolling the loop might solve your problem.
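For example, something like this on the loop you posted (just a sketch; the compiler may refuse to unroll a loop that contains an early break, in which case repeating the body eight times by hand gets rid of the counter too):

#pragma unroll
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    if (RayColor.w < EPSILON)
        break;
}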

Yes, by “profile” I mean ‘-arch sm_11’ → profile 1.1 etc.

Using ‘#pragma unroll’ sends the compiler straight to the devil himself :) (it just crashes ptxas.exe)
Unrolling the loop manually leads to the same result.

There’s no compiler switch to tell the compiler not to use lmem. However, you may try -maxrregcount 64 with the 1.1 target.
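Something along these lines on the nvcc command line (the file name is just a placeholder; -Xptxas -v makes ptxas print the register/lmem usage):

nvcc -arch sm_11 -maxrregcount 64 -Xptxas -v kernel.cu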

P.S. Is the ‘quick reply’ feature broken?

I see no “quick reply”. I posted this using “fast reply”.

Just another tip for anyone with this kind of problem: avoid the ?: operator by all means. At least for types like float2, the compiler completely pointlessly uses lmem for me with it, whereas an “if/else” works just fine (see the sketch below).
And the usual tip: make variables that are not used all the time and that are the same for all threads __shared__ (test whether __syncthreads() or writing in each warp gives better performance).
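A minimal sketch of what I mean with the ?: operator (cond, a and b are made-up variables, not from any real kernel):

float2 v;
// the conditional operator: this is the form that ends up in lmem for me
v = cond ? a : b;
// the equivalent if/else stays in registers for me:
if (cond)
    v = a;
else
    v = b;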

LOL

With -maxrregcount 64 on the 1.1 target:
Used 64 registers, 16+0 bytes lmem, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

Without maxrregcount on the 1.1 target:
Used 60 registers, 4+0 bytes lmem, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

With -maxrregcount 128 on the 1.1 target:
Used 67 registers, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

– so success :) [but it needs 7 extra registers to cover those 4 bytes of lmem !!!]

The same kernel with profile 1.3, without maxrregcount:
Used 67 registers, 16152+24 bytes smem, 768 bytes cmem[0], 188 bytes cmem[1]

– see the amount of constant memory (cmem) compared to the 1.1 target … wtf?

Ah, BTW, a nice tip & trick (at least for me) is to pass all arguments to __device__ ‘functions’
by const reference; this reduced register usage a little.
For example, this:
__device__ int RayBoxTest(const float3 &BBMin, const float3 &BBMax, const float3 &RayOrg, const float3 &RayDirInv)

takes fewer registers than this:
__device__ int RayBoxTest(float3 BBMin, float3 BBMax, float3 RayOrg, float3 RayDirInv)

in the final kernel.

I’ve ripped out all the ?: operators, but that changed nothing.

Shared memory is already used for all common variables that the kernels share.

I think I heard somewhere that if you use a floating-point literal, and don’t put the little f at the end to make it single-precision, the compiler will store and use it as double precision.

I suppose that, to be strict about the C standard, this is necessary. If you write this:

float a = 5.0f;

float b = a * 2.0;

‘a’ has to get implicitly converted to double before being multiplied, and the result is then converted back to float. Don’t know if it’ll ever really matter in real life, but in this other example it’d make a difference:

float a = 5.0f;

float b = (a * 2.000000000001) - 10.0f;

// b equals 5e-12, not 0
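And for comparison: with the ‘f’ suffix the literal rounds to the nearest float, which is exactly 2.0f, so the difference disappears:

float a = 5.0f;
float b = (a * 2.000000000001f) - 10.0f;
// b equals 0, because 2.000000000001f is exactly 2.0f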

I’ve heard references actually cause lmem use.

As for the floats, I’ve always put ‘f’ at the end.
As for references – I don’t see a correlation between references and lmem usage :)

Well… if you ever take your kernel apart and figure out what was causing the lmem, please let us know.

Well, I only know which variable went to lmem - I don’t know why the compiler decided so
(it was a ‘standard’ unsigned int used as a loop counter).
My wild guess is that the compiler puts rarely used variables into lmem if register usage is above some internal limit (the variable was used only to iterate, not to index anything inside the loop; when I used it to index something inside the loop, the problem was gone even without the maxrregcount solution).
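Roughly like this (a simplified sketch; gBounceColors is just a made-up array to illustrate the indexed use, it is not in the real kernel):

// version 1: the counter only drives the iteration (this went to lmem on the 1.1 target)
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    if (RayColor.w < EPSILON)
        break;
}

// version 2: the counter also indexes something inside the loop (this stayed in a register)
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    gBounceColors[i] = RayColor; // made-up indexed write, just for illustration
    if (RayColor.w < EPSILON)
        break;
}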

Dear NVIDIA, please fix at least some compiler oddities in CUDA 2.1 :D

I’d be happy with a switch that just completely and fully disables any use of lmem. I have quite a lot of kernels, and so far they all become a factor of 2 to 10 slower when using lmem, even compared to using twice the registers or using shared memory.

Admittedly, my stuff is all exclusively memory-bandwidth bound, but as soon as things get really large so they no longer fit into cache/shared memory, a lot of problems are of that kind.