Hi
Is there any way to prevent the compiler from using thread-local memory (lmem)?
I have a large kernel, and compiling it for profile 1.1
gives me: 60 regs, 4 bytes of lmem
BUT compiling for profile 1.3 gives me 62 regs, 0 bytes of lmem
and in the second case performance is about 20% better (on the same machine).
Examining the PTX shows that those 4 bytes of lmem are used for the (int) loop counter.
WTF?
The kernel looks like this:
__device__ void CastRay(float3 &rorg, float3 &rdir, float4 &rcol)
{
    … modify rorg, rdir, rcol …
}

__global__ void kernel()
{
    float3 RayPosition = gCameraPosition;
    float3 RayDir = …;
    float4 RayColor = …;
    for (int i = 0; i < 8; i++) // <- here 'i' is in local memory for 1.1 profile
    {
        CastRay(RayPosition, RayDir, RayColor);
        if (RayColor.w < EPSILON)
            break;
    }
    … some other operations
}
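
In case anyone wants to try this themselves, here is a self-contained skeleton of the same loop structure. Everything not in the snippet above is a placeholder I made up for illustration: the body of CastRay, the EPSILON value, the initial ray values, the `out` parameter and the host code. A kernel this small will most likely not spill anything, so it is only a frame to paste the real code into, not a reproduction of the 4-byte lmem result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define EPSILON 1e-4f  // assumed value; the original post does not give one

// Assumed to be a __constant__ variable; the post does not show its declaration.
__constant__ float3 gCameraPosition;

// Placeholder body: the real CastRay is elided in the post.
__device__ void CastRay(float3 &rorg, float3 &rdir, float4 &rcol)
{
    rorg.x += rdir.x;
    rorg.y += rdir.y;
    rorg.z += rdir.z;
    rcol.w *= 0.5f;  // arbitrary attenuation so the early exit is reachable
}

// 'out' is a hypothetical output buffer so the compiler cannot discard the loop.
__global__ void kernel(float4 *out)
{
    float3 RayPosition = gCameraPosition;
    float3 RayDir = make_float3(0.0f, 0.0f, 1.0f);         // placeholder
    float4 RayColor = make_float4(1.0f, 1.0f, 1.0f, 1.0f);  // placeholder

    for (int i = 0; i < 8; i++)  // the counter in question
    {
        CastRay(RayPosition, RayDir, RayColor);
        if (RayColor.w < EPSILON)
            break;
    }

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = make_float4(RayPosition.x, RayPosition.y, RayPosition.z, RayColor.w);
}

int main()
{
    const int n = 64;
    float3 cam = make_float3(0.0f, 0.0f, 0.0f);
    cudaMemcpyToSymbol(gCameraPosition, &cam, sizeof(cam));

    float4 *dOut = 0;
    cudaMalloc((void **)&dOut, n * sizeof(float4));
    kernel<<<1, n>>>(dOut);

    float4 hOut;
    cudaMemcpy(&hOut, dOut, sizeof(float4), cudaMemcpyDeviceToHost);
    printf("thread 0: z = %f, w = %f\n", hOut.z, hOut.w);

    cudaFree(dOut);
    return 0;
}
```

Assuming the file is saved as repro.cu (my name for it) and the toolkit still supports these targets, the register and lmem counts per kernel can be printed with `nvcc -arch=sm_11 --ptxas-options=-v repro.cu`, and again with `-arch=sm_13` for comparison.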