local thread memory & compiler

Hi

Is there any way to prevent the compiler from using local thread memory?
I have a large kernel, and compiling it for profile 1.1
gives me: 60 regs, 4 bytes of lmem,
BUT compiling it for profile 1.3 gives me 62 regs, 0 bytes of lmem,
and in the second case performance is about 20% better (on the same machine).
Examining the PTX shows that those 4 bytes of lmem are used for an (int) loop counter.
WTF?
The kernel looks like this:

__device__ void CastRay(float3 &rorg, float3 &rdir, float4 &rcol)
{
    // … modify rorg, rdir, rcol …
}

__global__ void kernel()
{
    float3 RayPosition = gCameraPosition;
    float3 RayDir = …;
    float4 RayColor = …;
    for (int i = 0; i < 8; i++) // ← here ‘i’ is in local memory for the 1.1 profile
    {
        CastRay(RayPosition, RayDir, RayColor);
        if (RayColor.w < EPSILON)
            break;
    }
    // … some other operations
}

By “profile” I assume you mean the -arch flag?

The difference is probably due to the increased number of registers available on 1.3 GPUs.

I don’t know how to force the compiler to use more registers (without working at a lower level). In this case getting rid of the loop counter entirely by unrolling the loop might solve your problem.
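For example, something like this on the loop you posted (just a sketch; the compiler may refuse to unroll a loop that contains an early break, in which case repeating the body eight times by hand gets rid of the counter too):

#pragma unroll
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    if (RayColor.w < EPSILON)
        break;
}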

Yes, by “profile” I mean ‘-arch sm_11’ → profile 1.1 etc.

Using ‘#pragma unroll’ sends the compiler straight to the devil himself :) (it just crashes ptxas.exe)
Unrolling the loop manually leads to the same result.

There’s no compiler switch to tell the compiler not to use lmem. However, you may try -maxrregcount 64 with the 1.1 target.
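Something along these lines on the nvcc command line (the file name is just a placeholder; -Xptxas -v makes ptxas print the register/lmem usage):

nvcc -arch sm_11 -maxrregcount 64 -Xptxas -v kernel.cu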

P.S. Is the ‘quick reply’ feature broken?

I see no “quick reply”. I posted this using “fast reply”.

Just another tip for anyone with this kind of problem: avoid the ?: operator by all means. At least for types like float2, the compiler completely pointlessly uses lmem for me with it, whereas an “if/else” works just fine (see the sketch below).
And the usual tip: make variables that are not used all the time and that are the same for all threads __shared__ (test whether __syncthreads() or writing in each warp gives better performance).
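A minimal sketch of what I mean with the ?: operator (cond, a and b are made-up variables, not from any real kernel):

float2 v;
// the conditional operator: this is the form that ends up in lmem for me
v = cond ? a : b;
// the equivalent if/else stays in registers for me:
if (cond)
    v = a;
else
    v = b;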

LOL

With -maxrregcount 64 on the 1.1 target:
Used 64 registers, 16+0 bytes lmem, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

Without maxrregcount on the 1.1 target:
Used 60 registers, 4+0 bytes lmem, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

With -maxrregcount 128 on the 1.1 target:
Used 67 registers, 16152+24 bytes smem, 768 bytes cmem[0], 164 bytes cmem[1]

– so success :) [but it needs 7 extra registers to cover those 4 bytes of lmem !!!]

The same kernel with profile 1.3, without maxrregcount:
Used 67 registers, 16152+24 bytes smem, 768 bytes cmem[0], 188 bytes cmem[1]

– see the amount of constant memory (cmem) compared to the 1.1 target … wtf?

Ah, BTW, a nice tip & trick (at least for me) is to pass all arguments to __device__ ‘functions’
by const reference; this reduced register usage a little.
For example, this:
__device__ int RayBoxTest(const float3 &BBMin, const float3 &BBMax, const float3 &RayOrg, const float3 &RayDirInv)

takes fewer registers than this:
__device__ int RayBoxTest(float3 BBMin, float3 BBMax, float3 RayOrg, float3 RayDirInv)

in the final kernel.

I’ve ripped out all the ?: operators, but that changed nothing.

Shared memory is already used for all common variables that the kernels share.

I think I heard somewhere that if you use a floating-point literal, and don’t put the little f at the end to make it single-precision, the compiler will store and use it as double precision.

I suppose that, to be strict about the C standard, this is necessary. If you write this:

float a = 5.0f;

float b = a * 2.0;

‘a’ has to get implicitly converted to double before being multiplied, and the result is then converted back to float. Don’t know if it’ll ever really matter in real life, but in this other example it’d make a difference:

float a = 5.0f;

float b = (a * 2.000000000001) - 10.0f;

// b equals 5e-12, not 0
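And for comparison: with the ‘f’ suffix the literal rounds to the nearest float, which is exactly 2.0f, so the difference disappears:

float a = 5.0f;
float b = (a * 2.000000000001f) - 10.0f;
// b equals 0, because 2.000000000001f is exactly 2.0f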

I’ve heard references actually cause lmem use.

As for the floats, I’ve always put ‘f’ at the end.
As for references – I don’t see a correlation between references and lmem usage :)

Well… if you ever take your kernel apart and figure out what was causing the lmem, please let us know.

Well, I only know which variable went to lmem - I don’t know why the compiler decided so
(it was a ‘standard’ unsigned int used as a loop counter).
My wild guess is that the compiler puts rarely used variables into lmem if register usage is above some internal limit (the variable was used only to iterate, not to index anything inside the loop; when I used it to index something inside the loop, the problem was gone even without the maxrregcount solution).
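Roughly like this (a simplified sketch; gBounceColors is just a made-up array to illustrate the indexed use, it is not in the real kernel):

// version 1: the counter only drives the iteration (this went to lmem on the 1.1 target)
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    if (RayColor.w < EPSILON)
        break;
}

// version 2: the counter also indexes something inside the loop (this stayed in a register)
for (int i = 0; i < 8; i++)
{
    CastRay(RayPosition, RayDir, RayColor);
    gBounceColors[i] = RayColor; // made-up indexed write, just for illustration
    if (RayColor.w < EPSILON)
        break;
}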

Dear NVIDIA, please fix at least some compiler oddities in CUDA 2.1 :D

I’d be happy with a switch that just completely and fully disables any use of lmem. I have quite a lot of kernels, and so far they all become a factor of 2 to 10 slower when using lmem, even compared to using twice the registers or using shared memory.

Admittedly, my stuff is all exclusively memory-bandwidth bound, but as soon as things get really large so they no longer fit into cache/shared memory, a lot of problems are of that kind.