lmem -- heeeelp :)

this kernel:

constant unsigned int gNumberOfTriangles

global void KDKernelMINMAXCopy(float4 *Target, float4 *Source, uint2 *Keys)

{

const unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

if (x >= gNumberOfTriangles)

 return;

unsigned int Key = Keys.y;

float4 mins = Source[(Key<<1)+0];

float4 maxs = Source[(Key<<1)+1];

	mins.w = __int_as_float(Key);

	maxs.w = __int_as_float(Key);

Target[(x<<1)+0] = mins;

Target[(x<<1)+1] = maxs;

return;

}

Compilation commandline:

(CUDA_BIN_PATH)\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 8\VC\bin" -arch sm_13 -c -keep -Xptxas=-v -DUSE_GT200 -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /Ox,/Ob2,/Oi,/Ot,/EHsc,/MT,/GS-,/GR,/W3,/nologo,/Wp64,/Zi -I"(CUDA_INC_PATH)” -I./ -I…/…/common/inc -I"$(DXSDK_DIR)\Include" -Xcudafe --diag_suppress=unsigned_compare_with_negative -o x64$(ConfigurationName)\KDKernelConstruct.obj …\KDTree\KDKernelConstruct.cu

gives this:

Used 11 registers, 16+0 bytes lmem, 40+32 bytes smem, 40 bytes cmem[0]

and there is no way to rearange it to not use this fucking ‘local’ memory :/

since this is a kernel that just basically copies data from one place to another,

the lmem in the middle is the last thing i want :/

anyone have any idea how to rearange it to force compiller to not use this lmem ?

as it is shown in the ptx asm below, lmem is used to save 4 registers

where there is no need to save any registers since 11+4 is < 16 !!!

-maxrregcount=16

What kind of rearranging did you try? Since the compiler does it right for the mins, did you try e.g.

float Keyf = __int_as_float(Key);

Source += Key << 1;

Target += x << 1;

float4 vals  = *Source;

*Target = (float4){vals.x, vals.y, vals.z, Keyf};

Source++; Target++;

vals  = *Source;

*Target = (float4){vals.x, vals.y, vals.z, Keyf};

There is also some make_float4 or such macro if you do not like the C99 syntax there.

Just keep in mind that compilers still are really stupid and will miss any optimization opportunity you give them a chance to miss :P

it is not an option because i have 4 kernels in one *.cu file, and other 3 uses 32 regs

a really really dont want one separate *.cu for each kernel :)

but what about a coalesced read/write ? first you read one element, the write it, then read another, then write it.

You misunderstood coalescing, coalescing is about the memory accesses done in parallel by different threads, not about the serial memory access of a single thread (which means there is no coalescing with your current code anyway).

It might make a difference if the compiler would merge the two reads, but

  1. I do not think it would do it

  2. since you already read a float4 it can not merge it anyway, the largest read instruction is a 128 bit read

(the same applies to writes).

It’s “not an option” because it disagrees with your aesthetics?

because it’s a pain in the ass,

for every kernel i need separate *.cu file to be able to define maxregs

there SHOULD be some #pragma in CUDA 2.1 to do this per kernel, not per file.

I completely agree about the #pragma.

But separate files aren’t so bad. You don’t need the Driver API or anything. In each .cu file you have your kernel and a C++ wrapper that call it, and let it compile into a .o. Then in the other files you just call the wrapper (ie, you don’t use the <<<>>> syntax). Everything links together like in ordinary C++.

P.S. I doubt you’ll lose much in performance if your 16-reg kernel is told to use 32 regs. a) it probably will still use 16 regs and b ) lower occupancy probably won’t hurt performance, esp on gtx260