question about dynamic shared memory

Hello,

I have this kernel (suint is short unsigned int):

extern __shared__ uint shared[];

__global__ void merge_small(uint* d_block,suint pipeSize, suint shift){
	// result takes the first 2*pipeSize words, blockX the words after that
	uint* result=shared;
	// parenthesized: + binds tighter than <<
	uint* blockX=shared + (pipeSize<<1);
	//__shared__ uint result[1024];
	//__shared__ uint blockX[512];
	suint k,p;

	// parenthesized: without it this parses as blockIdx.x<<(1 + shift)
	p=(blockIdx.x<<1) + shift;
	uint* d_blockA=d_block + pipeSize*p;
	uint* d_blockB=d_blockA + pipeSize;

	// rank each element of A within B
	blockX[threadIdx.x]=d_blockB[threadIdx.x];
	k=d_blockA[threadIdx.x];
	__syncthreads();

	p=binarySearchLeft(blockX,pipeSize,k);
	result[threadIdx.x+p]=k;

	// rank each element of B within A
	__syncthreads();
	blockX[threadIdx.x]=d_blockA[threadIdx.x];
	k=d_blockB[threadIdx.x];
	__syncthreads();
	p=binarySearchRight(blockX,pipeSize,k);
	result[threadIdx.x+p]=k;
	__syncthreads();

	// write the merged result back over A and B
	d_blockA[threadIdx.x]=result[threadIdx.x];
	d_blockB[threadIdx.x]=result[threadIdx.x+pipeSize];
}

Why do I get 11/12 registers per thread when I use dynamic shared memory, but only 9 registers per thread with the static arrays?

I want 10 or fewer registers per thread (occupancy = 100%).

Thanks :)

You have two more pointers in the code using dynamic shared memory…

If you really care, you could check the generated PTX assembly. On the other hand, before worrying too much I'd check how much speed you actually lose with less than 100% occupancy. I believe achieving 100% occupancy is often overvalued; beyond about 2/3 or so the gain often becomes marginal.

__shared__ uint result[1024];

__shared__ uint blockX[512];

result and blockX are pointers.

uint* result;

uint* blockX;

are two pointers too.

Occupancy is important to me.

Can I insert PTX code into a .cu file?

Thanks !

No - syntactically you can use them as pointers, but no actual pointers are defined by these statements (you cannot assign values to them). And the compiler should resolve these to constant addresses in memory.

These are two actual pointers, yes.

The discussion is slightly complicated by the aggressive optimization the compiler does. But the compiler cannot know beforehand with which value of pipeSize the kernel will be called, so it cannot optimize the pointers away. You might try what happens if you replace pipeSize with a constant in the pointer calculations.
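A minimal sketch of what replacing pipeSize with a constant could look like, assuming the pipe size is known at compile time. PIPE_SIZE, copy_through_shared and the host code are hypothetical names made up for illustration, and the kernel body only does a reversed copy instead of the merge. Whether this actually brings the register count down would still have to be checked in the compiler output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

extern __shared__ unsigned int shared[];

// PIPE_SIZE is a hypothetical compile-time stand-in for the runtime
// pipeSize argument of the original kernel.
template <unsigned int PIPE_SIZE>
__global__ void copy_through_shared(unsigned int* d_block)
{
	// Both offsets into the dynamic shared array are now compile-time
	// constants, so the compiler is free to fold them into the addressing.
	unsigned int* result = shared;
	unsigned int* blockX = shared + (PIPE_SIZE << 1);

	blockX[threadIdx.x] = d_block[threadIdx.x];
	__syncthreads();

	// Placeholder work (not the merge): reverse the block via shared memory.
	result[threadIdx.x] = blockX[PIPE_SIZE - 1 - threadIdx.x];
	__syncthreads();

	d_block[threadIdx.x] = result[threadIdx.x];
}

int main()
{
	const unsigned int N = 512;
	static unsigned int h[N];
	for (unsigned int i = 0; i < N; ++i) h[i] = i;

	unsigned int* d = NULL;
	cudaMalloc((void**)&d, N * sizeof(unsigned int));
	cudaMemcpy(d, h, N * sizeof(unsigned int), cudaMemcpyHostToDevice);

	// Dynamic shared memory: 2*N words for result plus N words for blockX.
	copy_through_shared<N><<<1, N, 3 * N * sizeof(unsigned int)>>>(d);

	cudaMemcpy(h, d, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);
	printf("first=%u last=%u\n", h[0], h[N - 1]);

	cudaFree(d);
	return 0;
}
```

The template parameter keeps the dynamic allocation at launch time but fixes the layout inside the kernel; one instantiation per supported pipe size would be needed.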

Yes, you can insert inline assembly using asm() statements, although I don't see how that is going to help here. What I wanted to suggest was running nvcc with the -ptx flag and looking at the differences between the two kernels.
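For reference, a small sketch of both points: an asm() statement embedding one PTX instruction, with the nvcc invocations noted in the comments. The file name merge.cu and the functions lane_id and write_lane_ids are hypothetical; the example only reads a special register and is unrelated to the merge kernel.

```cuda
// Assumed file name: merge.cu
//   nvcc -ptx merge.cu -o merge.ptx       writes the PTX for inspection
//   nvcc --ptxas-options=-v -c merge.cu   reports registers used per kernel
#include <cstdio>
#include <cuda_runtime.h>

// Inline PTX: copy the %laneid special register into a C variable.
__device__ unsigned int lane_id()
{
	unsigned int id;
	asm("mov.u32 %0, %%laneid;" : "=r"(id));
	return id;
}

__global__ void write_lane_ids(unsigned int* out)
{
	out[threadIdx.x] = lane_id();
}

int main()
{
	const unsigned int N = 64;
	unsigned int h[N];
	unsigned int* d = NULL;

	cudaMalloc((void**)&d, N * sizeof(unsigned int));
	write_lane_ids<<<1, N>>>(d);
	cudaMemcpy(h, d, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);

	// Print the lane index seen by two threads in different warps.
	printf("thread 0: lane %u, thread 33: lane %u\n", h[0], h[33]);

	cudaFree(d);
	return 0;
}
```

Building the static and the dynamic variant of the kernel with -ptx and diffing the two .ptx files should show where the extra registers come from.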