How to use all 16KB shared memory

I’d like to use all of the 16KB shared memory available on the GPU in my CUDA application.

Some of the shared memory is used for the device function parameter block [nvcc_2.2.pdf, p. 37], how much can be displayed with nvcc -Xptxas=-v (the number following the +).

By not passing any parameters to the kernel the device function parameter block can be reduced to 16 bytes:

[codebox]shared char shared[16*1024];

extern “C”

global void kernel()

{

shared[0] = 0;

}

int main()

{

kernel<<<1,1>>>();

printf("%s\n", cudaGetErrorString(cudaGetLastError()));

return 0;

}

[/codebox]

$ nvcc -Xptxas=-v testkernel.cu

ptxas info : Compiling entry function ‘kernel’

ptxas info : Used 1 registers, 16400+16 bytes smem

ptxas error : Entry function ‘kernel’ uses too much shared data (0x4000 bytes + 0x10 bytes system, 0x4000 max)

Is there a way of avoiding those system allocated 16 bytes and using all of the 16KB shared memory in a kernel?

Hi,

Why do you need to use all 16K? why does 16K - [some few bytes] not suffice?

Thats probably indicates some problem with your algorithm/code. Maybe if you share why you need the whole

size, someone can assist in finding an alternate solution.

eyal

16k is certainly a nice round number. I had an application where I wanted each thread to maintain a histogram requiring 256 bytes. I couldn’t use 64 threads per block, and 32 threads per block would allocate only a single block per SM. With 63 threads per block, at least I can get more than one warp per SM, but coalescing is no longer simple, because with the usual way of implementing tid = blockIdx.x*blockDim.x + threadIdx.x the first thread of the second block is no longer aligned properly.

Those few bytes matter. It’s not just bad algorithm design.

Unfortuntely, I don’t know of a way to use all 16k.

The problem I’m working can be divided in two steps:

keyderivation();

rc4encryption();

Each thread performs both steps on independent input data.

The first step (keyderivation) is computationally intensive, but needs little shared memory.

The second step (rc4encryption) does not require much computation, but needs 256 bytes addressable memory, which is accessed bytewise in a random pattern.

I can run the kernel with 63 threads per block using 256*63 = 16128 bytes of shared memory. A blocks size of 64 is not possible as described in the initial post.

In fact I’d like to use 128 (or 127) threads per block if possible.

I’ve tried this:

keyderivation();

__syncthreads():

if ((threadIdx.x & warpSize) == 0)

rc4encryption();

__syncthreads():

if ((threadIdx.x & warpSize) == 1)

rc4encryption(); 

__syncthreads():

The idea is that two threads in different warps share their state and to make sure that those threads execute serially.

I can get a 42% speedup with this method if I assume the full 16KB are available for a test. Having a few bytes less make allocation of shared memory to threads so ugly that I don’t think it is worth while…

Anyway, I believe there are many problems where a few bytes less that a power of 2 can hurt.

It would be great if somebody could explain why the device function parameter block eats shared memory and if there is a way to avoid this.

Long ago someone did some exploration with test kernels accessing shared arrays with negative indices to see what was stored in those initial bytes. I think it may have included blockDim and gridDim (perhaps warpSize now as well?). No idea how to get that memory back without doing something unsupported and dangerous, like deliberately indexing a shared array of size (16 kB - 16 bytes) with a negative offset. While it would be interesting to know if that actually works, I wouldn’t base any real code on it. :)

Hi,

Can you load 128bytes, work on them and then load the additional 128 bytes? I do it and indeed its not the easiest thing

to code,debug and test but it work.

Hey, you could have also wanted a 32K for shared memory, why just 16K? I guess if you needed the full 32K you’d break

it into 2 16Ks, no? Am I missing something here?

Breaking it into a non 2 mul value, is indeed very ugly :)

BTW - what about textures? have you tried using textures ? it has its advantages and you’re not limited to this 16K shared mem limitation.

The extra bytes are blockIdx and gridDim and maybe more/other internal params. Nothing you can do about it :)

eyal

The 16-byte header is partially documented on the PTX ISA document (some obscure notes in the descriptions of %ctaid and %nctaid).

wumpus found the remaining values during his work on decuda.

The layout (copy-pasted from Barra) is:

uint16_t header[8];

header[0] = 0;	   // gridid

header[1] = blockx;   // blockDim.x

header[2] = blocky;   // blockDim.y

header[3] = blockz;   // blockDim.z

header[4] = gridx;	// gridDim.x

header[5] = gridy;	// gridDim.y

header[6] = bidx;	// blockIdx.x

header[7] = bidy;	// blockIdx.y

Where warpid and laneid are stored is an interesting question.

I suspect they are currently turned into constants by ptxas. I need to experiment with CUDA 2.2…

Edit: warpSize is the WARP_SZ constant in PTX, translated into the constant 32 in cubin.

The PTX variables %laneid, %warpid and %smid are extracted from the i0 special register (which contains the whole “physical thread ID”) using bit masking. Trying to use %nsmid in PTX causes a compilation error.

256 bytes is reserved for Parameters. (I think it is a parameter area). Check the CUBIN file for the amount of shared memory occupied by your kernel.

I think -ve indices should work. It would be interesting to experiment…

but dont use such code for production (coz it may NOT work for future or some hardware)

I’ll try if overwriting the reserved bytes works and will let you know the result.

RC4 basically computes a random permutation of 0 … 255. Each step swaps two bytes, which are selected based on the value of a key byte and the permutation so far. I don’t see how that can be split in two.

I think textures are read only, so they won’t help in my case as RC4 needs write access.

Here is an idea how to accomplish this, but it might just as well break horribly:

  1. Compute the start of 16kb shared memory block by applying (the correct) negative offset to

a dynamically allocated 15.75kb block (the maximum you can get). This might require some

experimentation to get right.

  1. Create backup of the first 256bytes of shared memory (e.g. by copying to global mem).

This may have to be done on a per-thread block basis (different blockIdx variables in the

header!)

  1. Now load all grid, and block specific variables like blockIdx.x/y/z,

blockDim, gridDim into registers. e.g. with volatile int blockIdx_x = blockIdx.x;

  1. Do your work in the kernel. Use and Overwrite all 16kb of shared memory. Refrain from

    using blockIdx and other special variables. Use the backup copies we’ve put into registers.

  2. Restore first 256 bytes of shared memory (read from global mem).

  3. Now exit the kernel (thread block)

  4. Hopefully things did not break ;)

Oh, thats nice. Let me see if I understand your algo. You load X bytes of data and then starts iterating over them one by one.

The result of iteration Y depends on some data Z and the result from iteration Y - 1, is that correct ???

If so, you can still use what I suggested - I do it :)

Load whatever amount of data you want/fit in Shared mem/Occupancywise correct/… and then start the calculations.

Once you finish your chunk of calculated values, save the last step’s result, load the next chunk of data and in order

to calculate the first value of the second chunk, use the last step’s result which you just saved. From now on you can

continue again with regular code till you finish the second chunk, and so on…

Hope that makes sense :)

eyal

Oh! That would be great! Thanks!

My few cents:

  1. Just declare a shared memory array.

  2. In your kernel, store the “start” address of the array into global memory and print the address from your application.

  3. Run it again and again with different shared memory sizes and see if it changes.

  4. That will help you understand the notion of “pointers” in shared memory.

  5. Note that shared memory usage is always rounded to nearest “512 byte” boundary (or some boundary) - check the CUDA Occupancy XLS sheet.

Good Luck!

Best Regards,

Sarnath

Here how it seem to work for me:

  1. dump all the data you need out of your parameters and vars blockIdx, blockDim, gridDim to registers (using the ‘volatile’ trick)

  2. syncthreads before writing to sm

  3. subtract 16 + sumoflocalvars bytes from your sm pointer and trash all your shared memory

__global__ void func(float4* _o, int _n)

{

  extern __shared__ float _sd[];

float* sd = _sd - 4 - 2/*num of args*/;

// remember input args in registers

  float4 *  volatile o = _o;

  int volatile n = _n;

// wait for all threads to finish dumping args to registers

  __syncthreads();

// trash your shared memory and be happy

  memset(sd, 0, 16384);

...

}

This seem to work alright for me.

So you are not restoring the contents of SM after trashing it?

This means the block scheduler seems to explicitly reset the Shared Memory contents (16 byte header and arguments list)

for each block that is scheduled on a CUDA processor. Without this “feature”, any subsequent scheduled blocks would find

a very broken arguments list and grid dimensions in SM -> segmentation fault / unspecified launch failure.

This could be one of the reasons for notable kernel call overhead: There are apparently a lot of writes to SM by the block

scheduler.

I think one needs to be careful with CUDA built-in functions (inlined functions). If any declares a local shared memory

variable, there would be issues after trashing all 16kb. Local memory should be safe though.

Christian

No, I don’t restore anything. I assume that it gets restored by the scheduler, and of course I can be horribly horribly wrong. So far things work alright, but it could be pure luck for me with my kernel. If your kernel crashes or produces unexpected results for unknown reason, then one should consider such behaviour and try moving input parameters to some other memory or try restoring them.

@Sergey,

The code looks coool. Probably, people who have complex kernel can check out and post here.

Yes, I’ll also remember to post here if I find out something else. Btw, keep in mind that there is currently a bug in nvcc when you save pointers to registers with volatile trick, compiler looses track of pointer alignment, and following loads or stores from that pointer end up being a sequence of 4byte reads/writes, instead of 1 8 byte for float2 or 16 byte for float4 read or write.

If the 256 bytes are used as a lookup-table, or something similar, I don’t see how this helps. Can you elaborate on how this can work?

Another thing you want to watch out for is bank conflicts. If the indices are “random” then there will be some unavoidable conflicts, but it looks like one of the indices is sequential, so you may want to use a 1024 byte block to store tables of 4 threads interleaved. For a memory-intense algorithm, bank conflicts can make a big difference.

Hi Jamie,

My point was that if someone cant squeeze his data into 16K, those extra few bytes (overriding block indexes, grid dims etc…)

are probably won’t help too much either. Why does an algorithm need exact 16K and not for example 32K? what would one

do then?

I suggested to break the 16K (or whatever amount of data you need) into chunks that will fit into the smem, while giving

you good occupancy on the way. The chunks could overlap, something like this:

// lets say we need to load 16K data

__shared__ int iPos;

__shared__ float sdata[ 256 * 4 ];

if ( 0 == threadIdx.x ) iPos = 0; __syncthreads();

while ( iPos < 16 * 1024 )   // Or some other boolean condition...

{

	// Load current chunk of 256 elements. Note that the kernel was called with 256 threads.

	sdata[ threadIdx.x ] = pGMEMData[ .... + iPos + threadIdx.x ];

	__syncthreads();

	// Do whatever you need with the data...

	...

	// Move to the next chunk - if overlapping don't increment by 256 but less and write the data to pGMEMData

   // so next iteration will re-read the data calculated on the current iteration.

   if ( 0 == threadIdx.x )

   { 

	  iPos += 256;

	}

	__syncthreads():

}

Bank conflicts are surely an issue, I usually didnt bother with them too much since I got good speedups. I guess

it should be taken into account in such case.

what do you think?

eyal