Max shared memory

I’m working with an algorithm that recursively splits an array into smaller sections until they fit into shared memory, and then processes them there. The section sizes must be powers of 2.

I’ve noticed that a bit of shared memory seems to be reserved when a kernel is launched. For example, I get out-of-memory errors when doing something like this:

int device;
cudaDeviceProp properties;

cudaGetDevice( &device );
cudaGetDeviceProperties( &properties, device );

// Request the full per-block shared memory limit as dynamic shared memory.
foo<<< grid, block, properties.sharedMemPerBlock >>> ( );

Subtracting a small amount (~100 bytes) from the requested shared memory size makes the launch succeed. However, my algorithm requires a power-of-2 shared memory size, so giving up any part of the total on a given card forces me to drop down to half as much.
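To make the cost concrete, here is a minimal host-side sketch of the arithmetic, not the actual algorithm. The helper names `largest_pow2_at_most` and `usable_pow2_bytes` are hypothetical, and the reserved amount is an assumption based on the ~100 bytes observed above. In real code the limit would come from `properties.sharedMemPerBlock`.

```cpp
#include <cstddef>

// Hypothetical helper: largest power of two that is <= n (0 when n is 0).
std::size_t largest_pow2_at_most(std::size_t n) {
    if (n == 0) return 0;
    std::size_t p = 1;
    // Double p while the doubled value would still be <= n.
    while (p <= n / 2) p *= 2;
    return p;
}

// Hypothetical helper: largest power-of-two payload that still fits once
// `reserved` bytes (an assumed, measured overhead) are set aside.
std::size_t usable_pow2_bytes(std::size_t per_block_limit,
                              std::size_t reserved) {
    if (reserved >= per_block_limit) return 0;
    return largest_pow2_at_most(per_block_limit - reserved);
}
```

With a 16384-byte limit and an assumed 100 reserved bytes, `usable_pow2_bytes(16384, 100)` returns 8192, which is the drop to half described above.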

Is there any way to reclaim this memory?

I’ve looked around in the PTX assembly and didn’t see any extra declarations, so I figure it is being reserved by either the JIT or the runtime. Is there a flag I can give to either of these to make it use global or local memory instead of shared? If not, will I see any consequences if I just allocate slightly less and clobber a small part of shared mem?

If I remember right, shared memory is used to store thread/block IDs as well as kernel parameters. It’s not much, but it does steal those bytes from your total. If you search this forum you might find a discussion from about a year ago where someone figured out those extra values, along with their position and padding. There’s likely no way around the loss…

One way to help might be to split your data a few more times so that several blocks can execute simultaneously. You can’t use a block of 16K, but perhaps 7 blocks of 2K would work… it depends on whether your algorithm is more efficient as 7 blocks of 2K or as 1 block of 8K.
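As a rough way to compare the two layouts, the sketch below counts how many blocks fit in one multiprocessor's shared memory when each block also loses a fixed reserved amount. The 16384-byte total and the 100-byte overhead are assumptions taken from the numbers in this thread, and `blocks_that_fit` is a hypothetical helper. It looks at shared memory only and ignores other limits such as registers, thread counts, the maximum number of resident blocks, and allocation granularity.

```cpp
#include <cstddef>

// Hypothetical helper: number of blocks whose shared memory fits on one
// multiprocessor, if every block needs `payload` bytes plus `reserved`
// bytes of overhead. Considers shared memory only.
std::size_t blocks_that_fit(std::size_t total, std::size_t payload,
                            std::size_t reserved) {
    std::size_t per_block = payload + reserved;
    if (per_block == 0) return 0;  // avoid dividing by zero
    return total / per_block;
}
```

Under those assumptions, `blocks_that_fit(16384, 2048, 100)` gives 7 and `blocks_that_fit(16384, 8192, 100)` gives 1, which matches the comparison of 7 blocks of 2K against 1 block of 8K.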

According to page 28 of nvcc2.[01].pdf, thread/grid index information is stored in local memory. That is also different from what I thought…