I’m working with an algorithm that recursively splits an array into smaller sections until each one fits into shared memory, and then processes them there. The sizes of these small sections must be powers of 2.
I’ve noticed that a small amount of shared memory seems to be reserved when a kernel is launched. For example, I get out-of-memory errors when doing something like this:
int device;
cudaDeviceProp properties;
cudaGetDevice( &device );
cudaGetDeviceProperties( &properties, device );
// Request the full per-block shared memory as dynamic shared memory; this launch fails.
foo<<< grid, block, properties.sharedMemPerBlock >>> ( );
Subtracting a small amount (~100 bytes) from the requested shared memory size makes the launch succeed. However, my algorithm requires the shared memory size to be a power of 2, so subtracting anything from the total shared memory on a given card forces me to drop down to half as much.
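To make the cost of that halving concrete, here is a minimal sketch of the size calculation. The helper name `largestPow2AtMost` is hypothetical, and the figures in the note below (16384 bytes per block, 100 bytes reserved) are illustrative assumptions rather than values queried from a device.

```cpp
#include <cstddef>

// Hypothetical helper: returns the largest power of 2 that is <= bytes,
// or 0 if bytes is 0. Intended for picking a usable shared memory size
// after some bytes have been subtracted from the per-block total.
std::size_t largestPow2AtMost( std::size_t bytes )
{
    if ( bytes == 0 ) return 0;
    std::size_t p = 1;
    // Double p while the doubled value would still fit in bytes.
    // Comparing against bytes / 2 avoids overflow when doubling.
    while ( p <= bytes / 2 ) p *= 2;
    return p;
}
```

With an assumed total of 16384 bytes, `largestPow2AtMost( 16384 - 100 )` gives 8192, so losing roughly 100 bytes costs half of the usable shared memory.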
Is there any way to reclaim this memory?
I’ve looked through the PTX assembly and didn’t see any extra declarations, so I figure the memory is being reserved by either the JIT or the runtime. Is there a flag I can pass to either of them to make them use global or local memory instead of shared memory? If not, will there be any consequences if I request slightly less and then clobber that small part of shared memory anyway?