Is shared memory released when it isn’t in use? For example, if a device function call allocates 6 KB of shared memory, reads data in from global memory, operates on it, and writes it back out, will that shared memory be unavailable to other blocks for the duration of the global function launch, or will it be released?
I have a global function which, due to unavoidable noncoalesced memory writes, may spend a lot of time idling. At one point, I have to shuffle the contents of three matrices which reside in global memory (this part can be coalesced). The data is associated across these matrices, so they have to be shuffled in the same way. At this point, my algorithm goes like this:
- Fill shift with random numbers in [0, 31].
- For each matrix:
  - Load one 32x32 tile of ints into shared memory.
  - Shift each row i by shift[i].
  - Transpose the tile.
  - Write it back to global memory.
A quick test shows that this somewhat naive shuffle works well enough for what I’m using it for, although if there’s a better implementation staring me in the face I’m happy to use it. However, I’m concerned about the shared memory. The other sections of the code use very little shared memory and not many registers, so I could potentially run many blocks per execution unit to hide the memory latency, but that would only seem to help if the scheduler knew it didn’t have to preserve the unused shared memory. Or am I reading something wrong?
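For reference, here is a minimal sketch of the per-tile shuffle described above, under some assumptions not stated in the post: one block per 32x32 tile with blockDim = (32, 32), a device array `shift` of 32 ints in [0, 31], and row-major matrices with a row pitch of `pitch` ints. The kernel name and parameters are hypothetical.

```cuda
// Sketch only: rotate each row of a 32x32 tile by shift[row],
// transpose it in shared memory, and write it back in place.
__global__ void shuffleTile(int *mat, const int *shift, int pitch)
{
    // The extra column pads each row so the transposed read
    // below doesn't hit shared-memory bank conflicts.
    __shared__ int tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    // Coalesced load: consecutive threads read consecutive columns,
    // rotating row threadIdx.y left by shift[threadIdx.y] as we
    // store into shared memory.
    int col = (threadIdx.x + shift[threadIdx.y]) & 31;
    tile[threadIdx.y][threadIdx.x] = mat[y * pitch + blockIdx.x * 32 + col];
    __syncthreads();

    // Coalesced store of the transposed tile back to the same
    // global location the block loaded from.
    mat[y * pitch + x] = tile[threadIdx.x][threadIdx.y];
}
```

Launched once per matrix (or with the three matrix pointers passed together), all three would be permuted identically because they share the same `shift` array.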
Thanks for your help.
I haven’t finished with this (porting flam3 to CUDA), so I can’t say for sure, but I’m pretty sure this will be true.