Shared memory: released when unneeded?

Is shared memory released when it isn’t being used? For example, if a device function call allocates 6K of shared memory, reads data in from global, operates on it, and writes it back out, will that shared memory be unavailable to other blocks for the duration of the global function launch, or will it be released?

Background:

I have a global function which, due to unavoidable noncoalesced memory writes, may[1] spend a lot of time idling. At one point, I have to shuffle the contents of three matrices residing in global memory (this part can be coalesced). The data is associated across the matrices, so they have to be shuffled in the same way. For that step, my algorithm goes like this (sketched in code after the list):

  • Fill shift[32] with random numbers in [0, 31].
  • For each matrix:
    • Load one 32x32 matrix of ints into shared.
    • Shift each row i by shift[i].
    • __syncthreads.
    • Transpose the matrix.
    • __syncthreads.
    • Write back to global.
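
In code, one pass looks roughly like this (just a sketch, with placeholder names shuffle_tile, d_mat, and d_shift: one 32x32 tile per block, a 32x8 thread block so each thread covers four rows, and the transpose folded into the write-back read):

    #define TILE 32
    #define ROWS 8   // blockDim.y; each thread handles TILE/ROWS = 4 rows

    __global__ void shuffle_tile(int *d_mat, const int *d_shift)
    {
        // One column of padding sidesteps shared memory bank conflicts
        // on the transposed reads at the end.
        __shared__ int tile[TILE][TILE + 1];

        const int x = threadIdx.x;
        const int base = blockIdx.x * TILE * TILE;  // one tile per block

        // Load the tile; adjacent threads touch adjacent words (coalesced).
        for (int y = threadIdx.y; y < TILE; y += ROWS)
            tile[y][x] = d_mat[base + y * TILE + x];
        __syncthreads();

        // Rotate each row y left by d_shift[y], wrapping within the row.
        int v[TILE / ROWS];
        for (int y = threadIdx.y, i = 0; y < TILE; y += ROWS, i++)
            v[i] = tile[y][(x + d_shift[y]) & (TILE - 1)];
        __syncthreads();
        for (int y = threadIdx.y, i = 0; y < TILE; y += ROWS, i++)
            tile[y][x] = v[i];
        __syncthreads();

        // Write back transposed: reading tile[x][y] does the transpose
        // while the global write itself stays coalesced.
        for (int y = threadIdx.y; y < TILE; y += ROWS)
            d_mat[base + y * TILE + x] = tile[x][y];
    }

    // Called once per matrix, e.g.:
    // shuffle_tile<<<num_tiles, dim3(TILE, ROWS)>>>(d_mat, d_shift);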

A quick test shows that this somewhat naive shuffle works well enough for what I’m using it for, although if there’s a better implementation staring me in the face, I’m happy to use it. However, I’m concerned about the shared memory: the other sections of code use very little shared memory and not too many registers, so I could potentially run many blocks per execution unit to hide the memory latency. But that only seems workable if the scheduler knows it doesn’t have to preserve a block’s unused shared memory. Or am I reading something wrong?

Thanks for your help.

[1] I haven’t finished with this (porting flam3 to CUDA), so I can’t say for sure, but I’m pretty sure this will be true.

Yes, the device will definitely reuse shared memory. If this didn’t happen, you’d die quickly with any kernel which used many blocks!

An MP may run one or more blocks at once, depending mostly on register and shared memory limits. More blocks are always more efficient if they fit… higher thread counts hide latencies.
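
For example, here’s the back-of-the-envelope arithmetic (every number below is an illustrative assumption, not a measurement of your kernel; on these parts each block is also charged a little shared memory for kernel parameters):

    /* Rough occupancy check: an MP runs as many whole blocks as fit
     * under BOTH budgets, so the limit is the smaller of the two ratios.
     * All numbers are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const int smem_per_mp  = 16 * 1024;      /* 16KB of shared per MP */
        const int smem_per_blk = 4 * 1024 + 256; /* 32x32 int tile, plus a
                                                    little overhead for
                                                    kernel parameters */
        const int regs_per_mp  = 16384;          /* assumed register file */
        const int regs_per_thr = 16;             /* assumed compile report */
        const int thr_per_blk  = 256;

        const int by_smem = smem_per_mp / smem_per_blk;                 /* 3 */
        const int by_regs = regs_per_mp / (regs_per_thr * thr_per_blk); /* 4 */
        printf("resident blocks per MP: %d\n",
               by_smem < by_regs ? by_smem : by_regs);                  /* 3 */
        return 0;
    }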

After one block finishes, another block (if any are waiting) is dropped in and gets the “old” memory. It’s super-efficient because every block of a kernel uses the exact same amount of shared memory, so the replacement is a simple swap.

What you can’t do is use shared memory dynamically, where a block chooses how much it needs at runtime, or allocates and frees it. That’s not the question you’re asking, but it’s a common FAQ.
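
What you can do is pick the amount once per launch, via the third argument of the launch configuration; every block in that launch then gets the same amount. A minimal sketch (the kernel name and sizes are illustrative):

    // Dynamic shared memory is sized once per *launch*, not per block:
    // every block gets the same amount, and none can grow, shrink, or
    // free it while running.
    extern __shared__ int buf[];   // sized by the launch configuration

    __global__ void uses_dynamic_smem(int n)
    {
        if (threadIdx.x < n)
            buf[threadIdx.x] = threadIdx.x;   // toy use of the buffer
    }

    // Host side: give each block 6KB of dynamic shared for this launch.
    // uses_dynamic_smem<<<grid_size, block_size, 6 * 1024>>>(n);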

Your real problem is that you’re using shared memory to hold a LOT of data… a 32x32 int matrix is 4K. So you can only run one block at a time on G80 and G90, and three at a time on G200. Ouch. You want to use as little shared memory as possible so that as many warps and/or blocks as possible can run… the more you run, the better you hide your global memory latency. It’s a tradeoff for sure. You may do a lot better dealing with smaller stripes of data, even if it means your writes aren’t coalesced… testing may be the only way to tell whether read/write throughput or latency is hurting you more, but with just one block running at once, I’d bet you’re latency limited right now.

There’s no difference in shared memory size between G80 and GT200: both have 16KB. GT200 has double the registers.

Yep! You’re right, I got them backwards in my head. Thanks for the catch.

Loading a 32x32 array into shared will still take up a lot of the shared space and be a big limitation, though… only 3 blocks could be resident at once. Testing would show whether you’re still latency or throughput limited.

Thanks, that made things a lot clearer. I assumed that it was possible to swap blocks in and out of a processor, much like threads on general-purpose CPUs, but I reread the documentation after reading your posts and understand now.