Shared memory: released when unneeded?

Is shared memory released when it isn’t being used? For example, if a device function call allocates 6K of shared memory, reads data in from global, operates on it, and writes it back out, will that shared memory be unavailable to other blocks for the duration of the global function launch, or will it be released?

Background:

I have a global function which, due to unavoidable noncoalesced memory writes, may[1] spend a lot of time idling. At one point, I have to shuffle the contents of three matrices residing in global memory (this part can be coalesced). The data is associated across the matrices, so they have to be shuffled in the same way. For that step, my algorithm goes like this (sketched in code after the list):

  • Fill shift[32] with random numbers in [0, 31].
  • For each matrix:
    • Load one 32x32 matrix of ints into shared.
    • Shift each row i by shift[i].
    • __syncthreads.
    • Transpose the matrix.
    • __syncthreads.
    • Write back to global.
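
In code, one pass looks roughly like this (just a sketch, with placeholder names shuffle_tile, d_mat, and d_shift: one 32x32 tile per block, a 32x8 thread block so each thread covers four rows, and the transpose folded into the write-back read):

    #define TILE 32
    #define ROWS 8   // blockDim.y; each thread handles TILE/ROWS = 4 rows

    __global__ void shuffle_tile(int *d_mat, const int *d_shift)
    {
        // One column of padding sidesteps shared memory bank conflicts
        // on the transposed reads at the end.
        __shared__ int tile[TILE][TILE + 1];

        const int x = threadIdx.x;
        const int base = blockIdx.x * TILE * TILE;  // one tile per block

        // Load the tile; adjacent threads touch adjacent words (coalesced).
        for (int y = threadIdx.y; y < TILE; y += ROWS)
            tile[y][x] = d_mat[base + y * TILE + x];
        __syncthreads();

        // Rotate each row y left by d_shift[y], wrapping within the row.
        int v[TILE / ROWS];
        for (int y = threadIdx.y, i = 0; y < TILE; y += ROWS, i++)
            v[i] = tile[y][(x + d_shift[y]) & (TILE - 1)];
        __syncthreads();
        for (int y = threadIdx.y, i = 0; y < TILE; y += ROWS, i++)
            tile[y][x] = v[i];
        __syncthreads();

        // Write back transposed: reading tile[x][y] does the transpose
        // while the global write itself stays coalesced.
        for (int y = threadIdx.y; y < TILE; y += ROWS)
            d_mat[base + y * TILE + x] = tile[x][y];
    }

    // Called once per matrix, e.g.:
    // shuffle_tile<<<num_tiles, dim3(TILE, ROWS)>>>(d_mat, d_shift);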

A quick test shows that this somewhat naive shuffle works well enough for what I’m using it for, although if there’s a better implementation staring me in the face, I’m happy to use it. However, I’m concerned about the shared memory: the other sections of code use very little shared memory and not too many registers, so I could potentially run many blocks per execution unit to hide the memory latency. But that only seems workable if the scheduler knows it doesn’t have to preserve a block’s unused shared memory. Or am I reading something wrong?

Thanks for your help.

[1] I haven’t finished with this (porting flam3 to CUDA), so I can’t say for sure, but I’m pretty sure this will be true.

Yes, the device will definitely reuse shared memory. If this didn’t happen, you’d die quickly with any kernel which used many blocks!

An MP may run one or more blocks at once, depending mostly on register and shared memory limits. More blocks are always more efficient if they fit… higher thread counts hide latencies.
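
For example, here’s the back-of-the-envelope arithmetic (every number below is an illustrative assumption, not a measurement of your kernel; on these parts each block is also charged a little shared memory for kernel parameters):

    /* Rough occupancy check: an MP runs as many whole blocks as fit
     * under BOTH budgets, so the limit is the smaller of the two ratios.
     * All numbers are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const int smem_per_mp  = 16 * 1024;      /* 16KB of shared per MP */
        const int smem_per_blk = 4 * 1024 + 256; /* 32x32 int tile, plus a
                                                    little overhead for
                                                    kernel parameters */
        const int regs_per_mp  = 16384;          /* assumed register file */
        const int regs_per_thr = 16;             /* assumed compile report */
        const int thr_per_blk  = 256;

        const int by_smem = smem_per_mp / smem_per_blk;                 /* 3 */
        const int by_regs = regs_per_mp / (regs_per_thr * thr_per_blk); /* 4 */
        printf("resident blocks per MP: %d\n",
               by_smem < by_regs ? by_smem : by_regs);                  /* 3 */
        return 0;
    }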

After one block finishes, another block (if any are waiting) is dropped in and gets the “old” memory. It’s super-efficient because every block of a kernel uses the exact same amount of shared memory, so the replacement is a simple swap.

What you can’t do is use shared memory dynamically, where a block chooses how much it needs at runtime, or allocates and frees it. That’s not the question you’re asking, but it’s a common FAQ.
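
What you can do is pick the amount once per launch, via the third argument of the launch configuration; every block in that launch then gets the same amount. A minimal sketch (the kernel name and sizes are illustrative):

    // Dynamic shared memory is sized once per *launch*, not per block:
    // every block gets the same amount, and none can grow, shrink, or
    // free it while running.
    extern __shared__ int buf[];   // sized by the launch configuration

    __global__ void uses_dynamic_smem(int n)
    {
        if (threadIdx.x < n)
            buf[threadIdx.x] = threadIdx.x;   // toy use of the buffer
    }

    // Host side: give each block 6KB of dynamic shared for this launch.
    // uses_dynamic_smem<<<grid_size, block_size, 6 * 1024>>>(n);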

Your real problem is that you’re using shared memory to hold a LOT of data… a 32x32 int matrix is 4K. So you can only run one block at a time on G80 and G90, and three at a time on G200. Ouch. You want to use as little shared memory as possible so that as many warps and/or blocks as possible can run… the more you run, the better you hide your global memory latency. It’s a tradeoff for sure. You may do a lot better dealing with smaller stripes of data, even if it means your writes aren’t coalesced… testing may be the only way to tell whether read/write throughput or latency is hurting you more, but with just one block running at once, I’d bet you’re latency limited right now.

There’s no difference in shared memory size between G80 and GT200: both have 16KB. GT200 has double the registers.

Yep! You’re right, I got them backwards in my head. Thanks for the catch.

Loading a 32x32 array into shared will still take up a lot of the shared space and be a big limitation, though… only 3 blocks could be resident at once. Testing would show whether you’re still latency or throughput limited.

Thanks, that made things a lot clearer. I assumed that it was possible to swap blocks in and out of a processor, much like threads on general-purpose CPUs, but I reread the documentation after reading your posts and understand now.