Using Shared Memory in CUDA C/C++

Originally published at: https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is…

When I try using multiple dynamic shared memory arrays with different types, in the way you've shown, I get nvcc (understandably) complaining: 'error: a value of type "float *" cannot be used to initialize an entity of type "int*"'. Am I doing something wrong?

Hi David, I had left out the required casts in the code you were referring to. I've updated it now. Try casting to the type of pointer you are assigning to, as the code now shows.
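
For readers hitting the same error, here is a minimal sketch of the casting pattern (my reconstruction, not the article's exact listing; the names nF, nI, sFloat, and sInt are illustrative):

// Sketch: one dynamic shared memory allocation partitioned into two
// differently typed views via explicit casts (nF floats, then nI ints).
extern __shared__ int s[];

__global__ void dynamicMixed(int nF, int nI)
{
    float *sFloat = (float*)s;             // first nF elements as floats
    int   *sInt   = (int*)&sFloat[nF];     // next nI elements as ints

    if (threadIdx.x < nF) sFloat[threadIdx.x] = 1.0f;
    if (threadIdx.x < nI) sInt[threadIdx.x]   = 1;
}

// Launch with enough dynamic shared memory for both arrays:
// dynamicMixed<<<grid, block, nF*sizeof(float) + nI*sizeof(int)>>>(nF, nI);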

Thanks Mark, I should have spotted that the cast needs to be explicit in CUDA! My code works as expected now.

If I am working with image data and I wish to copy entire rows in one shot, what type of syntax should I use?

Hi, do we need to align data in shared memory?

Hi, could you please explain how to use dynamic shared memory allocation in __device__ functions? Is there a way we can do something similar?

Dynamic allocation of shared memory is once-per-kernel. In other words, you specify the dynamic amount of memory per block in your kernel launch configuration, as the dynamic shared memory section above shows. But you can declare an unsized shared array anywhere in the code that the kernel runs -- in the __global__ function itself, or in a __device__ function. Or you can declare it in one function and pass it as an argument (by pointer) to another.
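
A minimal sketch of that last point (the identifiers are mine, for illustration; sizing assumptions are noted in the comments):

// The unsized extern declaration may appear in a __device__ function;
// it refers to the same per-block allocation sized at kernel launch.
__device__ void fillTile(float value)
{
    extern __shared__ float tile[];
    tile[threadIdx.x] = value;
}

__global__ void kernel(float value)
{
    fillTile(value);                 // helper writes the shared array
    __syncthreads();

    extern __shared__ float tile[];  // redeclaring here aliases the
    float x = tile[threadIdx.x];     // same memory the helper wrote
    if (x < 0.0f) tile[0] = x;       // keep the load from being elided
}

// Launch, sizing the allocation once for the whole kernel:
// kernel<<<grid, block, block.x * sizeof(float)>>>(1.0f);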

Hi Mark. In the second paragraph, you wrote 'Because it is on-chip, shared memory is much faster than local and global memory'. But I think local memory is faster than shared memory. Was that a slip of the pen?

Hi Zishu! I definitely meant what I wrote. :) CUDA "local" memory is basically where per-thread stack data goes, and it is in the same physical location as CUDA "global" memory. See the CUDA programming guide (https://docs.nvidia.com/cud.... You may be thinking of OpenCL, which uses the term "local" memory to refer to the equivalent of CUDA's "shared" memory. The difference in terminology is unfortunate.

Hi Mark. I've found the following message in the NVIDIA docs for compute capability 5.x devices (in fact, 2.x and 3.x are almost the same); here is the link: https://docs.nvidia.com/cud....

"A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined)."

Does it mean that there is no bank conflict within a warp? But you know that there is no bank conflict among warps, since there is only one warp executing at any time (http://stackoverflow.com/qu.... As a result, could I assume that there are no bank conflicts on such devices?

The key phrase in that sentence is "two threads that access any address within the same 32-bit word". A single 32-bit word is by definition provided by a single bank. But since multiple threads access the *same* word, it can be provided by the single bank in a single cycle. Bank conflicts only occur when *different* threads in the *same* warp access *different* words that map to the *same* bank. :) Does that help?
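
To make the distinction concrete, here is a small illustrative kernel (my example, not from the post), assuming 32 banks of 32-bit words:

__global__ void bankExamples()
{
    __shared__ float s[64];

    s[threadIdx.x % 64] = (float)threadIdx.x;   // populate (illustration)
    __syncthreads();

    // Broadcast, no conflict: every thread in the warp reads the SAME
    // 32-bit word, so one bank serves the whole warp at once.
    float a = s[0];

    // Conflict-free: stride-1 access; each thread in the warp reads a
    // different word in a different bank.
    float b = s[threadIdx.x % 32];

    // 2-way bank conflict: stride-2 access makes pairs of threads in the
    // same warp hit *different* words that map to the *same* bank.
    float c = s[(2 * threadIdx.x) % 64];

    if (a + b + c < 0.0f) s[0] = a;   // keep the loads from being elided
}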

All the doubts are gone. It does help. Thank you very much.

Hi Mark,

What is the default value of shared memory after declaration? For example:
...
__shared__ float sdata[4];
...

What are the values of sdata[0], sdata[1], sdata[2], and sdata[3] right after the above line of code?

Many thanks,
Thanasis

Same as any uninitialized variable: indeterminate. C99 section 6.7.8 "Initialization: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate."
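
If defined initial values are needed, initialize explicitly before use. A minimal sketch (the cooperative zeroing pattern here is illustrative):

__global__ void kernel()
{
    __shared__ float sdata[4];

    // Cooperative zero-initialization by the first four threads.
    if (threadIdx.x < 4)
        sdata[threadIdx.x] = 0.0f;
    __syncthreads();   // make the writes visible before anyone reads

    // ... sdata[] now holds defined values ...
}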

Hi Mark,
we are getting this alarm: 'Unrelease Share Memory Blocks(35026)'.

Sorry, I can't diagnose the problem without more information.

Hi Mark,

I am trying to use two shared memory arrays of the same data type in one kernel, and I am confused about how to do it. I tried the approach you mentioned above, but it is not working. Can you help me?

Hey, I have figured it out. Anyway, thank you for the information.
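
For anyone else who lands here: with two dynamic shared arrays of the same type, one common pattern is to partition a single extern allocation by offset. A minimal sketch (names and sizes are illustrative):

extern __shared__ float smem[];

__global__ void twoArrays(int n)
{
    float *a = smem;        // first n floats
    float *b = smem + n;    // next n floats

    if (threadIdx.x < n) {
        a[threadIdx.x] = 1.0f;
        b[threadIdx.x] = 2.0f;
    }
}

// Launch with room for both arrays:
// twoArrays<<<grid, block, 2 * n * sizeof(float)>>>(n);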

Hi Mark, thank you as usual for the great post. I am mainly interested in the cache/shared memory configuration: is it expensive to set that configuration on a per-kernel basis?
If I have an application running 20-30 CUDA kernels or more per frame, targeting real time (~24-30 fps), am I going to pay a heavy overhead to set the cache memory configuration per kernel?
For example, if I am not using shared memory at all, is it good practice to minimize the shared memory size and maximize L1 cache to try to get better performance?

Best Regards

M.
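
For reference, the per-kernel preference the question asks about is set with the CUDA runtime's cudaFuncSetCacheConfig. A minimal sketch (the kernel name myKernel is illustrative):

#include <cuda_runtime.h>

__global__ void myKernel() { /* uses no shared memory */ }

int main()
{
    // Ask the driver to favor L1 cache over shared memory for this
    // function. The preference is sticky per function, so it need not
    // be reapplied on every launch; it is also only a hint, which some
    // architectures ignore.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}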