Using Shared Memory in CUDA C/C++

Originally published at:

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is…

When I try using multiple dynamic shared memory arrays with different types, in the way you've shown, I get nvcc (understandably) complaining 'error: a value of type "float *" cannot be used to initialize an entity of type "int*" '. Am I doing something wrong?

Hi David, I had left out the required casts in the code you were referring to. I've updated it now. Try casting to the type of pointer you are assigning to, as the code now shows.

Thanks Mark, I should have spotted that needs to be explicit in CUDA! My code works as expected now.

if am working with an image data and I wish to copy entire rows in one shot what type of syntax should i use

Hi do we need to align data in shared memory?

Hi, could you please share how to extend the dynamic shared memory allocation in __device__ functions? Is there a way we can do something similar?

Dynamic allocation of shared memory is once-per-kernel. In other words, you specify the dynamic amount of memory per block in your kernel launch configuration, as the dynamic shared memory section above shows. But you can declare an unsized shared array anywhere in the code that the kernel runs -- in the __global__ function itself, or in a __device__ function. Or you can declare it in one function and pass it as an argument (by pointer) to another.

Hi Mark. In the second paragraph, you wrote 'Because it is on-chip, shared memory is much faster than local and global memory'. But I think local memory is faster than shared memory. Was that a slips of pen?

Hi Zishu! I definitely meant what I wrote. :) CUDA "local" memory is basically where per-thread stack data goes, and it is in the same physical location as CUDA "global" memory. See the CUDA programming guide ( You may be thinking of OpenCL, which uses the term "local" memory to refer to the equivalent of CUDA's "shared" memory. The difference in terminology is unfortunate.

Hi Mark. I've found the following messages in the nvidia docs of compute capability 5.x devices (in fact, 2.x and 3.x are almost the same, and here is the link:

"A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined)."

Dose it mean that there is no bank conflict in a warp? But you know that there is no bank conflict among warps since there is only one warp executing at any time ( As a result, could I assume that there is no bank conflict in such device?

The key phrase in that sentence is "two threads that access any address within the same 32-bit word". A single 32-bit word is by definition provided by a single bank. But since multiple threads access the *same* word, it can be provided by the single bank in a single cycle. Bank conflicts only occur when *different* threads in the *same* warp access *different* words that map to the *same* bank. :) Does that help?

All the doubts are gone. It does help. Thank you very much.

Hi Mark,

What is the default value of shared memory after initialization. For example
__shared__ float sdata[4];

what are the values of sdata[0], sdata[1], sdata[2], sdata[3] right after the above line of code?

Many thanks,

Same as any uninitialized variable: indeterminate. C99 section 6.7.8 "Initialization: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate."

Hi Mark
please we have this alarm
'Unrelease Share Memory Blocks(35026)

Sorry, I can't diagnose the problem without more information.

Hi mark,

I am trying to use two shared memory arrays in one kernel but with same data type. I am confused how to use it. I tried above approch you have mentioned but it is not working. can you help me?

Hey I have figured it out. Any way thank you for the information

Hi Mark, Thank you as usual fro the great post, I am mainly interested in the cache/shared memory configuration, I wonder is it expensive to set that configuration on a per kernel basis?
If I have an application running 20-30 cuda kernels or more per frame, with target real-time (24-30~ fps), am I going to pay a heavy overhead to set cache memory configuration per kernel?
For example if I am not using shared memory at all, is it good practice to minimize the shared memory size and maximize L1 cache to try get better performance?

Best Regards