Hi Marco,
As the API docs explain, currently running code may be forced to finish before the setting can be changed on the hardware. So changing frequently between kernels (flipping back and forth between max L1 and SMEM) could introduce significant pipeline bubbles depending on the running time of each kernel.
My advice is not to prematurely optimize for the size of L1 or SMEM: apply the setting only when you know it is beneficial (if you aren't bottlenecked by memory, it will have much less benefit), and test it for a performance benefit on each kernel and on each GPU architecture.
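For reference, a minimal sketch of setting the preference device-wide or per kernel with the CUDA runtime API (the kernel name is just a placeholder):
```
#include <cuda_runtime.h>

// Placeholder kernel, used only to illustrate the API.
__global__ void myKernel(float *data) { }

void configureCachePreference()
{
    // Device-wide preference: applies to kernels without a per-kernel setting.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // Per-kernel preference: overrides the device-wide setting for this kernel only.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```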
Yes, thank you for the reply. I found later in the docs that changing this setting might trigger a device sync. The kind of workload we have currently does not leverage shared memory, because of the algorithms it deals with, which would make it easy to set the option card-wide for some initial benchmarking. Thank you for getting back to me, and, as usual, thank you for the great article.
Hi Mark. A very interesting article, and very useful for learning. But I'd like to know how you would reorder a bigger array. For example, if my device has a maximum of 1024 threads per block, does that mean I can only reorder a 1024-element array at most?
Many thanks.
Don't confuse this with Linux shared memory (quite a useful way of managing data transfers to the device) when accelerating real-world, already-written code. In theory it can also be combined with the Java programming language, but I am not sure. A trivial C++ example:
https://github.com/PiotrLen...
Post scriptum: quite useful for distributed computations in client-server applications.
Hey did you figure out the answer?
Hi Mark. The weirdest thing is happening. I declared a shared array in a global kernel, set some values into it, and whenever I try to access it, it returns a value of zero. The only time it returns a value is if I access the shared array with the thread index. Is this common? My head's seriously spinning over this.
It's hard to debug code I can't see. If you are writing to the location with one thread and reading the same location from another, then you must synchronize between the accesses (__syncthreads()), or else you have a race condition which results in undefined behavior.
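For example, a minimal sketch of that pattern:
```
__global__ void broadcastFromThreadZero(int *out)
{
    __shared__ int value;

    if (threadIdx.x == 0)
        value = 42;           // written by a single thread

    __syncthreads();          // make the write visible to the rest of the block

    out[threadIdx.x] = value; // without the barrier, other threads may read stale data
}
```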
Thanks for answering. I did __syncthreads() before and after, and I also wrapped the write in an "if (id == 0)" condition, to no avail. I suspected a bad installation on my end, but before reinstalling Visual Studio and CUDA, I changed the __shared__ array to a normal one stored in DRAM, since it will only be accessed sqrt(n) times in total per execution. Thank you for your time.
Thank you Mr. Harris, these discussions are very helpful...
But my question is: what if I want to use the static reverse function in different streams?
How should I specify the size of shared memory?
I think it would be something like this:
<<<k, t, 64*sizeof(int), s1>>>(...)
But after specifying the size of shared memory, it seems like I'm using the dynamic reverse version! Is that true?
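(For reference, a sketch of both launch forms, with placeholder kernel and stream names loosely following the reverse example: the third launch parameter only sizes `extern __shared__` arrays, so specifying it does not switch you to the dynamic version, and the stream goes in the fourth position either way.)
```
#include <cuda_runtime.h>

__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];          // size fixed at compile time
    int t = threadIdx.x, tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];     // size comes from the launch configuration
    int t = threadIdx.x, tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}

void launchInStream(int *d_d, cudaStream_t s1)
{
    // Static version: the fixed-size array ignores the third parameter,
    // so pass 0 there and put the stream in the fourth slot.
    staticReverse<<<1, 64, 0, s1>>>(d_d, 64);

    // Dynamic version: here 64*sizeof(int) actually sizes s[].
    dynamicReverse<<<1, 64, 64 * sizeof(int), s1>>>(d_d, 64);
}
```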
Hi Mark, I tried your dynamic allocation approach for multiple arrays. But the compiler says nC and nF are undefined. Should I define them before calling the kernel?
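(For reference, a minimal sketch of that approach, assuming the multi-array example from the post: nI, nF, and nC are ordinary host variables defined before the launch and passed to the kernel; the sizes and kernel name below are made up.)
```
#include <cuda_runtime.h>

// Partition one dynamic shared allocation into three sub-arrays.
__global__ void myKernel(int nI, int nF, int nC)
{
    extern __shared__ int s[];
    int   *integerData = s;                          // nI ints
    float *floatData   = (float *)&integerData[nI];  // nF floats
    char  *charData    = (char *)&floatData[nF];     // nC chars
    // ... use the three sub-arrays ...
    (void)charData;                                  // silence unused-variable warning in this sketch
}

void launch()
{
    int nI = 128, nF = 64, nC = 32;  // element counts, chosen by the host code

    size_t smemBytes = nI * sizeof(int) + nF * sizeof(float) + nC * sizeof(char);
    myKernel<<<1, 128, smemBytes>>>(nI, nF, nC);
}
```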
Are bank conflicts still something to look out for in the newest architectures (Turing, Pascal etc.)?
Yes, although in the grand scheme of things they are a micro-optimization in most kernels.
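For example, a minimal sketch of the classic case and the usual padding workaround (assuming a 32x32 thread block, e.g. `readColumns<<<1, dim3(32, 32)>>>(d_out, d_in)`):
```
#define TILE_DIM 32

// Consecutive threads of a warp read down a column of the tile. With a plain
// [32][32] array they would all hit the same bank (a 32-way conflict); the +1
// padding shifts each row into a different bank, removing the conflict.
__global__ void readColumns(float *out, const float *in)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * TILE_DIM + x];   // row-wise load: no conflicts
    __syncthreads();

    out[y * TILE_DIM + x] = tile[x][y];  // column-wise read: conflict-free thanks to the padding
}
```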
No, you can launch a LOT of blocks. And loops also work just fine in CUDA C/C++. So your problem size is not limited.
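For example, a rough sketch (not from the post) that reverses an array much larger than one block by giving each block its own shared-memory tile; n is assumed to be a multiple of the block size:
```
// Each block stages blockDim.x elements in shared memory, reverses them, and
// writes the tile to the mirrored position in the output array.
__global__ void reverseLarge(int *out, const int *in, int n)
{
    extern __shared__ int tile[];                  // blockDim.x ints per block

    int tid  = threadIdx.x;
    int gIdx = blockIdx.x * blockDim.x + tid;      // this thread's input element

    tile[tid] = in[gIdx];
    __syncthreads();

    // The reversed tile lands at the mirrored block position in the output.
    int outStart = n - (blockIdx.x + 1) * blockDim.x;
    out[outStart + tid] = tile[blockDim.x - tid - 1];
}

// Host side (sizes are made up): e.g. 4096 blocks of 256 threads, n = 4096 * 256.
// reverseLarge<<<4096, 256, 256 * sizeof(int)>>>(d_out, d_in, n);
```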
Hello Mark
This is quite informative.
Could you please specify which metrics from the profiler tools can hint at shared memory bank conflicts?
Also, I have been trying to figure out which metrics could signify cache misses in a CUDA application. It would be really helpful if you could tell me which ones to look at!
In Nsight Compute, you can collect e.g. the `Memory Workload Analysis Tables` section, which includes detailed information on shared memory usage. https://uploads.disquscdn.c...
The Raw page will show you exactly which metrics are collected as part of this `group:memory__shared_table`. The exact metrics can change depending on which GPU is targeted, e.g.:
```
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum
l1tex__data_pipe_lsu_wavefronts_mem_shared_cmd_read.sum
l1tex__data_pipe_lsu_wavefronts_mem_shared_cmd_read.sum.pct_of_peak_sustained_active
l1tex__data_pipe_lsu_wavefronts_mem_shared_cmd_write.sum
l1tex__data_pipe_lsu_wavefronts_mem_shared_cmd_write.sum.pct_of_peak_sustained_active
sass__inst_executed_shared_loads
sass__inst_executed_shared_stores
smsp__inst_executed_op_shared_atom.sum
```
From my understanding, there are 4 warp schedulers per SM, which means 4 warps can execute concurrently on a single SM, if possible. If you use 32-bit mode as in [1] on a device that supports 64-bit transactions, the docs say that no bank conflict is created when two 32-bit addresses fall within the same 64-bit word, since it maps to one memory bank and can be multicast to the two threads in the same warp. This means only 16 banks in total need to be accessed by one warp.
My question is thus: is it possible for another warp to access the latter 16 banks concurrently? I.e. will using 32-bit floats double my throughput from shared memory when compared to using 64-bit floats? (in case it makes a difference I’m using a C.C. 7.5 device)
[1] Programming Guide :: CUDA Toolkit Documentation
Upon further reading, I discovered that 64-bit mode is only supported on C.C. 3.x devices, and C.C. 5.0 and newer only support 32-bit mode. So in my case (C.C. 7.5), using doubles will result in bank conflicts, and 2 transactions from shared memory will be required.
[1] Best Practices Guide :: CUDA Toolkit Documentation