Hello, just checking to make sure that when the addresses are in __shared__, it’s a matter of which banks the threads of a warp are accessing, not whether all of the addresses fit neatly in a contiguous range of 32 elements. Is that correct? If my threads are accessing indices 5, 38, 71, 104, …, 995, 1028 of an array (for simplicity, the array elements are four bytes each), those accesses are going to be handled by banks 5, 6, 7, …, 31, 0, 1, 2, 3, 4. That should be the same performance as a warp accessing elements 5, 6, 7, 8, …, 35, 36 of the same array, no? Of course, when accessing global memory the latter is OK (best would be 0, 1, …, 30, 31), but the former is the absolute worst: every thread pulls in its own cache line.
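To make the pattern concrete, here’s a toy kernel (hypothetical; the array size and names are just for illustration) doing the stride-33 shared read I have in mind, with the bank arithmetic printed on the host side:

```
// Sketch of the stride-33 pattern described above (illustrative only).
// For 4-byte words, bank = (word index) % 32, so indices 5, 38, 71, ...
// land in banks 5, 6, 7, ..., 31, 0, ..., 4 -- all distinct.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stride33(const float *in, float *out)
{
    __shared__ float tile[1056];              // 32 threads * stride 33 = 1056 words

    // Stage data into shared memory with contiguous (coalesced) global loads.
    for (int i = threadIdx.x; i < 1056; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();

    // Each thread reads word 5 + 33*lane: 32 requests to 32 distinct banks,
    // so the access is conflict-free even though the addresses are not contiguous.
    int idx = 5 + 33 * threadIdx.x;
    out[threadIdx.x] = tile[idx];
}

int main()
{
    float *in, *out;
    cudaMallocManaged(&in, 1056 * sizeof(float));
    cudaMallocManaged(&out, 32 * sizeof(float));
    for (int i = 0; i < 1056; ++i) in[i] = (float)i;

    stride33<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();

    for (int t = 0; t < 32; ++t)
        printf("thread %2d reads index %4d (bank %2d) -> %g\n",
               t, 5 + 33 * t, (5 + 33 * t) % 32, out[t]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```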
Correct. For maximum throughput to shared memory, the rule, considering a warp-wide access, is that no more than one item per bank should be requested. It is not necessary that all addresses be contiguous. Shared memory also generally has the broadcast rule: if there are multiple requests to the same bank, but they are all to the same location, efficiency is not reduced. A particular location can be broadcast to multiple threads in a warp, per transaction, at no additional cost.
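As a minimal sketch of both points (hypothetical kernel, names made up): a broadcast read, where the same bank and the same location cost nothing extra, next to a 2-way conflict, where the same bank but different locations serialize the request:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bank_examples(const float *in, float *out)
{
    __shared__ float tile[64];
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();

    // Broadcast: all 32 threads read the same word (bank 7).
    // Same bank AND same location -> served in one transaction.
    float a = tile[7];

    // 2-way conflict: even lanes read word 0, odd lanes read word 32.
    // Both words sit in bank 0 but are different locations, so the
    // request is split into two serialized transactions.
    float b = tile[(threadIdx.x & 1) * 32];

    out[threadIdx.x] = a + b;
}

int main()
{
    float *in, *out;
    cudaMallocManaged(&in, 64 * sizeof(float));
    cudaMallocManaged(&out, 32 * sizeof(float));
    for (int i = 0; i < 64; ++i) in[i] = (float)i;

    bank_examples<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %g, out[1] = %g\n", out[0], out[1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```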
Thanks, Robert. I’ve been doing some head math and thinking that I may be piling things up, much more than is needed, on a handful of the __shared__ banks. And there’s another thing I want to do that will be even more powerful if I don’t need to worry about keeping all the accesses adjacent per se.