Shared memory write performance

rgirish · April 17, 2017, 10:58pm

Is there a difference in shared memory write performance if the addresses within a warp are sequential vs. random? There are no bank conflicts in either case.

For example, assuming a float write:
Sequential array indexes in a warp: 0,1,2,3,4,5…31
Random array indexes in a warp: 20,21,30,0,16,…1,2

Neither access pattern results in a bank conflict, but is there any difference in performance?

Robert_Crovella · April 17, 2017, 11:05pm

Not for newer GPUs.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x__examples-of-irregular-shared-memory-accesses

rgirish · April 17, 2017, 11:46pm

I have seen that link. It doesn't clearly say anything about write performance. The GPU HW could optimize if the addresses across the threads are sequential, for example: send the base address and type to the SLM unit and get all the data, which saves bus bandwidth. If the addresses are jumbled across threads, it gets tricky for the HW to optimize.

BulatZiganshin · April 18, 2017, 12:01am

afaik, shared memory has 32 banks, which are essentially just independent memory spaces. Each bank can perform 1 read or 1 write per cycle, and this doesn't depend on what the other banks are doing.
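
To make the bank model above concrete, here is a minimal host-side sketch in C++. It assumes 32 banks that are 4 bytes wide, with successive 32-bit words mapped to successive banks, as the programming guide describes for compute capability 5.x. All names (`bank_of`, `max_conflict_degree`, the index generators) are hypothetical helpers made up for illustration. It models only the address-to-bank mapping and says nothing about real hardware timing.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed model: 32 banks, 4 bytes wide, successive 32-bit words in
// successive banks. These constants are assumptions of this sketch.
constexpr int kNumBanks = 32;
constexpr int kBankWidthBytes = 4;
constexpr int kWarpSize = 32;

// Bank that serves a given byte offset into shared memory.
constexpr int bank_of(std::uint32_t byte_offset) {
    return static_cast<int>((byte_offset / kBankWidthBytes) % kNumBanks);
}

// Largest number of distinct 32-bit words that any one bank has to serve
// for a warp-wide float store. A result of 1 means conflict-free.
int max_conflict_degree(const std::array<int, kWarpSize>& float_index) {
    std::array<int, kNumBanks> words_per_bank{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // Lanes that hit the same word do not add a second request to the bank.
        bool seen = false;
        for (int prev = 0; prev < lane; ++prev) {
            if (float_index[prev] == float_index[lane]) { seen = true; break; }
        }
        if (seen) continue;
        const auto offset =
            static_cast<std::uint32_t>(float_index[lane] * sizeof(float));
        ++words_per_bank[bank_of(offset)];
    }
    return *std::max_element(words_per_bank.begin(), words_per_bank.end());
}

// Lane i writes element i: 0, 1, 2, ..., 31.
std::array<int, kWarpSize> sequential_indices() {
    std::array<int, kWarpSize> idx{};
    for (int lane = 0; lane < kWarpSize; ++lane) idx[lane] = lane;
    return idx;
}

// A scrambled permutation of 0..31 (7 is coprime with 32, so every
// index appears exactly once): 20, 27, 2, 9, ...
std::array<int, kWarpSize> permuted_indices() {
    std::array<int, kWarpSize> idx{};
    for (int lane = 0; lane < kWarpSize; ++lane) idx[lane] = (lane * 7 + 20) % kWarpSize;
    return idx;
}

// Lane i writes element i * stride, the classic way to create conflicts.
std::array<int, kWarpSize> strided_indices(int stride) {
    std::array<int, kWarpSize> idx{};
    for (int lane = 0; lane < kWarpSize; ++lane) idx[lane] = lane * stride;
    return idx;
}
```

Under this model, `max_conflict_degree(sequential_indices())` and `max_conflict_degree(permuted_indices())` both return 1, while `max_conflict_degree(strided_indices(2))` returns 2. In other words, the model sees no difference between the sequential and the jumbled pattern as long as each lane lands in its own bank.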

rgirish · April 18, 2017, 12:10am

It's not related to SLM bank performance; that is the same in both cases. In the case of a write, CUDA cores need to send both the address and the data to shared memory. I'm wondering whether any address optimization happens in the sequential case so that fewer addresses are sent. I just see an STS instruction generated for both cases. Has anyone seen any variations of the STS instruction?

njuffa · April 18, 2017, 12:37am

What kind of variations are you thinking of?

rgirish · April 18, 2017, 4:58pm

Something like an STS.S, if the compiler could determine that the addresses are sequential.
