Shared memory write performance

Is there a difference in shared memory write performance if the addresses within a warp are sequential vs. random, assuming no bank conflicts in either case?

For example, assuming a float write:
Sequential array indexes in a warp: 0,1,2,3,4,5…31
Random array indexes in a warp: 20,21,30,0,16,…1,2

Neither access pattern results in a bank conflict. But is there any difference in performance?
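For concreteness, the two patterns might look like the sketch below (this is not the poster's actual code; `(t * 7) & 31` is just one arbitrary conflict-free permutation standing in for the "random" index list):

```cuda
// Two kernels writing one float per thread into shared memory,
// differing only in the index pattern. Both are conflict-free.
__global__ void seq_write(float *out)
{
    __shared__ float smem[32];
    int t = threadIdx.x;        // 0..31 within the warp
    smem[t] = (float)t;         // sequential: thread t -> element t
    __syncwarp();
    out[t] = smem[t];
}

__global__ void perm_write(float *out)
{
    __shared__ float smem[32];
    int t = threadIdx.x;
    int idx = (t * 7) & 31;     // a permutation of 0..31: still one
                                // element per bank, so no conflicts
    smem[idx] = (float)t;       // "random" (permuted) write pattern
    __syncwarp();
    out[t] = smem[t];
}
```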

Not for newer GPUs.

I have seen that link; it doesn't clearly address write performance. The GPU HW could optimize when the addresses across threads are sequential, for example: send just the base address and data type to the SLM unit and transfer all the data, saving bus bandwidth. If the addresses are jumbled across threads, it gets trickier for the HW to optimize.

afaik, shared memory has 32 banks, which are essentially just independent memory spaces. Each bank can perform 1 read or 1 write per cycle, and this doesn't depend on what the other banks are doing.
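Under that model (32 banks, each 4 bytes wide, which is an assumption about the architecture being discussed), the bank a float write lands in is just `word_index % 32`, so any permutation of indexes 0..31 is as conflict-free as the sequential case. A quick sketch:

```python
# Bank index for a 4-byte (float) element, assuming 32 banks
# of 4-byte width: bank = word_index % 32.
NUM_BANKS = 32

def banks(indexes):
    """Return the bank each thread's float write lands in."""
    return [i % NUM_BANKS for i in indexes]

sequential = list(range(32))                  # threads 0..31 write elements 0..31
permuted = [(t * 7) % 32 for t in range(32)]  # an arbitrary permutation of 0..31

# Both patterns touch 32 distinct banks, so neither has a conflict.
assert len(set(banks(sequential))) == NUM_BANKS
assert len(set(banks(permuted))) == NUM_BANKS
```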

It's not related to SLM bank performance; that is the same in both cases. For a write, the CUDA cores need to send both the address and the data to shared memory. I'm wondering whether any address optimization happens in the sequential case so that fewer addresses are sent. I just see an STS instruction generated in both cases. Has anyone seen any variations of the STS instruction?

What kind of variations are you thinking of?

Something like an STS.S, if the compiler could determine that the addresses are sequential.