Bytes in shared memory

rgirish · April 19, 2017, 6:54pm

If I put a character array in shared memory, do I get the full write bandwidth? The hardware can do 32 banks * 4 bytes per clock. However, for 1-byte characters, every 4 consecutive accesses go to the same bank. Will hardware combine the 4 and treat it as a 4 byte access?
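To make the mapping in the question concrete, here is a minimal sketch of the bank arithmetic. It assumes the 32 banks of 4 bytes each mentioned above; the names `bank_of` and `words_touched_by_warp` are hypothetical helpers for illustration, not part of any CUDA API.

```cuda
#include <cassert>

// Assumption: 32 banks, each 4 bytes wide, as stated in the question.
constexpr unsigned kNumBanks = 32;
constexpr unsigned kBankWidthBytes = 4;

// Bank that a byte offset into shared memory maps to (hypothetical helper).
__host__ __device__ constexpr unsigned bank_of(unsigned byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}

// Number of distinct 32-bit words touched when the 32 threads of a warp
// each access one consecutive byte, starting at byte offset `base`.
inline unsigned words_touched_by_warp(unsigned base) {
    unsigned first_word = base / kBankWidthBytes;
    unsigned last_word = (base + 31) / kBankWidthBytes;
    return last_word - first_word + 1;
}
```

With a word-aligned base, bytes 0 to 3 all land in bank 0, bytes 4 to 7 in bank 1, and a full warp of consecutive byte accesses touches 8 words in 8 different banks.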

Robert_Crovella · April 19, 2017, 7:18pm

Yes, that is covered in the programming guide as well, similar to your last question.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-2-x

I imagine, similar to your last question, you will argue whether “no bank conflicts” also implies maximum efficiency, supposing that there is some hidden, undocumented additional optimization that can be done by the compiler; I won’t be able to speak to that.

Specifically with respect to this statement:

“Will hardware combine the 4 and treat it as a 4 byte access ?”

The programming guide uses a character array as an example, and says:

“A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (multiple words can be broadcast in a single transaction)”

rgirish · April 19, 2017, 8:26pm

Thanks txbob. My question was about the note on char writes in the programming guide:
“and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).”

According to this, for code like
extern __shared__ char shared[];
shared[threadIdx.x] = data;
within a warp, only 8 threads will actually do the write.

In other words, the HW or the compiler will combine the addresses within the same 32-bit word going to one bank.

Robert_Crovella · April 19, 2017, 8:54pm

Yes, the writes will be coalesced. Pay careful attention here: the word “address” in that statement means byte address.

Note the terminology:

"any address within the same 32-bit word "

So for each byte address, only one writing thread will win. However, separate byte addresses written within the same 32-bit word will be coalesced by the hardware, in the same way that all other forms of coalescing occur. (You could easily test this, right?)
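One way to run such a test is sketched below. This is a minimal example, not a definitive benchmark: the kernel name `byte_store_kernel` and the host helper `run_byte_store_test` are hypothetical, and the sketch only checks that every byte written by a different thread survives. Whether the compiler actually emits a one-byte shared store is an assumption to verify, for example by inspecting the SASS with `cuobjdump -sass`, and the number of transactions would have to be read from a profiler.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores one byte to dynamically sized shared memory,
// then reads it back after a barrier and writes it to global memory.
__global__ void byte_store_kernel(const char* in, char* out, int n) {
    extern __shared__ char smem[];
    int tid = threadIdx.x;
    if (tid < n) smem[tid] = in[tid];   // one-byte shared store per lane
    __syncthreads();
    if (tid < n) out[tid] = smem[tid];
}

// Hypothetical host helper: launches one block of n threads and returns
// true if all n bytes made the round trip through shared memory intact.
bool run_byte_store_test(int n) {
    assert(n > 0 && n <= 1024);
    std::vector<char> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<char>('a' + i % 26);

    char* d_in = nullptr;
    char* d_out = nullptr;
    if (cudaMalloc(&d_in, n) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    // One block, n threads, n bytes of dynamic shared memory.
    byte_store_kernel<<<1, n, n>>>(d_in, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_out.data(), d_out, n,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok && h_out == h_in;
}
```

Calling `run_byte_store_test(32)` exercises a single warp in which every group of 4 threads writes into the same 32-bit word.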

Remember (except for atomics) the actual transaction size is always either 32 bytes or 128 bytes. The transactions are coalesced to one of these two sizes, per instruction, across a warp.

You’ve also mentioned the compiler a few times when talking about inter-thread behavior. I suggest this is a non-concept. The compiler works with the programming model of a thread, not across threads. The compiler knows nothing about the fact that there are other adjacent threads (or not).

In your example, I don’t agree with this claim:

“With in a warp, only 8 threads in a warp will actually do the write”

rgirish · April 19, 2017, 9:10pm

“You could easily test this, right?”

Yes, I tested it and it does coalesce. What I can’t say for sure is whether the compiler detected the sequentiality between threads and generated a specific STS instruction, or whether the HW did the coalescing.

“The compiler knows nothing about the fact that there are other adjacent threads (or not).”

Well, threads in CUDA are just SIMD lanes, which are 32 wide. The compiler will definitely look for optimizations that can be done across lanes. Uniform loads are the best example.

njuffa · April 19, 2017, 9:18pm

The compiler has a single-thread view of the world, because at compile time, nothing can be known about the run-time configuration (blocks, threads) used to execute the code.

For example, the compiler will track data dependencies strictly based on matching up addresses, and knows nothing about implicit inter-thread data dependencies as they frequently occur in reductions. In the absence of explicitly expressed data or control dependencies, the compiler is free to re-arrange operations that have been established as independent based on code analysis.
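As an illustration of such an implicit inter-thread dependency, here is a minimal sketch of a shared-memory sum reduction. The names `block_sum`, `sum_on_device` and `kBlockSize` are hypothetical, and the block size of 256 is an assumption. Each step reads a value that a different thread wrote in the previous step; nothing in the single-thread code expresses that, so the barrier is what makes the dependency explicit.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed power-of-two block size

// Sums kBlockSize integers within one thread block.
__global__ void block_sum(const int* in, int* out) {
    __shared__ int partial[kBlockSize];
    int tid = threadIdx.x;
    partial[tid] = in[tid];
    __syncthreads();
    for (int stride = kBlockSize / 2; stride > 0; stride >>= 1) {
        // partial[tid + stride] was written by another thread.
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();  // explicit barrier orders the cross-thread accesses
    }
    if (tid == 0) *out = partial[0];
}

// Hypothetical host helper: returns the device-computed sum, or -1 on failure.
int sum_on_device(const std::vector<int>& values) {
    assert(values.size() == static_cast<size_t>(kBlockSize));
    int* d_in = nullptr;
    int* d_out = nullptr;
    int result = -1;
    if (cudaMalloc(&d_in, kBlockSize * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, values.data(), kBlockSize * sizeof(int),
               cudaMemcpyHostToDevice);
    block_sum<<<1, kBlockSize>>>(d_in, d_out);
    if (cudaMemcpy(&result, d_out, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) result = -1;
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Without the barrier inside the loop, the address-based dependency tracking described above gives the compiler no reason to keep one thread's read after another thread's write.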

In general, processing data one byte at a time is inefficient because it drives up the dynamic instruction count. That is true for any processor using registers that are more than one byte wide. SIMD-in-a-register techniques can often be used to speed-up byte-wise processing.

rgirish · April 19, 2017, 10:16pm

Right, the compiler doesn’t know the run-time block/thread configuration. However, it knows there are 32 threads within a warp and can do optimizations. What I’m talking about is optimizations within a warp (SIMD32 lanes), not across warps.

Yes, one byte at a time is inefficient. Hence they could be doing compiler coalescing within a warp, or HW merging.

njuffa · April 19, 2017, 11:28pm

The compiler already knows how to vectorize loads and stores. However, since loads and stores have to be naturally aligned on GPUs (this is, for the most part, different from x86), it can only do so if it can prove proper alignment of the wider access. That is not always possible. Even with vectorized loads and stores you would still be looking at extra instructions to handle data one byte at a time.
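When the programmer knows the alignment that the compiler cannot prove, the wider access can be written by hand. The following is a minimal sketch under that assumption; `is_word_aligned`, `stage_as_words` and `launch_stage_as_words` are hypothetical names. Each thread moves four bytes through shared memory with a single 32-bit access, and the host wrapper refuses to launch if the pointers are not 4-byte aligned.

```cuda
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

// True if p is aligned to a 32-bit word boundary (hypothetical helper).
__host__ __device__ inline bool is_word_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % sizeof(unsigned int) == 0;
}

// Copies n_words 32-bit words through shared memory, four bytes per thread.
// Assumes `in` and `out` are 4-byte aligned; the caller must guarantee it.
__global__ void stage_as_words(const unsigned char* in, unsigned char* out,
                               int n_words) {
    extern __shared__ unsigned int smem_words[];
    const unsigned int* in32 = reinterpret_cast<const unsigned int*>(in);
    unsigned int* out32 = reinterpret_cast<unsigned int*>(out);
    int tid = threadIdx.x;
    if (tid < n_words) smem_words[tid] = in32[tid];  // one 32-bit shared store
    __syncthreads();
    if (tid < n_words) out32[tid] = smem_words[tid];
}

// Hypothetical host wrapper: checks size and alignment before launching.
// d_in and d_out must be device pointers; n_bytes must be a multiple of 4
// and small enough for a single block (at most 1024 words).
bool launch_stage_as_words(const unsigned char* d_in, unsigned char* d_out,
                           int n_bytes) {
    if (n_bytes <= 0 || n_bytes % 4 != 0 || n_bytes > 4096) return false;
    if (!is_word_aligned(d_in) || !is_word_aligned(d_out)) return false;
    int n_words = n_bytes / 4;
    stage_as_words<<<1, n_words, n_bytes>>>(d_in, d_out, n_words);
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

The explicit alignment check stands in for the proof the compiler would need; the byte-wise unpacking after the load is still extra work, as noted above.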

I have worked on C-string libraries for x86 and SPARC, and that code operates on the data 4 bytes or 8 bytes at a time as much as possible. For image processing and gene sequencing, CUDA’s 32-bit byte-wise SIMD intrinsics often work well and are a perfect match for the 32-bit registers.
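A minimal sketch of the byte-wise SIMD approach, using the `__vadd4` intrinsic, which adds four packed bytes per 32-bit register with wrap-around in each byte lane. The kernel name `add_bytes_packed` and the host function `vadd4_reference` are hypothetical; the reference is only there to spell out the per-byte semantics assumed here.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Adds four packed unsigned bytes per thread with one intrinsic call.
__global__ void add_bytes_packed(const unsigned int* a, const unsigned int* b,
                                 unsigned int* c, int n_words) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words) c[i] = __vadd4(a[i], b[i]);
}

// Host-side reference for the assumed semantics: each byte lane is added
// independently and wraps modulo 256, with no carry into the next lane.
inline unsigned int vadd4_reference(unsigned int a, unsigned int b) {
    unsigned int r = 0;
    for (int k = 0; k < 4; ++k) {
        unsigned int sum = ((a >> (8 * k)) & 0xFFu) + ((b >> (8 * k)) & 0xFFu);
        r |= (sum & 0xFFu) << (8 * k);
    }
    return r;
}
```

Processing 4 bytes per instruction this way cuts the dynamic instruction count compared with a loop that handles one `char` at a time.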

The magical do-what-I-mean optimization in compilers remains elusive. If you really care for performance, at some point you will have to write code accordingly.

rgirish · April 19, 2017, 11:45pm

I made the compiler generate an STS.U8 (1-byte store per lane) instruction. This results in 1 SLM transaction, which confirms that the HW is coalescing the bytes into 32-bit chunks.

If HW was not coalescing it would have resulted in 8 SLM transactions for a STS.U8 instruction.
