If I put a character array in shared memory, do I get the full write bandwidth? The hardware can do 32 banks * 4 bytes per clock. However, for 1-byte characters, every 4 consecutive accesses go to the same bank. Will the hardware combine the 4 and treat it as a 4-byte access?
Yes, that is covered in the programming guide as well, similar to your last question.
I imagine, similar to your last question, you will argue whether “no bank conflicts” also implies maximum efficiency, supposing that there is some hidden, undocumented additional optimization that can be done by the compiler; I won’t be able to speak to that.
Specifically with respect to this statement:
“Will hardware combine the 4 and treat it as a 4 byte access ?”
The programming guide uses a character array as an example, and says:
“A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (multiple words can be broadcast in a single transaction)”
Thanks txbob. My question was about the char-writes note from the programming guide:
"and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined). "
According to this, for code like

extern __shared__ char shared[];
shared[threadIdx.x] = data;

Within a warp, only 8 threads will actually do the write.
In other words, the HW or compiler will combine the addresses within the same 32-bit word going to one bank.
Yes, the writes will be coalesced. Pay careful attention here, as “address” here means byte address.
Note the terminology:
“any address within the same 32-bit word”
So for each byte address, only one writing thread will win. However, separate byte addresses written within the same 32-bit word will be coalesced by the hardware, in the same way that all other forms of coalescing occur. (You could easily test this, right?)
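For illustration, a minimal sketch of such a test (hypothetical kernel and buffer names; the shared-store transaction count would be read from a profiler such as Nsight Compute):

```cuda
// Hypothetical test kernel: 32 threads each store one byte into shared
// memory. If the hardware coalesces byte stores that fall in the same
// 32-bit word, the whole warp's stores should complete in a single
// shared-memory transaction (check the profiler's shared-store metric).
__global__ void byteWriteTest(char *out)
{
    __shared__ char smem[32];
    smem[threadIdx.x] = (char)threadIdx.x;  // one byte store per lane
    __syncwarp();
    out[threadIdx.x] = smem[threadIdx.x];   // copy back so the store is not optimized away
}
```

Launched as byteWriteTest<<<1, 32>>>(d_out), this gives exactly one warp-wide byte-store instruction to profile.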
Remember (except for atomics) the actual transaction size is always either 32 bytes or 128 bytes. The transactions are coalesced to one of these two sizes, per instruction, across a warp.
You’ve also mentioned the compiler a few times when talking about inter-thread behavior. I suggest this is a non-concept. The compiler works with the programming model of a thread, not across threads. The compiler knows nothing about the fact that there are other adjacent threads (or not).
In your example, I don’t agree with this claim:
“Within a warp, only 8 threads will actually do the write”
(You could easily test this, right?)
Yes, I tested and it does coalesce. What I can’t say for sure is whether the compiler detected the sequentiality between threads and generated a specific STS instruction, or the HW did the coalescing.
The compiler knows nothing about the fact that there are other adjacent threads (or not).
Well, threads in CUDA are just SIMD lanes, 32 wide. The compiler will definitely look for optimizations that can be done across lanes; uniform loads are the best example.
The compiler has a single-thread view of the world, because at compile time, nothing can be known about the run-time configuration (blocks, threads) used to execute the code.
For example, the compiler will track data dependencies strictly based on matching up addresses, and knows nothing about implicit inter-thread data dependencies as they frequently occur in reductions. In the absence of explicitly expressed data or control dependencies, the compiler is free to re-arrange operations that have been established as independent based on code analysis.
In general, processing data one byte at a time is inefficient because it drives up the dynamic instruction count. That is true for any processor using registers that are more than one byte wide. SIMD-in-a-register techniques can often be used to speed-up byte-wise processing.
Right, the compiler doesn’t know the run-time block/thread config. However, it does know there are 32 threads within a warp and can do optimizations. What I’m talking about is optimizations within a warp (SIMD32 lanes), not across warps.
Yes, one byte at a time is inefficient. Hence they could be doing compiler coalescing within a warp, or HW merging.
The compiler already knows how to vectorize loads and stores. However, since loads and stores have to be naturally aligned on GPUs (this is, for the most part, different from x86), it can only do so if it can prove proper alignment of the wider access. That is not always possible. Even with vectorized loads and stores you would still be looking at extra instructions to handle data one byte at a time.
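As a sketch of what such a vectorized access looks like when alignment can be guaranteed (hypothetical kernel name; assumes the byte count is a multiple of 4 and both buffers are 4-byte aligned):

```cuda
// Hypothetical example: moving bytes four at a time via char4. This is
// only legal if src and dst are 4-byte aligned; a misaligned char4
// access is undefined on the GPU, which is why the compiler must prove
// alignment before vectorizing on its own.
__global__ void copyVec4(char4 *dst, const char4 *src, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        dst[i] = src[i];  // one 32-bit load and one 32-bit store per 4 bytes
}
```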
I have worked on C-string libraries for x86 and SPARC, and that code operates on the data 4 bytes or 8 bytes at a time as much as possible. For image processing and gene sequencing, CUDA’s 32-bit byte-wise SIMD intrinsics often work well and are a perfect match for the 32-bit registers.
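For example, a hedged sketch using one of those intrinsics (hypothetical kernel name; __vadd4 performs four independent byte-wise additions on a packed 32-bit word):

```cuda
// Hypothetical example: per-byte addition on four packed bytes at once.
// __vadd4 treats each unsigned int as four byte lanes and adds them
// independently (modulo 256 per lane), matching the 32-bit registers.
__global__ void addPackedBytes(unsigned int *dst, const unsigned int *a,
                               const unsigned int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = __vadd4(a[i], b[i]);  // 4 byte-wise adds per instruction
}
```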
The magical do-what-I-mean optimization in compilers remains elusive. If you really care for performance, at some point you will have to write code accordingly.
I made the compiler generate an STS.U8 (1-byte store per lane) instruction. This results in 1 SLM transaction, which confirms the HW is coalescing the bytes into 32-bit chunks.
If the HW were not coalescing, an STS.U8 instruction would have resulted in 8 SLM transactions.