Based on the shared memory hardware architecture, there are 32 banks, and each bank has a throughput of 4 bytes per cycle. This means the maximum shared memory throughput per cycle is 128 bytes.
Assuming my data type is cuDoubleComplex (16 bytes), this implies that at most 128 / 16 = 8 elements can be accessed per cycle. Accessing more than this number will inevitably lead to bank conflicts.
My question is: suppose I need to move 20 cuDoubleComplex elements from global memory to shared memory at once. There are two possible approaches:
Use 20 threads directly to move the data, which will cause bank conflicts.
Use only 8 threads and complete the transfer in 3 iterations.
Although the first method causes bank conflicts, it essentially just results in the instructions issued in parallel being serialized over multiple cycles. I’m curious whether, in practice, it is worthwhile to switch to the second method.
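To make the two variants concrete, here is roughly what I have in mind (just a sketch; smem is a __shared__ cuDoubleComplex array, gmem points to global memory, and base is some per-block offset, all placeholder names):

// Variant 1: 20 threads, one 16-byte element each (my bank-conflict worry)
if (threadIdx.x < 20)
    smem[threadIdx.x] = gmem[base + threadIdx.x];

// Variant 2: only 8 threads, finishing in ceil(20/8) = 3 iterations
if (threadIdx.x < 8)
    for (int i = threadIdx.x; i < 20; i += 8)
        smem[i] = gmem[base + i];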
I’m currently conducting experiments, but I’d love to hear insights from experts on this matter.
I think the idea is adjacent/contiguous, which is generally a good pattern for global access (promotes “coalescing”) and also generally a good pattern for shared access (tends to minimize bank conflicts).
From what I have seen, people often refer to such a pattern using the word “coalesced”. The English-language word “coalesce” carries a notion of grouping, or collecting together, which is a relevant concept here, even if CUDA tends to attach a specific definition to it.
Perhaps something like this:
“adjacent/contiguous 16-byte loads (considered across threads in the warp) will be split into multiple conflict-free 128-byte transactions.”
I see what you mean.
From what I understand, the term “coalesced” is typically used in the context of global memory access, such as DRAM burst transactions, and as far as I know, shared memory doesn’t support coalescing in the same sense.
That said, I was wondering—
If I use 20 threads to copy coalesced cuDoubleComplex data from global memory into shared memory, I’m not entirely sure whether this would lead to bank conflicts on the shared memory side.
Is it generally possible to design an access pattern in shared memory that avoids bank conflicts in this kind of scenario?
————————
On a side note, before diving into coding, I usually try to think through issues like throughput, thread count, and whether bank conflicts might occur.
However, during the profiling stage, I often observe unexpected behaviors — for example, I might anticipate bank conflicts, but they don’t actually happen in practice.
Do you have any suggestions on this?
For instance, should I consider disabling certain compiler optimizations (e.g., nvcc -O3), or is it simply unrealistic to try to predict everything before writing code?
I agree with your viewpoint. However, I have seen other usages (and have probably not always been perfectly consistent myself). When communicating, it is good to be reasonably specific, but it is also good to put effort into interpreting others’ statements.
It won’t.
For a 16-byte-per-thread load from shared (LDS) or store to shared (STS), when each thread is accessing data that is adjacent and contiguous, the GPU will not execute that warp-wide request in a single transaction. Notionally, considering all 32 threads in the warp, the way we can think about it is that the first 8 threads’ needs get grouped into a single transaction, the next 8 threads’ needs get grouped into a single transaction, and so on for the 3rd and 4th grouping of 8 threads.
Given that preamble, shared memory bank conflicts are assessed per transaction, not per request or any other grouping.
Given that preamble, loading 16-byte adjacent/contiguous quantities across only 8 or fewer threads will take place in a single transaction. 9-16 threads will require 2 transactions. 17-24 threads will require 3 transactions. 25 up to the full 32 threads will require 4 transactions.
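If a formula helps, for this adjacent/contiguous 16-byte-per-thread pattern the bookkeeping is simply:

transactions = ceil(active_threads × 16 bytes / 128 bytes)

which gives 1 for 8 threads, 3 for 20 threads, and 4 for a full warp of 32.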
So if you use 20 threads loading or storing shared data for example according to a pattern like this:
if (threadIdx.x < 20)
    shared_data[threadIdx.x] = global_data[global_index]; // the shared data from thread 0 is adjacent/contiguous with the data from thread 1, and so on
and the fundamental element type is a 16-byte type, then the first 8 threads will be grouped into a transaction, the next 8 threads will be grouped into a transaction, and the remaining 4 threads will occupy a 3rd transaction.
If you now study the access pattern implied for any one of those 3 transactions, you will find that none of them produce a bank conflicted pattern, when the access pattern is overlaid with the bank pattern.
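To make that concrete, here is a minimal self-contained sketch of the pattern under discussion (the kernel name, the 20-elements-per-block layout, and the copy back out are just illustrative assumptions, not anything from your actual code):

#include <cuComplex.h>

// each block copies 20 cuDoubleComplex elements from global to shared memory,
// using 20 of the 32 threads in its single warp
__global__ void copy20(const cuDoubleComplex *global_data, cuDoubleComplex *out)
{
    __shared__ cuDoubleComplex shared_data[20];
    int global_index = blockIdx.x * 20 + threadIdx.x;

    if (threadIdx.x < 20)
        shared_data[threadIdx.x] = global_data[global_index]; // STS: 3 transactions, no bank conflicts
    __syncthreads();

    // normally you would compute with shared_data here; we just copy it back out
    if (threadIdx.x < 20)
        out[global_index] = shared_data[threadIdx.x];
}

// launched e.g. as copy20<<<num_blocks, 32>>>(d_in, d_out);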
I think that is a good mentality/methodology. There are a few “top-level” concepts for good CUDA code performance that I believe should be part of the CUDA programmer’s thought patterns when designing code. Roughly speaking, those basic concepts (there are really just 2) are covered in units 1-4 of this online training series. Those two basic concepts that every CUDA programmer should address from a design standpoint are:
Expose enough parallelism for the GPU (roughly: launch kernels with a large number of threads)
Make efficient use of the memory subsystems (primarily global and shared); a trivial sketch of both points follows below.
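As a trivial illustration of both points (nothing specific to your cuDoubleComplex case; the kernel and names are just placeholders), a grid-stride loop addresses both at once:

// launch with many blocks/threads (exposes parallelism); adjacent threads touch
// adjacent elements, so the global memory accesses coalesce
__global__ void scale(double *data, double alpha, size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= alpha;
}
// e.g. scale<<<1024, 256>>>(d_data, 2.0, n);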
Other optimization questions/concepts (example: should I aggressively try to convert these particular double calculations to float, even if it requires some effort and analysis) are topics that I would generally advise folks to leave until the profiler has indicated that those are indeed performance limiters to the code.
All by itself, I might perhaps not give it that much attention. First of all, there is (IMO) a general hyper-focus on bank conflicts. Sometimes they are difficult to avoid, or may not be an actual performance issue. Use judgment, taking into account what else the profiler is telling you, besides just the existence or not of bank conflicts in a particular section of code. Please don’t misunderstand; your question about whether or not a piece of code should have bank conflicts is useful/good. This is how we learn, to some degree. But just because I spend time answering it doesn’t mean that I think you should treat bank conflicts as if they were an impending global pandemic.
It is absolutely unrealistic to try to predict everything before writing code. I do suggest programmers generally try to address the two top-level CUDA optimization priorities that I already mentioned when writing code, but even there a thoughtful programmer may realize that a particular section of code is either intractable or not likely to be a performance issue anyway, and so ignore the basic tenets there.
nvcc -O3 is not a device code optimization directive. It is an instruction to the host code compiler. The thing that would impact the optimization level of device code is -Xptxas=-O3 or similar. I would basically suggest that nobody ever use a switch like that, and sleep very comfortably at night. The only time I can imagine using -Xptxas=-O0 (for example) is as a diagnostic, when I already have a bug that I am chasing, and I want to try to see if the bug is affected by some aspect of the toolchain. Of course there are exceptions to every rule. I can’t at the moment think of an exception where I would request that the compiler not fully optimize device code when doing performance analysis.
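For reference, the distinction looks like this on a typical compile line (foo.cu is just a placeholder):

nvcc -O3 foo.cu -o foo           # -O3 here goes to the host compiler only
nvcc -Xptxas=-O3 foo.cu -o foo   # passes -O3 to ptxas (device code); this is already the default
nvcc -Xptxas=-O0 foo.cu -o foo   # mostly unoptimized device code; useful only as a diagnostic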
Naturally, when we get to this level of discussion, a lot of what I am expressing is my opinion. Others may have different viewpoints as to what items should be addressed a-priori from a design perspective, vs. what items can/should be left to the point of performance analysis/profiling.
Thank you for sharing your insightful experience and thoughtful response.
Your replies are always worth reading multiple times and reflecting on deeply. :)
If you transfer cuDoubleComplex with a type size of 16 bytes, you need a multiple (due to possible bank conflicts) of 4 cycles (due to the element type size), so a minimum of 1×4 = 4 cycles.
As you calculated, the optimal minimum due to shared memory bandwidth is 3 cycles (20 × 16 B = 320 B; 320 B / 128 B per cycle = 2.5, rounded up to 3).
This can only be achieved with 4-byte accesses. You have to split each cuDoubleComplex into 4 unsigned int values. So you get 80 unsigned int values, and you can transfer 32 of those per cycle.
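A rough sketch of what I mean, assuming one full warp (32 threads) does the copy and the names are placeholders (the 16-byte alignment of cuDoubleComplex makes the reinterpret cast safe):

// 20 cuDoubleComplex = 80 unsigned int; 32 lanes doing 4-byte accesses, no bank conflicts
const unsigned int *src = reinterpret_cast<const unsigned int *>(global_data);
unsigned int       *dst = reinterpret_cast<unsigned int *>(shared_data);
for (int i = threadIdx.x; i < 80; i += 32)   // 3 iterations: 32 + 32 + 16 accesses
    dst[i] = src[i];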
Now your 2 variants compared: if you use fewer than 32 threads, the overall bandwidth is lowered! So 8 threads (lanes) each doing a 16-byte access (without bank conflicts) need 4 cycles, not 1 cycle!
Using fewer threads only makes it faster if you avoid bank conflicts by doing so. And even then, the bank conflicts perhaps could have been avoided by more clever indexing instead.
For even further optimization (2.4 vs. 3 cycles):
Can you combine 2 or 5 times as many transfers of 20 cuDoubleComplex (40 or 100 elements)? Then you can further reduce the average cycle count down from 3 to 2.5 or to 2.4.
As an extension to Robert’s post/approach, it is good to have a general mental model of which operations are slowest or most contested. In rough order, from what should be optimized first to last: host transfers over PCIe, global memory accesses, shared memory accesses, arithmetic computations.
The presented results could change slightly if you care not about thread cycles, but about the number of 32-byte sectors transferred (as is important for L2 or global memory bandwidth).
Then 8 threads reading 16 bytes each could indeed need between 4 and 8 32-byte sectors, depending on whether they are coalesced in a pairwise fashion.
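(As a worked check: 8 × 16 B = 128 B. If the eight 16-byte requests pair up within aligned 32-byte sectors, that is 128 / 32 = 4 sectors; if each request lands in a different 32-byte sector, it is 8 sectors.)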
But in that case, you could also have a reduced need for transfers from L2, if you can serve the accesses from the L1 cache.
So your question would no longer be about single isolated instructions.