I have a few questions regarding shared memory bank conflicts:
In the older architectures, shared memory had 16 banks. The latest architectures have 32 banks. Does that mean that code written for the older versions needs to be changed to take the new number of banks into account?
How does this interact with the fact that, in the older versions, the 32 threads of a warp accessed shared memory at 16-thread (half-warp) granularity? In the latest architectures, with 32 banks, do all 32 threads of a warp access the 32-bank memory simultaneously?
How much impact do shared memory bank conflicts have on the performance of real applications?
Has there been any work on quantifying the impact of bank conflicts on performance across several real-world applications?
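To make the first two questions concrete, here is a minimal Python model (not CUDA code) of how many threads land on the same bank for a strided access. It assumes 32-bit words, that successive words map to successive banks, that 1.x devices resolve conflicts per half-warp of 16 threads over 16 banks, and that the newer devices resolve them per full warp of 32 threads over 32 banks. The name `conflict_degree` is made up for illustration.

```python
from collections import Counter

def conflict_degree(stride, banks, threads):
    """Largest number of threads that hit the same bank when
    thread i reads the 32-bit word at index i * stride.
    Broadcast of identical addresses is not modelled (stride >= 1)."""
    hits = Counter((i * stride) % banks for i in range(threads))
    return max(hits.values())

for stride in (1, 2, 16, 17, 32, 33):
    old = conflict_degree(stride, banks=16, threads=16)  # assumed 1.x model
    new = conflict_degree(stride, banks=32, threads=32)  # assumed newer model
    print(f"stride {stride:2d}: 16 banks {old:2d}-way, 32 banks {new:2d}-way")
```

Under these assumptions an odd stride stays conflict-free with either bank count, while an even stride is never better with 32 banks than with 16, so a 1D layout that was padded to an odd stride should not need to change.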
From my experience, it depends…
For instance, in the transposition example from Mark Harris, you still need to have a 17*16 memory block.
However, a 16*2 block that would have been transformed to a 17*2 block on a 1.x device doesn’t need to be changed anymore.
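As a rough check of the padding argument for a 16x16 tile, the sketch below counts the worst-case number of threads per bank when thread (tx, ty) reads `tile[tx][ty]`, the column-wise read of a transpose. It is a Python model rather than the SDK kernel, and it assumes a 16x16 thread block, 32-bit elements, linear thread order with tx varying fastest, conflicts resolved per 16 threads over 16 banks on 1.x, and per 32 threads over 32 banks on newer devices. `tile_conflicts` and `row_pitch` are hypothetical names.

```python
from collections import Counter

def tile_conflicts(row_pitch, banks, group, block_x=16, block_y=16):
    """Worst-case threads per bank when thread (tx, ty) reads the word
    tx * row_pitch + ty, evaluated over each group of `group`
    consecutive threads of the block."""
    # Linear thread order inside a block: tx varies fastest.
    threads = [(tx, ty) for ty in range(block_y) for tx in range(block_x)]
    worst = 0
    for start in range(0, len(threads), group):
        hits = Counter((tx * row_pitch + ty) % banks
                       for tx, ty in threads[start:start + group])
        worst = max(worst, max(hits.values()))
    return worst

for name, banks, group in (("16 banks", 16, 16), ("32 banks", 32, 32)):
    for pitch in (16, 17):
        degree = tile_conflicts(pitch, banks, group)
        print(f"{name}, row pitch {pitch}: {degree}-way")
```

In this model the unpadded tile is 16-way conflicted with 16 banks and 8-way with 32 banks, and a row pitch of 17 brings that down to 1-way and 2-way respectively, which is consistent with the padding still being worthwhile for this tile shape.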
Why do we not need to pad by 1? I am not sure if I understand the reason. (is it because of the cache?)
A multiprocessor has 16 memory access units. Hence, I suppose that the accesses still use 16-thread granularity.
In my experience, very few real applications still need to use shared memory explicitly, as the cache does the job very well.
I now only use shared memory for synchronised communications among the threads.
The cache does reduce the burden on the programmer. But can it really give as good a performance as manually managing the shared
memory? If the access patterns are regular then I guess you get equivalent performance but for irregular accesses which caused
bank conflicts, the cache does not buy you anything. I guess the cache is also banked similarly and hence also suffers from bank
conflicts. What is your opinion on this?
I haven't seen any paper of this kind recently… (In fact, not since 1.0 devices)
I do not know the cache algorithm… I suppose that there are some kinds of bank conflicts, but I think that they are not predictable. However, I suppose that, as on a CPU, there may be conflicts between global memory locations.
Mainly the transpose SDK example.
It was about 10% for the kernel.
Usually, in a real application, this kind of kernel does not consume much time.