Shared Memory Bank Conflicts

floopydrive · February 23, 2012, 8:05pm

Hi Everyone,

I have a few questions regarding shared memory bank conflicts:

In the older architectures, shared memory had 16 banks. The latest architectures have 32 banks. Does that mean that code written for the older versions needs to be changed to take the new number of banks into account?
How does this interact with the fact that on the older architectures the 32 threads of a warp accessed shared memory at 16-thread granularity, whereas on the latest architectures there are 32 banks and all 32 threads of a warp access the 32-bank memory simultaneously?
How much impact do shared memory bank conflicts have on the performance of real applications?
Has there been any work on quantifying the impact of bank conflicts on the performance of several real-world applications?

thanks
Manish

Wilfried_K · February 24, 2012, 9:43am

From my experience, it depends…

For instance, in the transposition example from Mark Harris, you still need to have a 17*16 memory block.

However, a 16*2 block that would have been transformed into a 17*2 block on a 1.x device doesn’t need to be changed anymore.

A multiprocessor has 16 memory access units. Hence, I suppose that accesses still use 16-thread granularity.

In my experience, very few real applications still need to use shared memory explicitly, as the cache does the job very well.

I now only use shared memory for synchronised communication among threads.

I haven’t seen any paper of this kind recently… (In fact, not since 1.0 devices)

Wilfried K.

floopydrive · February 24, 2012, 10:12am

From my experience, it depends…
For instance, in the transposition example from Mark Harris, you still need to have a 17*16 memory block.
However, a 16*2 block that would have been transformed into a 17*2 block on a 1.x device doesn’t need to be changed anymore.

Why do we not need to pad by 1? I am not sure I understand the reason. (Is it because of the cache?)

A multiprocessor has 16 memory access units. Hence, I suppose that accesses still use 16-thread granularity.

Okay.

In my experience, very few real applications still need to use shared memory explicitly, as the cache does the job very well.
I now only use shared memory for synchronised communication among threads.

The cache does reduce the burden on the programmer. But can it really give performance as good as manually managing the shared
memory? If the access patterns are regular then I guess you get equivalent performance, but for irregular accesses that cause
bank conflicts, the cache does not buy you anything. I guess the cache is also banked in a similar way and hence also suffers from bank
conflicts. What is your opinion on this?

I haven’t seen any paper of this kind recently… (In fact, not since 1.0 devices)

Can you point me to any old work on this?

thanks a lot!

Wilfried_K · February 24, 2012, 10:58am

To avoid the bank conflicts.

The documentation of the Transpose SDK example explains all of this.

You can refer to older SDK versions for more precise explanations for old devices.

You can also see http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf (the final explanation for shared memory bank conflicts is on the last page)

I do not know the cache algorithm… I suppose that there are some kind of bank conflicts, but I think that they are not predictable. However, I suppose that, as on a CPU, there may be conflicts between global memory locations.

Mainly the transpose SDK example.

It was about 10% for the kernel.

Usually, in a real application, this kind of kernel does not consume much time.

For an example in neutron transport physics: http://www.metz.supelec.fr/metz/recherche/publis_pdf/Supelec505.pdf
