float4 Shared memory doesn't yield bank conflict according to nvprof when it should

hmwulfi · January 27, 2017, 6:19pm

Hello,

as the title says I’ve run into this problem.

I am implementing an n-body simulation.

Here’s the relevant code:
https://gist.github.com/wulfihm/47660970572343d9ceb60e812ff0455b
blockDim.x is 128, as is THREADS_PER_BLOCK.

(If you want to see the rest, here: GitHub - hwulfmeyer/N-Body-Simulation: small project for a n-body simulation for cuda & single threaded
But shouldn’t be necessary.)

The code in line 13-16 should result in bank conflicts but when I am profiling with nvprof “nvprof --events shared_st_bank_conflict ./nbodysim.exe” it tells me that I don’t have any.
You think I should be happy about that but it’s driving me mad because on my laptop it tells me I have bank conflicts.

My desktop pc uses a GTX 960 (CC 5.2), my laptop NVS4200m (CC 2.1). Both with CUDA 8.0.

Why do I think there are bank conflicts?
Float4 is 16bytes, so one float4 should reside in the first 4 banks (1 bank has 4byte).
So smPos[i].x, smPos[i+8].x, smPos[i+16].x, … and so on should all reside in the same bank. For 32 warps that makes a 8(I think) way bank conflict. I am accessing x,y,z,w, that makes 4*8 way bank conflicts per 32 warps per thread.

So I thought, maybe nvprof is erroneously reporting 0 bank conflicts and I tried to implement code that wouldn’t produce bank conflicts.

struct float4pad{
   float x,y,z,w,pad;
};

That should fix it when I use this for my shared memory variable. And yes, nvprof still tells me 0 bank conflicts as expected. The speed also roughly stays the same, so nvprof is not faulty.

Funnily enough when I use a struct with float4 like so:

struct float4pad{
   float4 body;
   float pad;
};

I get bank conflicts.

So it seems float4 has a different behaviour regarding memory than normal floats. One explanation would be that float4’s are all saved into the same bank and thus we don’t have bank conflicts since we have linear accessing.

So what is happening with float4 on the cuda device? Strangely I couldn’t find an answer anywhere. Since I can observe a different behavior regarding float4 on my laptop with and older card I think it has to do with the compute capability and a feature for float4 probably.

Thanks in advance!

Robert_Crovella · February 2, 2017, 4:07am

you may wish to start by reading the relevant section of the programming guide

128 bit access per thread on a cc2.x device is specifically called out as having at least 2-way bank conflict. For higher devices, the only description given is that the number of bank conflicts should be the same or less.

The reason it can be less is that for 64-bit or 128-bit access per thread, the bank conflict calculation degrades to considerations of what is going on at the half-warp level or quarter-warp level, as covered in the documentation.

I’m not suggesting a complete functional spec for shared access behavior is given in the documentation, but there is enough there to show that your observations are reasonable.

And with respect to the varying structure organizations you have given, it’s probably necessary to analyze the SASS to determine how the compiler has converted (or not) the various flavors to underlying 64- or 128-bit transactions at the thread level.

fschutze0 · January 13, 2024, 11:30am

Thanks for the info! Where’s that:

128 bit access per thread on a cc2.x device is specifically called out as having at least 2-way bank conflict. For higher devices, the only description given is that the number of bank conflicts should be the same or less.

Exactly documented? I couldn’t find it going through the programming manual.

rs277 · January 13, 2024, 6:03pm

Current documentation seems to only now go back as far as CC5.0.

The information you quote is contained in the Programming Guide for Cuda 8:

" 128-Bit Accesses: The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two threads in a quarter-warp access different addresses belonging to the same bank. Therefore, to determine the ways of bank conflicts, one must add 1 to the maximum number of threads in a quarter-warp that access different addresses belonging to the same bank."

fschutze0 · January 13, 2024, 6:15pm

Thanks for the info, @rs277 .

Topic		Replies	Views
Bank Conflicts CUDA Programming and Performance	2	1962	December 6, 2009
How to understand the bank conflict of shared_mem CUDA Programming and Performance	12	10028	January 16, 2025
Shared memory bank conflict CUDA Programming and Performance	1	285	May 19, 2024
dont understand bank conflicts for shared mem CUDA Programming and Performance	7	2630	March 31, 2010
shared memory bank conflicts cc 2.0 CUDA Programming and Performance	3	893	December 29, 2011
Cannot achieve max shared memory bandwith CUDA Programming and Performance	12	816	November 20, 2023
128-bit access bank conflict CUDA Programming and Performance	11	957	March 29, 2024
Non-ideal shared memory transactions? CUDA Programming and Performance	7	1373	February 7, 2015
bank conflicts...? CUDA Programming and Performance	2	3321	January 30, 2008
confusion about 64 bit shared memory access CUDA Programming and Performance	1	1251	May 10, 2012

float4 Shared memory doesn't yield bank conflict according to nvprof when it should

Related topics