float4 Shared memory doesn't yield bank conflict according to nvprof when it should


as the title says I’ve run into this problem.

I am implementing an n-body simulation.

Here’s the relevant code:
blockDim.x is 128, as is THREADS_PER_BLOCK.

(If you want to see the rest, here: https://github.com/wulfihm/N-Body-Simulation
But shouldn’t be necessary.)

The code in line 13-16 should result in bank conflicts but when I am profiling with nvprof “nvprof --events shared_st_bank_conflict ./nbodysim.exe” it tells me that I don’t have any.
You think I should be happy about that but it’s driving me mad because on my laptop it tells me I have bank conflicts.

My desktop pc uses a GTX 960 (CC 5.2), my laptop NVS4200m (CC 2.1). Both with CUDA 8.0.

Why do I think there are bank conflicts?
Float4 is 16bytes, so one float4 should reside in the first 4 banks (1 bank has 4byte).
So smPos[i].x, smPos[i+8].x, smPos[i+16].x, … and so on should all reside in the same bank. For 32 warps that makes a 8(I think) way bank conflict. I am accessing x,y,z,w, that makes 4*8 way bank conflicts per 32 warps per thread.

So I thought, maybe nvprof is erroneously reporting 0 bank conflicts and I tried to implement code that wouldn’t produce bank conflicts.

struct float4pad{
   float x,y,z,w,pad;

That should fix it when I use this for my shared memory variable. And yes, nvprof still tells me 0 bank conflicts as expected. The speed also roughly stays the same, so nvprof is not faulty.

Funnily enough when I use a struct with float4 like so:

struct float4pad{
   float4 body;
   float pad;

I get bank conflicts.

So it seems float4 has a different behaviour regarding memory than normal floats. One explanation would be that float4’s are all saved into the same bank and thus we don’t have bank conflicts since we have linear accessing.

So what is happening with float4 on the cuda device? Strangely I couldn’t find an answer anywhere. Since I can observe a different behavior regarding float4 on my laptop with and older card I think it has to do with the compute capability and a feature for float4 probably.

Thanks in advance!

you may wish to start by reading the relevant section of the programming guide

128 bit access per thread on a cc2.x device is specifically called out as having at least 2-way bank conflict. For higher devices, the only description given is that the number of bank conflicts should be the same or less.

The reason it can be less is that for 64-bit or 128-bit access per thread, the bank conflict calculation degrades to considerations of what is going on at the half-warp level or quarter-warp level, as covered in the documentation.

I’m not suggesting a complete functional spec for shared access behavior is given in the documentation, but there is enough there to show that your observations are reasonable.

And with respect to the varying structure organizations you have given, it’s probably necessary to analyze the SASS to determine how the compiler has converted (or not) the various flavors to underlying 64- or 128-bit transactions at the thread level.