Hello,
as the title says I’ve run into this problem.
I am implementing an n-body simulation.
Here’s the relevant code:
https://gist.github.com/wulfihm/47660970572343d9ceb60e812ff0455b
blockDim.x is 128, as is THREADS_PER_BLOCK.
(If you want to see the rest, here: GitHub - hwulfmeyer/N-Body-Simulation: small project for a n-body simulation for cuda & single threaded
But shouldn’t be necessary.)
The code in line 13-16 should result in bank conflicts but when I am profiling with nvprof “nvprof --events shared_st_bank_conflict ./nbodysim.exe” it tells me that I don’t have any.
You think I should be happy about that but it’s driving me mad because on my laptop it tells me I have bank conflicts.
My desktop pc uses a GTX 960 (CC 5.2), my laptop NVS4200m (CC 2.1). Both with CUDA 8.0.
Why do I think there are bank conflicts?
Float4 is 16bytes, so one float4 should reside in the first 4 banks (1 bank has 4byte).
So smPos[i].x, smPos[i+8].x, smPos[i+16].x, … and so on should all reside in the same bank. For 32 warps that makes a 8(I think) way bank conflict. I am accessing x,y,z,w, that makes 4*8 way bank conflicts per 32 warps per thread.
So I thought, maybe nvprof is erroneously reporting 0 bank conflicts and I tried to implement code that wouldn’t produce bank conflicts.
struct float4pad{
float x,y,z,w,pad;
};
That should fix it when I use this for my shared memory variable. And yes, nvprof still tells me 0 bank conflicts as expected. The speed also roughly stays the same, so nvprof is not faulty.
Funnily enough when I use a struct with float4 like so:
struct float4pad{
float4 body;
float pad;
};
I get bank conflicts.
So it seems float4 has a different behaviour regarding memory than normal floats. One explanation would be that float4’s are all saved into the same bank and thus we don’t have bank conflicts since we have linear accessing.
So what is happening with float4 on the cuda device? Strangely I couldn’t find an answer anywhere. Since I can observe a different behavior regarding float4 on my laptop with and older card I think it has to do with the compute capability and a feature for float4 probably.
Thanks in advance!