I have that code: (I can’t add full code for some reason… limitation for code size?)
struct __declspec(align(32)) Circle
{
float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;
void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
const float _radius = CIRCLE_RADIUS_DEFAULT,
const float _mass = CIRCLE_MASS_DEFAULT);
};
/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
__syncthreads();
/*float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;*/
/*c.x = smem[threadIdx.x];
c.y = smem[threadIdx.x + blockDim.x]; //there must be [threadId.x * 8 + 0]
c.prevX = smem[threadIdx.x + blockDim.x * 2]; //[threadId.x * 8 + 1] and e.t.c.
c.prevY = smem[threadIdx.x + blockDim.x * 3];
c.speedX = smem[threadIdx.x + blockDim.x * 4];
c.speedY = smem[threadIdx.x + blockDim.x * 5];
c.mass = smem[threadIdx.x + blockDim.x * 6];
c.radius = smem[threadIdx.x + blockDim.x * 7];*/
c = cOut[j];
//c = *((Circle*)(smem + threadIdx * SMEM));
There is 2 gmem (I mean global memory) access:
- Read Circle and detect collisions with it
- Write Circle after changing it’s speed and position
Also I have circlesConst-massive of Circle, which was allocated by cudaMallocToSybol(). It is used to check intersection with its circles of the main circle C (it’s in the register), which was read from gmem.
As I think, I used const memory well and it gains me all its performance :') (Am I wrong?)
When I read about coalesced access to gmem (is there coalesced access to other types of memory? I didn’t find any info about it), I wanted to try it for me.
As you can see, Circle-structure has 8 vars typed float = 32bits. I tried (in code it is commented) to do it, but, firstly, I get a wrong answer (because I must read from smem not correctly, mentioned below), secondly, I get 33% performance less. Why? I think, it doesn’t depend on wrong fields relations.
And the second question, as I wrote in the comment in the code near the reading from smem to C, I must read another way, but If I do so, there will be a lot of banks conflict, so I will get much less performance…
So, how can I load Circles coalasced without bank conflicts and, after that, write it back?