Collision solving (trying to coalesce gmem access using smem, but hitting bank conflicts)

I have this code (I can't post the full code for some reason; is there a size limit?):

struct __declspec(align(32)) Circle
{
	float x, y;
	float prevX, prevY;
	float speedX, speedY;
	float mass;
	float radius;

	void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
		const float _radius = CIRCLE_RADIUS_DEFAULT, 
		const float _mass = CIRCLE_MASS_DEFAULT);
};
// ... (fragment from inside the kernel; the rest of the code is omitted) ...
/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
		smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
		smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
		smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
		smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
		smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
		smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
		smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
		__syncthreads();
		/*float x, y;
		float prevX, prevY;
		float speedX, speedY;
		float mass;
		float radius;*/
		/*c.x = smem[threadIdx.x];
		c.y = smem[threadIdx.x + blockDim.x]; //this should be [threadIdx.x * 8 + 1]
		c.prevX = smem[threadIdx.x + blockDim.x * 2]; //[threadIdx.x * 8 + 2], etc.
		c.prevY = smem[threadIdx.x + blockDim.x * 3];
		c.speedX = smem[threadIdx.x + blockDim.x * 4];
		c.speedY = smem[threadIdx.x + blockDim.x * 5];
		c.mass = smem[threadIdx.x + blockDim.x * 6];
		c.radius = smem[threadIdx.x + blockDim.x * 7];*/
		c = cOut[j];
		//c = *((Circle*)(smem + threadIdx * SMEM));

There are 2 gmem (global memory) accesses:

  1. Read Circle and detect collisions with it
  2. Write Circle after changing its speed and position

I also have circlesConst, an array of Circle in constant memory, which was filled with cudaMemcpyToSymbol(). Each thread checks the main circle c (held in registers after being read from gmem) for intersection against the circles in this array.

I think I am using constant memory well and getting its full performance benefit :’) (Am I wrong?)

When I read about coalesced access to gmem (is there coalesced access for other types of memory? I didn’t find any info about it), I wanted to try it myself.
As you can see, the Circle structure has 8 float fields of 32 bits each (32 bytes in total). I tried to load it coalesced through smem (the commented-out part of the code), but, first, I get a wrong answer (because I read from smem incorrectly, as mentioned below) and, second, performance drops by 33%. Why? I don’t think the slowdown depends on the wrong field mapping.
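To make the wrong-answer part concrete, here is a small host-side sketch in plain C++ (not the kernel itself) that emulates the commented-out copy for one block. It assumes the Circles are 8 packed floats, that the block has at least as many Circles as threads, and it uses a hypothetical makeCircles() helper that fills field k of Circle j with 100 * j + k so the values are easy to recognise. The names stageBlock, readAsPosted and readPerCircle are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Same eight float fields as the Circle in the post (init() omitted).
struct Circle {
    float x, y, prevX, prevY, speedX, speedY, mass, radius;
};
static_assert(sizeof(Circle) == 8 * sizeof(float), "Circle must be 8 packed floats");

constexpr int kFields = 8;  // floats per Circle

// Emulates the cooperative copy for one block: in pass i, "thread" t copies
// flat float t + i * blockDim, so neighbouring threads touch neighbouring floats.
// Assumes circles.size() >= blockDim.
std::vector<float> stageBlock(const std::vector<Circle>& circles, int blockDim) {
    std::vector<float> gmem(static_cast<std::size_t>(blockDim) * kFields);
    std::memcpy(gmem.data(), circles.data(), gmem.size() * sizeof(float));
    std::vector<float> smem(gmem.size());
    for (int i = 0; i < kFields; ++i)
        for (int t = 0; t < blockDim; ++t)
            smem[t + i * blockDim] = gmem[t + i * blockDim];
    return smem;
}

// Read-back as in the commented-out code: picks floats from eight different Circles.
float readAsPosted(const std::vector<float>& smem, int blockDim, int t, int field) {
    return smem[t + field * blockDim];
}

// Read-back that matches the layout: field k of Circle t sits at t * 8 + k.
float readPerCircle(const std::vector<float>& smem, int t, int field) {
    return smem[t * kFields + field];
}

// Hypothetical test data: field k of Circle j holds 100 * j + k.
std::vector<Circle> makeCircles(int n) {
    std::vector<Circle> cs(n);
    for (int j = 0; j < n; ++j) {
        float b = 100.0f * j;
        cs[j] = Circle{b, b + 1, b + 2, b + 3, b + 4, b + 5, b + 6, b + 7};
    }
    return cs;
}
```

With 64 threads, readPerCircle(smem, 5, 3) returns prevY of Circle 5, while readAsPosted(smem, 64, 5, 3) returns speedY of Circle 24. The copy leaves smem as a plain array of Circles, so the read-back has to use threadIdx.x * 8 + field. This only models the indexing; it says nothing about the 33% slowdown.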

My second question: as I wrote in the comments next to the reads from smem into c, I have to index smem differently (threadIdx.x * 8 + field), but if I do that, there will be a lot of bank conflicts, so performance will get even worse…
So, how can I load Circles coalesced without bank conflicts and, after that, write them back?
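One commonly suggested trick for this kind of layout is to pad each Circle in smem to 9 floats instead of 8. The host-side C++ sketch below is only a model, not a kernel: it counts how many threads of one warp land on the same bank, assuming 32 banks of 4-byte words and 32-thread warps (older devices use 16 banks per half-warp, where stride 8 is just as bad and stride 9 is also conflict-free). The function names are made up for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // assumption: 32 shared-memory banks, 4 bytes wide
constexpr int kWarp  = 32;  // assumption: 32 threads per warp

// Largest number of threads in one warp that hit the same bank when
// thread t reads word t * stride + field. 1 means conflict-free.
int readConflictDegree(int stride, int field) {
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kWarp; ++t)
        ++hits[(t * stride + field) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}

// Same measure for the cooperative store into the padded layout, where
// lane j of a warp writes flat float j to word j + j / 8 (one pad word per Circle).
int paddedStoreConflictDegree() {
    std::array<int, kBanks> hits{};
    for (int j = 0; j < kWarp; ++j)
        ++hits[(j + j / 8) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions readConflictDegree(8, field) is 8 (the 32 threads share only 4 banks), while readConflictDegree(9, field) is 1 for every field, because 9 and 32 have no common factor. The price is that the coalesced store into the padded layout is no longer perfectly conflict-free: paddedStoreConflictDegree() comes out as 2. Whether this wins overall against the plain c = cOut[j] load would have to be measured.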