Getting coalesced reads and stores using a struct

I’m trying to optimize my code, and in my head it made sense to use a struct, so that each thread could read in the entire struct, with all of them adjacent to each other in the 1D array. The Visual Profiler, however, shows I have 0 coalesced reads/stores and everything uncoalesced. I can’t wrap my head around why this isn’t working. I have a feeling it has something to do with padding the struct to an even multiple of 32 bits, but by my math 4 floats @ 8 bytes = 32 bytes. I don’t know.

The struct:

typedef struct {
	float2 p;	// Position
	float2 v;	// Velocity
} particle;

The kernel, with most of the code taken out:

__global__ void updateParticles(particle *particles)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	particle p = particles[i];

	// Update particle p's position
	p.p.x += p.v.x;
	p.p.y += p.v.y;
	...

	// Save results
	particles[i] = p;
}

While structs might be good for readability, using separate arrays for the members is better for performance, exactly because of the coalescing problem you are seeing. So if you are free to rearrange the variables, prefer (structs of) arrays to arrays of structs.
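
A minimal sketch of what that could look like here, assuming the positions and velocities are split into two separate float2 arrays (the kernel and parameter names are mine):

__global__ void updateParticlesSoA(float2 *pos, float2 *vel)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	// Thread i reads element i of each array, so consecutive threads
	// touch consecutive float2 values: coalesced.
	float2 p = pos[i];
	float2 v = vel[i];

	p.x += v.x;
	p.y += v.y;

	// Coalesced store back to the position array.
	pos[i] = p;
}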

Btw., floats occupy 4 bytes; 8-byte floating point variables would be doubles. So your particle struct is 4 × 4 = 16 bytes, not 32. ;)
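
A quick host-side check you can compile and run to confirm the sizes (the program is just illustrative):

#include <cstdio>
#include <vector_types.h>	// CUDA header that defines float2

typedef struct {
	float2 p;	// Position
	float2 v;	// Velocity
} particle;

int main()
{
	// floats are 4 bytes each, float2 is 8, so the whole struct is 16 bytes
	printf("float: %zu, float2: %zu, particle: %zu\n",
	       sizeof(float), sizeof(float2), sizeof(particle));	// 4, 8, 16
	return 0;
}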

The other trick is to read the particles as one big raw memory array into a staging buffer in shared memory. That copy is perfectly coalesced no matter what the structure's size or contents are. Then each thread can load its own particle from shared memory. Those per-thread reads might hit bank conflicts, but that’s not as big an inefficiency as bad coalescing. You also need to have the shared memory available for staging, which can be a limitation.
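
A sketch of that staging pattern, keeping your particle struct as-is and assuming (like your original kernel) that the particle count is a multiple of the block size; the kernel name is mine:

__global__ void updateParticlesStaged(particle *particles)
{
	// 4 floats per particle (float2 p + float2 v)
	const int WORDS = sizeof(particle) / sizeof(float);
	extern __shared__ particle sbuf[];	// one particle per thread, sized at launch
	float *buf = (float*)sbuf;

	int base = blockIdx.x * blockDim.x;	// first particle handled by this block
	float *raw = (float*)(particles + base);

	// Cooperative copy in: consecutive threads read consecutive 32-bit
	// words, which is coalesced regardless of the struct layout.
	for (int j = threadIdx.x; j < blockDim.x * WORDS; j += blockDim.x)
		buf[j] = raw[j];
	__syncthreads();

	// Each thread picks its own particle out of shared memory
	// (possibly with bank conflicts, but no coalescing penalty).
	particle p = sbuf[threadIdx.x];
	p.p.x += p.v.x;
	p.p.y += p.v.y;
	sbuf[threadIdx.x] = p;
	__syncthreads();

	// Coalesced copy back out the same way.
	for (int j = threadIdx.x; j < blockDim.x * WORDS; j += blockDim.x)
		raw[j] = buf[j];
}

Launch it with the dynamic shared memory size set to one particle per thread, e.g. updateParticlesStaged<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(particle)>>>(d_particles);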

And of course on Fermi it doesn’t matter too much, since global memory accesses are cached, so the mis-coalescing isn’t as big a penalty anyway.

I switched to using 2 VBOs and went from about 37 fps to about 57. Thanks a ton!