Aligned structures problem

Hi,

I’m having trouble getting coalesced reads/writes when using a structure of many floats. Example code is shown below:

#define NUM_THREADS_PER_BLOCK 256
#define NUM_BLOCKS 84
#define NUM_THREADS 21504

#include <stdio.h>
#include <cutil.h>

typedef struct __align__(16)
{
	float a;
	float b;
	float c;
	float d;
	float e;
	float f;
	float g;
	//float h;
}Structure;

__global__ void Launch_Global(Structure* pd)
{
	int bx=blockIdx.x;
	int tx=threadIdx.x;
	int begin=NUM_THREADS_PER_BLOCK*bx;

	Structure p;
	p.a=0.0f;
	p.b=0.0f;
	p.c=0.0f;
	p.d=0.0f;
	p.e=0.0f;
	p.f=1.0f;
	p.g=0.0f;

	pd[begin+tx]=p;
}

int main(int argc, char* argv[])
{
	cudaError_t cudastat;
	Structure* pd;

	cudaMalloc((void**)&pd,NUM_THREADS*sizeof(Structure));

	dim3 dimBlock(NUM_THREADS_PER_BLOCK);
	dim3 dimGrid(NUM_BLOCKS);

	cudaThreadSynchronize();
	Launch_Global<<<dimGrid,dimBlock>>>(pd);
	cudaThreadSynchronize();

	cudastat=cudaGetLastError();
	printf("\nError code=%i\n",cudastat);
	printf("Error code=%i, %s.\n",cudastat,cudaGetErrorString(cudastat));

	cudaFree(pd);
	return 0;
}

The PTX code translates the global memory write into one st.global.v4.f32, one st.global.v2.f32 and one st.global.f32, and the profiler reports the writes as incoherent. Also, sizeof(Structure) reports that the structure is 32 bytes. If I add one more float to the structure, its size is still 32 bytes and the PTX code contains two st.global.v4.f32 instructions, as it should, but the profiler still reports the writes as incoherent. What have I missed?
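As a sanity check on the numbers above, here is a minimal host-side sketch in plain C++ (not the original kernel) of the size arithmetic. It assumes that the standard alignas(16) behaves like CUDA's __align__(16) for this purpose, and the names Structure7, Structure8, element_offset and consecutive_threads_contiguous are made up for illustration. It shows that seven floats (28 bytes) are padded to 32, so neighbouring threads start their 16-byte stores 32 bytes apart rather than back to back. On compute capability 1.x hardware such as the 8800GT, the coalescing rules in the CUDA Programming Guide expect thread k of a half-warp to access the k-th word of a contiguous segment, so this stride may be why the profiler counts the stores as incoherent, although that is not confirmed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Host-side stand-ins for the device struct. ASSUMPTION: alignas(16) is used
// in place of CUDA's __align__(16) so that this compiles as plain C++.
struct alignas(16) Structure7 { float a, b, c, d, e, f, g; };     // 7 floats
struct alignas(16) Structure8 { float a, b, c, d, e, f, g, h; };  // 8 floats

// Byte offset at which thread `tid` starts writing its element of a T array.
template <typename T>
std::size_t element_offset(std::size_t tid) {
    return tid * sizeof(T);
}

// True if thread 1 starts exactly `word_bytes` after thread 0, i.e. the
// per-thread words of that size sit back to back in memory.
template <typename T>
bool consecutive_threads_contiguous(std::size_t word_bytes) {
    return element_offset<T>(1) - element_offset<T>(0) == word_bytes;
}

// Prints the layout numbers discussed in the post.
inline void print_layout() {
    std::printf("sizeof(Structure7) = %zu\n", sizeof(Structure7));
    std::printf("sizeof(Structure8) = %zu\n", sizeof(Structure8));
    std::printf("stride between threads = %zu bytes\n",
                element_offset<Structure7>(1));
    std::printf("16-byte stores contiguous across threads: %s\n",
                consecutive_threads_contiguous<Structure7>(16) ? "yes" : "no");
}
```

Calling print_layout() from main should report 32 bytes for both variants and a 32-byte stride, matching the sizeof(Structure) result quoted above. By contrast, a plain float array gives a 4-byte stride, which is the back-to-back pattern the coalescing rules describe.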

I’m running the beta 2.0a SDK and the 2.0a Toolkit (32-bit XP) on an 8800GT.

Another interesting thing that may or may not be related: when I run the alignedTypes example program, it fails the LA32, RGB32, RGBA32 and RGBA32_2 tests.

Does anybody have a clue what’s going on?

Brief update.

I’ve updated the SDK, the Toolkit and the Profiler to the latest versions, and the problem remains. However, I now pass all the alignedTypes tests.