Coalesced Memory Access to Structs

I wanted to optimize my application, so I used the profiler to get some information, especially about memory access.

I saw that there were many uncoalesced gld events, so I read the Programming Guide and many other tutorials.

They all said that a memory access is coalesced if

  • each thread accesses 4 B / 8 B / 16 B

  • the memory is aligned, so that the next thread accesses the next memory block (so there is no gap between the blocks)

  • the first thread accesses an address that is a multiple of 64 B

So could anybody tell me what is wrong with my code (or my mind)?

struct testStruct
{
	int x;
	int y;
	ON_CUDA testStruct() {}
	ON_CUDA testStruct(int x, int y = 0, int z = 0) : x(x), y(y) {}
};

__global__ static void testScale(testStruct *ret0) {
	const int bid = blockIdx.x;
	const int tid = threadIdx.x;
	testStruct ts(ret0[bid * blockDim.x + tid]);
	int ret = tid * bid + tid - bid + ts.x;
	for (int i = 1; i < 1; i++) {
		for (int j = i; j < 1; j++) {
			ret += i;
			ret *= j;
			ret %= 99991;
			ret++;
			ret <<= 1;
		}
	}
	testStruct retS(ret, 0);
	ret0[bid * blockDim.x + tid] = retS;
}

profiler says:

gld uncoalesced: 26880

gst uncoalesced: 107520

with testScale<<<105, 256>>>

I think alignment is not a problem, and the data size seems to be okay at 2 x 4 bytes.

I am using 32-bit Windows XP on an 8600M GS with CUDA 2.3.

I think the problem you are having is that the compiler does not generate vector load/store instructions for accesses to structs.

For example, this line:

testStruct ts(ret0 [bid * blockDim.x + tid]);

gets compiled into these instructions:

cvt.s32.u16 	%r1, %ctaid.x;
cvt.u32.u16 	%r2, %ntid.x;
mul.lo.u32 	%r3, %r1, %r2;
cvt.s32.u16 	%r4, %tid.x;
add.u32 	%r5, %r4, %r3;
mul.lo.u32 	%r6, %r5, 8;
ld.param.u32 	%r7, [__cudaparm__Z9testScaleP10testStruct_ret0];
add.u32 	%r8, %r7, %r6;
ld.global.s32 	%r9, [%r8+0];

The final instruction does a 32-bit load at a 64-bit stride. So if the base of the array is at 0x0, thread 0 will load from 0x0-0x4, thread 1 will load from 0x8-0xC, etc. Notice that you are skipping 0x4-0x8, 0xC-0x10, etc. This results in an uncoalesced access. This is annoying from the perspective of a compiler writer, since the C standard specifies that adjacent members of a struct have to be laid out sequentially in memory. There is nothing the compiler can do in this case.

The second case is more interesting. The line:

ret0[bid * blockDim.x + tid] = retS;

gets compiled into:

mul.lo.s32 	%r10, %r4, %r1;
add.s32 	%r11, %r4, %r10;
sub.s32 	%r12, %r11, %r1;
add.s32 	%r13, %r9, %r12;
st.global.s32 	[%r8+0], %r13;
mov.s32 	%r14, 0;
st.global.s32 	[%r8+4], %r14;

Notice that the compiler issues two stores here, with the same access pattern as in the first case. This could have been better optimized by doing a vector store. Further, in the general case, the compiler should be able to tell from the control structure of the program that all threads will store to the structure, which could allow a very aggressive optimization: making all of the threads cooperatively store the structures using purely coalesced stores.

Lacking an omniscient compiler, you can do either of these optimizations yourself, or use a structure of arrays rather than an array of structures.
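
For illustration, a minimal structure-of-arrays sketch of the same kernel might look like this (the kernel name and the split retX/retY arrays are my own, not from the original code). Each thread then touches plain 4-byte words at a 4-byte stride, so both the load and the stores coalesce naturally:

__global__ static void testScaleSoA(int *retX, int *retY) {
	const int bid = blockIdx.x;
	const int tid = threadIdx.x;
	const int idx = bid * blockDim.x + tid;
	int ret = tid * bid + tid - bid + retX[idx];	// coalesced 4-byte load
	// ... same arithmetic as in the original kernel ...
	retX[idx] = ret;	// coalesced 4-byte store
	retY[idx] = 0;	// coalesced 4-byte store
}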

Thanks for your fast reply, Gregory.

I hope I understood you right that the compiler is the problem.

Which compiler is that? Is it nvcc, or is it the gcc/Windows C compiler (I think it should be gcc)?

And do you know of any way to tell the compiler, by flag or within the C++ code, to do a vector store? (Touching the compiled code does not seem like a clean solution in software development.)

How can an 8 B / 16 B store be done (as described in the Programming Guide)?

I even tried casting the struct to char[8] and casting it back after loading it into local memory.

Maybe an SoA is not that bad an idea, but changing code to fit the compiler gives me a bit of a headache, because I know the hardware should be able to handle an AoS in the case of such small structs (and an SoA would be compiled into 2 x N (x 4 B) memory accesses instead of 1 x N (x 8 B)).

Thanks for helping.

The problem is not the compiler but the language specification. As already said: the solution is to not use structs but instead pack the data into several arrays.

This code is produced by nvcc.

No, I am not aware of any flag that will force the compiler to do this for you. It should be possible to rewrite your code (not the assembly, but the actual CUDA code) to do this manually: use int2 as the basic data type in your class, and I think the compiler will issue vector loads. This will only work for structs whose total size is <= 8 bytes.
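
As a rough sketch of what that rewrite could look like (the kernel name and the pointer aliasing here are illustrative only; the follow-up code further down does the same thing through helper functions), the array can be viewed through an int2 pointer so that each thread performs one 8-byte load and one 8-byte store, assuming testStruct really is two ints with 8-byte alignment:

__global__ static void testScaleVec(testStruct *ret0) {
	const int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int2 *vec = (int2 *) ret0;	// alias the 8-byte structs as int2
	int2 v = vec[idx];	// one 8-byte load per thread instead of a strided 4-byte load
	int ret = v.x;	// ... do the real computation here ...
	vec[idx] = make_int2(ret, 0);	// one 8-byte store per thread
}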

This is a problem with how the data is laid out in memory. To get the most performance out of your code, you need multiple threads to cooperate in the memory operations. For code like this, I usually declare a buffer in shared memory and use a fast memory-copy routine that works on plain-old-data to move a number of structs into shared memory. Once this is done, each thread can copy out of shared memory.
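
A minimal sketch of that staging pattern, assuming the 2-int testStruct and the 256-thread blocks from the code above (the kernel name and buffer handling are my own):

__global__ static void testScaleStaged(testStruct *ret0) {
	__shared__ testStruct buf[256];	// one struct per thread
	const int base = blockIdx.x * blockDim.x;	// first struct handled by this block
	// Copy the block's structs as plain ints: 256 structs = 512 ints, so
	// consecutive threads read consecutive 4-byte words (coalesced).
	const int *src = (const int *) (ret0 + base);
	int *dst = (int *) buf;
	for (int i = threadIdx.x; i < 2 * 256; i += blockDim.x)
		dst[i] = src[i];
	__syncthreads();
	testStruct ts = buf[threadIdx.x];	// per-thread access now hits fast shared memory
	// ... compute with ts as in the original kernel, then write back the same way ...
}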

I know; I personally don’t like reorganizing my data just for CUDA applications. I am hoping that future GPUs will add hardware support to address this…

That was the hint :)

But unfortunately it’s a bit complicated just to get a pair/triple/quadruplet of ints. You are right, though: with those types the compiler likes to do (coalesced) vector loads…

struct testStruct {
	int x;
	int y;
	ON_CUDA testStruct() {}
	ON_CUDA testStruct(int x, int y = 0, int z = 0) : x(x), y(y) {}
	ON_CUDA static inline const testStruct &cast(const int2 &from) {
		const int2 *p = &from;
		const testStruct *p2 = (const testStruct *) p;
		return *p2;
	}
	// Return by value: temp1 is a local, so returning a reference to it would dangle.
	ON_CUDA static inline testStruct accessArray(const testStruct *theArray, unsigned index) {
		const int2 *alignedPointer = (const int2 *) theArray;
		int2 temp1 = alignedPointer[index];
		return cast(temp1);
	}
};

__global__ static void testScale(testStruct *ret0) {
	const int bid = blockIdx.x;
	const int tid = threadIdx.x;
	__shared__ testStruct ts[256];
	{
		ts[tid] = testStruct::accessArray(ret0, bid * blockDim.x + tid);
	}
	__syncthreads();
	int ret = tid * bid + tid - bid + ts[tid].x;
	for (int i = 1; i < 1; i++) {
		for (int j = i; j < 1; j++) {
			ret += i;
			ret *= j;
			ret %= 99991;
			ret++;
			ret <<= 1;
		}
	}
	__syncthreads();
	{
		int2 *nicePtr = (int2 *) ret0;
		testStruct retVal(ret, 0);
		int2 temp1 = *((int2 *) (&retVal));
		nicePtr[bid * blockDim.x + tid] = temp1;
	}
}

where ON_CUDA is defined as __device__ __host__

Many thanks!

I’m not good with C++, so I don’t know quite what your code is doing, but the compiler likes to “optimize” reads/writes involving entire structures down to reading/writing only a subset of the fields (breaking coalescing). The solution is to mark your local variable or global memory area as volatile; then the compiler leaves it alone.

Oh yeah, if you do the volatile thing, the compiler will insist that you define an assignment operator or copy constructor or whatever (again, not good with C++) for the type “volatile testStruct”, so you would need to do that also.

Has this issue with the compiler breaking coalescing been reported as a bug to NVIDIA?

They know about it, since Simon Green mentioned it in this thread. I don’t think they consider it a bug.

Thanks for the help, but it does not work for me (still the same result, and the compiler did not ask for a copy constructor or assignment operator).

Maybe it’s a question of which compiler is used.

What CUDA version are you using, and what OS?

2.3 and 32-bit Windows XP, but I don’t think that’s relevant. I should amend my previous statement to “the compiler may insist…” - it only does it sometimes; I have no clue when or why. Probably the way you did it caused it not to be required for some reason.

I looked at your code more carefully, and your fundamental problem is that you did not __align__(8) your structure. Replace “struct testStruct” with “struct __align__(8) testStruct”; this is required for the compiler to issue vector memory operations. Looking at the .ptx, the write then becomes a vector memory operation, but the read doesn’t, because it gets “optimized” down to a single 32-bit read in the way I discussed earlier. Applying the volatile trick there solves that, and then you should be good to go for coalescing.
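
Putting the two suggestions together, the declaration might end up looking something like the sketch below. The volatile assignment operator is only my guess at what the compiler may ask for once the volatile trick is applied; it is not code from this thread:

struct __align__(8) testStruct {
	int x;
	int y;
	ON_CUDA testStruct() {}
	ON_CUDA testStruct(int x, int y = 0) : x(x), y(y) {}
	// Possibly needed if volatile testStruct objects are assigned to.
	ON_CUDA void operator=(const testStruct &rhs) volatile {
		x = rhs.x;
		y = rhs.y;
	}
};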