Alignment requirements

So I have some optimized device-to-device memcpy routines along the lines of this:

template< MemcpyType type >
__device__ void _memcpy(void* _destination, 
	const void* _source, size_t length)
{
	int index;
	int stride;

	char* destination = (char*) _destination;
	char* source = (char*) _source;

	switch( type )
	{
		case MemcpyWarp:
		{
			index = INTRA_WARP_ID();
			stride = WARP_SIZE();
			break;
		}
		case MemcpyCta:
		{
			index = THREAD_ID();
			stride = CTA_DIMENSION();
			break;
		}
		case MemcpyBase:
		{
			index = GLOBAL_ID();
			stride = TOTAL_THREADS();
			break;
		}
	}

	if( (size_t)destination % sizeof(int2) != 0 
		|| (size_t)source % sizeof(int2) != 0 )
	{
		// Unaligned case: fall back to a byte-wise copy
		for( unsigned int i = index; i < length; i += stride )
		{
			destination[i] = source[i];
		}
	}
	else
	{
		int steps = length / sizeof(int2);
		int doubleStride = stride * 2;
		int i;

		// Transfer bulk, two int2 words per thread per iteration
		for( i = 0; i < steps - doubleStride; i += doubleStride )
		{
			int2 tempA = ((int2*)source)[ i + index + stride * 0 ];
			int2 tempB = ((int2*)source)[ i + index + stride * 1 ];
			((int2*)destination)[ i + index + stride * 0 ] = tempA;
			((int2*)destination)[ i + index + stride * 1 ] = tempB;
		}
		// Transfer remainder
		for( ; i < steps; i += stride )
		{
			if( (i + index) < steps )
			{
				((int2*)destination)[ i + index ] = 
					((int2*)source)[ i + index ];
			}
		}

		// Transfer last few bytes
		for( i = length - length % sizeof(int2); i < length; i++ )
		{
			destination[ i ] = source[ i ];
		}
	}
}

I find that this code gives incorrect results if either _source or _destination is not aligned to an 8-byte boundary; the GPU seems to silently align all of these accesses to 8-byte boundaries. The alignment check at the top of the routine, which falls back to the byte-wise copy whenever either pointer is unaligned, is the correction that fixes the problem, but I am wondering why it is necessary.
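
For reference, here is a minimal standalone repro of the effect. This is a sketch rather than my actual test code; the kernel and buffer names are made up, and it simply reads an int2 through a deliberately misaligned pointer:

#include <cstdio>
#include <cuda_runtime.h>

// Load an int2 through a possibly misaligned pointer and copy the result
// back to the host for inspection.
__global__ void misalignedLoad( const char* buffer, int byteOffset, int2* out )
{
	// Undefined behavior when byteOffset % 8 != 0; the GPU described above
	// silently rounds the address down to an 8-byte boundary.
	*out = *(const int2*)( buffer + byteOffset );
}

int main()
{
	char host[16];
	for( int i = 0; i < 16; ++i ) host[i] = (char) i;

	char* buffer = 0;
	int2* out = 0;
	cudaMalloc( &buffer, sizeof(host) );
	cudaMalloc( &out, sizeof(int2) );
	cudaMemcpy( buffer, host, sizeof(host), cudaMemcpyHostToDevice );

	misalignedLoad<<< 1, 1 >>>( buffer, 4, out ); // offset 4 is not 8-byte aligned
	int2 result;
	cudaMemcpy( &result, out, sizeof(int2), cudaMemcpyDeviceToHost );

	// On hardware that masks low-order address bits, this prints the bytes
	// at offset 0 rather than the bytes at offset 4.
	std::printf( "{%08x, %08x}\n", result.x, result.y );

	cudaFree( buffer );
	cudaFree( out );
	return 0;
}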

Any ideas? I would rather the GPU throw an error here than silently load or store the wrong data…

Interesting. This seems like reasonable behavior (the alternative would be to use the low-order bits of the address to swizzle the individual bytes of the 8-byte word).

Usually it is up to the compiler to make sure that memory accesses are always aligned (even if it sometimes seems to err on the side of caution, as in this thread: http://forums.nvidia.com/index.php?showtopic=102678).

If you take a look at cuda/include/vector_types.h, you’ll see that int2 is defined as:

/*DEVICE_BUILTIN*/
struct __builtin_align__(8) int2
{
  int x, y;
};

This tells the compiler that int2 objects are always aligned on 8-byte boundaries.

So when you cast from a char* pointer to an int2* pointer, you are essentially lying to the compiler.
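
To make that concrete, here is a small illustrative sketch (it uses C++11 alignof, which postdates this thread, and the function name is made up):

// The type itself carries an 8-byte alignment guarantee.
static_assert( alignof(int2) == 8, "int2 is declared 8-byte aligned" );

__device__ int2 promiseAlignment( char* p )
{
	// Casting an arbitrary char* to int2* asserts that guarantee to the
	// compiler whether or not the pointer actually satisfies it.
	int2* q = (int2*)( p + 1 ); // misaligned; dereferencing q is undefined
	return *q;                  // may be compiled to a single 8-byte load
}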

You could use your own (unaligned) struct of chars instead of int2, but then the compiler will split your 8-byte memory accesses into eight 1-byte memory accesses, which will defeat the purpose of your code…
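
For instance, an 8-byte struct with no alignment attribute (an illustrative type, not a recommendation) stays correct at any alignment but gives up the vectorized load:

// alignof(bytes8) is 1, so the compiler cannot assume 8-byte alignment and
// will not emit ld.global.v2.s32 for accesses of this type; it falls back
// to narrower (byte-sized) loads and stores instead.
struct bytes8 { char b[8]; };

__device__ void copyAnyAlignment( char* dst, const char* src, size_t words )
{
	for( size_t i = 0; i < words; ++i )
	{
		((bytes8*)dst)[i] = ((const bytes8*)src)[i]; // safe, but slow
	}
}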

So your workaround looks correct.

Well, at least it makes another selling point for Ocelot, don’t it? ;)

So after digging a little deeper into this, it turns out that the following lines

int2 tempA = ((int2*)source)[ i + index + stride * 0 ];
int2 tempB = ((int2*)source)[ i + index + stride * 1 ];
((int2*)destination)[ i + index + stride * 0 ] = tempA;
((int2*)destination)[ i + index + stride * 1 ] = tempB;

get compiled into

ld.global.v2.s32 	{%r12,%r13}, [%rd24+0];
ld.global.v2.s32 	{%r14,%r15}, [%rd28+0];
st.global.v2.s32 	[%rd25+0], {%r12,%r13};
st.global.v2.s32 	[%rd29+0], {%r14,%r15};

And from the PTX manual:

"The address must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined; i.e., the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault."
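
Concretely, the "silently masking off low-order address bits" case amounts to the following address arithmetic (a sketch of the observed hardware behavior, not something you would write deliberately):

__device__ int2 loadAsTheHardwareDid( const void* ptr )
{
	// Clear the low-order bits to round the address down to the access
	// size (8 bytes for int2), then load from the rounded address.
	size_t masked = (size_t) ptr & ~(size_t)( sizeof(int2) - 1 );
	return *(const int2*) masked;
}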

This gave me some trouble before because Ocelot happily performed the unaligned access, while the GPU I am using (a GTX 285) silently masked off the lower bits. Usually when something is unspecified I go with the easiest behavior, which in Ocelot was to just do the access unaligned. This has annoyed me enough, though, that I'm going to change the default behavior in Ocelot to throw an error here…

So Ocelot now throws an error in this case.

==Ocelot== Emulator failed to run kernel "_ZN3gpu10algorithms4cuda13gatherResultsIjSt4pairIjfEEEvPS3_IT_T0_ES8_PKmSA_SA_" with exception: 
==Ocelot== [PC 42] [thread 0] [cta 0] ld.global.v2.s32 {%r12, %r13}, [%rd24 + 0] - Memory access at 0x8389674 is not aligned to the access size (8 bytes)
==Ocelot==  At: Memory.h:101:0

I would encourage NVIDIA to do the same, since I cannot think of a case where changing the location being loaded is what the programmer intended…

Nice. I also just fixed a bug on my side: I wasn't previously making sure that cuMalloc'ed memory was properly aligned…
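
The usual shape of that fix is to round each sub-allocation offset up to the strictest alignment the device requires. A generic sketch (the names are illustrative, not Ocelot's actual code):

#include <cstddef>

// Round offset up to the next multiple of alignment (a power of two).
static size_t alignUp( size_t offset, size_t alignment )
{
	return ( offset + alignment - 1 ) & ~( alignment - 1 );
}

// Example: lay out two regions in one backing allocation so that each
// starts on a 16-byte boundary (wide enough for any vector type up to int4).
static void layout( size_t sizeA, size_t* offsetA, size_t* offsetB )
{
	*offsetA = alignUp( 0, 16 );
	*offsetB = alignUp( *offsetA + sizeA, 16 );
}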

Yes, there are very few safety checks at the hardware level on the current generation, and it gets worse as you go lower-level. (I'd be happy if the GPU just aborted execution when encountering an invalid opcode instead of doing random things. :) )