Data structure layout in memory different for host and device

I have run into a problem with data structures being laid out differently in device and host code.

I’ve isolated some code to test for this.

#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

class Managed {
public :
	void * operator new( size_t size ) {
		void * ptr ;
		::cudaMallocManaged( &ptr, size, cudaMemAttachGlobal ) ;
		return ptr ;
	}
	void operator delete( void * ptr ) {
		cudaFree( ptr ) ;
	}
} ;

class MySymbol : public Managed {
	unsigned int value ;
} ;

class MySymbol2 {
	unsigned int value ;
} ;

template<class K,class V> class MyPair : public Managed {
public :
	K k ;
	V v ;

    __host__ __device__ void printMemory() {
		// Cast explicitly: sizeof yields size_t and pointer subtraction yields
		// ptrdiff_t, neither of which matches %d / %lld on every platform.
		printf( "Pair() sizeof=%d\n", ( int ) sizeof( MyPair<K,V> ) ) ;
		printf( "Offset k=%lld v=%lld\n", ( long long )( ( char * ) &k - ( char * ) this ), ( long long )( ( char * ) &v - ( char * ) this ) ) ;
	}
} ;


template <class T> __global__ void print_memory( T * t ) {
	t->printMemory() ;
}

void TestDataStructureOffset() {
	MyPair<MySymbol,unsigned int> * pair = new MyPair<MySymbol,unsigned int>() ;
	MyPair<unsigned int,unsigned int> * pair2 = new MyPair<unsigned int,unsigned int>() ;
	MyPair<MySymbol2,unsigned int> * pair3 = new MyPair<MySymbol2,unsigned int>() ;

	std::cout << "HOST\n:" ;
	pair->printMemory() ;
	pair2->printMemory() ;
	pair3->printMemory() ;

	std::cout << "\n\nDEVICE\n:" ;
	print_memory<<<1,1>>>( pair ) ;
	print_memory<<<1,1>>>( pair2 ) ;
	print_memory<<<1,1>>>( pair3 ) ;

	cudaDeviceSynchronize() ;
}

On my platform (x64 Windows), I get the following result:

HOST
:Pair() sizeof=8
Offset k=0 v=4
Pair() sizeof=8
Offset k=0 v=4
Pair() sizeof=8
Offset k=0 v=4


DEVICE
:Pair() sizeof=12
Offset k=4 v=8
Pair() sizeof=8
Offset k=0 v=4
Pair() sizeof=8
Offset k=0 v=4

It appears that MyPair<MySymbol,unsigned int> is padded differently by the device compiler than by the host compiler. Is this a bug, or intended behavior? This is causing significant problems when porting C++ code from the CPU to the GPU.

Quoting a section from the programming guide:

The alignment requirement is automatically fulfilled for the built-in types of char, short,
int, long, long long, float, double like float2 or float4.
For structures, the size and alignment requirements can be enforced by the compiler
using the alignment specifiers __align__(8) or __align__(16)

The programming guide discusses alignment requirements at length; simply search the document for 'alignment' or 'alignment requirement'.
Key pages to note would be 75, 90, and 197 (document version 5.5).
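As a minimal illustration of what that specifier does (a hypothetical sketch; Pixel is not from the code above):

struct __align__(8) Pixel {
	unsigned short x ;
	unsigned short y ;
	float weight ;
} ;

__align__(8) expands to the host compiler's own alignment attribute (__declspec(align(8)) under MSVC, __attribute__((aligned(8))) under gcc), so the host and device compilers should agree on an 8-byte size and alignment for a plain struct like this. Note that it governs alignment and padding of ordinary structs; it does not by itself address the empty-base-class layout difference shown above.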

And good luck; by now I have abandoned structures, particularly mixed-type ones, as much as possible, in favour of simpler one-dimensional arrays that align without hassle or worry, roughly as sketched below.
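That flattened approach looks roughly like this (a hypothetical sketch with invented names, not code from above): instead of an array of mixed-type pair objects, keep one plain array per field, so only built-in types cross the host/device boundary.

// One managed array per field; built-in types are laid out identically
// on host and device, so there are no padding surprises.
void AllocateFlattenedPairs( size_t n, unsigned int ** keys, unsigned int ** values ) {
	cudaMallocManaged( ( void ** ) keys, n * sizeof( unsigned int ), cudaMemAttachGlobal ) ;
	cudaMallocManaged( ( void ** ) values, n * sizeof( unsigned int ), cudaMemAttachGlobal ) ;
}

__global__ void use_pairs( unsigned int * keys, unsigned int * values, size_t n ) {
	size_t i = blockIdx.x * blockDim.x + threadIdx.x ;
	if ( i < n ) values[ i ] += keys[ i ] ;	// logical pair i is ( keys[i], values[i] )
}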

I’ve filed a bug report for this, as the inconsistency between structs with base classes and structs without them seems incorrect.

Indeed, it seems there is an inconsistency for this case on Windows only. I tried your code on Linux (RHEL 6.2), and there the host compiler and device compiler agree on class arrangement in all cases; my printout for both HOST and DEVICE matches your DEVICE case. The Windows host compiler is known to behave a little differently on class arrangement in some cases. A few of those cases are documented here:

Programming Guide :: CUDA Toolkit Documentation

None of the 3 cases listed appears to be an exact fit for your case. In discussion with compiler engineers at NVIDIA, this case may need to be documented as well. A doc update to reflect this is too late to be incorporated in the CUDA 6.5 docs, since the CUDA 6.5 release is imminent, but there may be an update in the CUDA 7.0 time frame to shed light on this.

It was suggested that a possible workaround would be to make the base class (Managed) non-empty; a sketch of what I mean is below. Is this something you could try in your case? I don’t know if there would be any objections to that. Even if there were, it would still be interesting to know your results for test purposes.
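In case it helps, the workaround as I understand it is just an unused member in the base class (untested sketch):

class Managed {
public :
	void * operator new( size_t size ) {
		void * ptr ;
		::cudaMallocManaged( &ptr, size, cudaMemAttachGlobal ) ;
		return ptr ;
	}
	void operator delete( void * ptr ) {
		cudaFree( ptr ) ;
	}
private :
	char dummy ;	// makes the base class non-empty
} ;

The likely objection is that every class deriving from Managed then grows by at least one byte plus padding (MySymbol, for example, would probably become 8 bytes instead of 4), but it would still show whether the host and device layouts then agree.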

As indicated in the doc link above, there are sometimes discrepancies between host and device class arrangement in a few cases that have not been addressed in the compiler, particularly with respect to the Windows host compiler. For the foreseeable future, these cases need to be avoided in code that intends to share such classes between host and device.
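In the meantime, one defensive measure is a compile-time size check next to any class that is shared between host and device; nvcc runs file-scope code through both the host and the device front end, so a layout disagreement should surface as a build error rather than as silently corrupted data. This is only a sketch and needs a toolchain that accepts static_assert in .cu files (C++11 support):

// Shared header: assert the size the rest of the code actually relies on.
// Both the host pass and the device pass of nvcc evaluate this, so a
// host/device layout mismatch fails the build.
static_assert( sizeof( MyPair<MySymbol, unsigned int> ) == 2 * sizeof( unsigned int ),
	"MyPair<MySymbol,unsigned int> layout differs from what the code assumes" ) ;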