Structure Alignment? CUDA Structure Alignment differs?

Hi,

I have a problem with a struct that I have declared: its members only add up to 84 bytes, even though it is 16-byte aligned. Calling sizeof(MyStruct) in main.cpp returns 96, but calling sizeof(MyStruct) in my .cu file returns 112. I'm using 64-bit Linux, so pointers are 8 bytes long.

After thinking it through, it appears CUDA adds more padding than is actually needed. My guess is that it adds extra padding after the float3s and the ints:

struct __align__(16) MyStruct
{
  float3  a;   // 12
  float3  b;   // 12
  float3  c;   // 12
  float4  d;   // 16
  float*  px;  // 8
  int     pxo; // 4
  float*  pt;  // 8
  float*  py;  // 8
  int     pyo; // 4
};

The way I think CUDA interprets the memory layout (o being data, x being padding):

[oooo oooo oooo xxxx]
[oooo oooo oooo xxxx]
[oooo oooo oooo xxxx]
[oooo oooo oooo oooo]
[oooo oooo oooo xxxx]
[oooo oooo oooo oooo]
[oooo xxxx xxxx xxxx] = 112

The way C/C++ sees it:

[oooo oooo oooo oooo]
[oooo oooo oooo oooo]
[oooo oooo oooo oooo]
[oooo oooo oooo oooo]
[oooo oooo oooo oooo]
[oooo xxxx xxxx xxxx] = 96

So in essence the latter is packed while the former is fully padded.
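For reference, the host-side offsets can be dumped like this to check the diagram (a minimal sketch; it assumes the MyStruct definition above and CUDA's vector_types.h for float3/float4 are in scope):

#include <cstdio>
#include <cstddef>   // offsetof

// Print each member's offset plus the total size, to compare
// against the layout diagrams above.
int main()
{
  printf("a=%u b=%u c=%u d=%u\n",
         (unsigned)offsetof(MyStruct, a), (unsigned)offsetof(MyStruct, b),
         (unsigned)offsetof(MyStruct, c), (unsigned)offsetof(MyStruct, d));
  printf("px=%u pxo=%u pt=%u py=%u pyo=%u sizeof=%u\n",
         (unsigned)offsetof(MyStruct, px), (unsigned)offsetof(MyStruct, pxo),
         (unsigned)offsetof(MyStruct, pt), (unsigned)offsetof(MyStruct, py),
         (unsigned)offsetof(MyStruct, pyo), (unsigned)sizeof(MyStruct));
  return 0;
}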

Can anyone confirm this? I was using the CPP integration example code as a basis for my program.

Regards,

Linny

EDIT: corrected CUDA memory layout

I confirm.

Is there a #pragma pack directive in the nvcc compiler? With that you might be able to manually tune the struct member packing…

There's one for gcc, and forcing the host to see the struct the way CUDA does would help; however, I can't get it to reproduce CUDA's layout (most likely that's on me). Introducing dummy padding variables helps, but I don't think it's a good solution…
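For the record, the dummy-padding version would look something like this (a sketch only, assuming the 112-byte CUDA layout from my first post; the _pad names are made up):

struct __align__(16) MyStruct
{
  float3  a;    float  _pad0;     // 12 + 4
  float3  b;    float  _pad1;     // 12 + 4
  float3  c;    float  _pad2;     // 12 + 4
  float4  d;                      // 16
  float*  px;   int    pxo;       // 8 + 4
  float   _pad3;                  // 4
  float*  pt;   float* py;        // 8 + 8
  int     pyo;  int    _pad4[3];  // 4 + 12 -> sizeof == 112 on both sides
};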

Why not just reorder the struct members by decreasing alignment requirement? That way no padding is necessary, it will save space on the GPU, and it should not be that ugly (just make sure to document it ;-) ).
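Something like this, say (a sketch; whether the float3s really get 16-byte alignment under nvcc is exactly what's in question in this thread):

struct __align__(16) MyStruct
{
  float4  d;    // 16-byte-aligned member first
  float*  px;   // then the 8-byte members
  float*  pt;
  float*  py;
  float3  a;    // then the 4-byte-aligned ones
  float3  b;
  float3  c;
  int     pxo;
  int     pyo;
};              // 84 bytes of data, rounded up to 96 by __align__(16)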

Well, that would certainly work in some cases, but not all. Take e.g. two float3 and a float4: no matter how you arrange them, there will always be at least 4 bytes of padding needed for one of the float3s.
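To illustrate, assuming float3 really is 16-byte aligned as claimed above:

struct __align__(16) Example
{
  float4  d;   // offset  0, 16 bytes
  float3  a;   // offset 16, 12 bytes + 4 bytes padding (b needs 16-byte alignment)
  float3  b;   // offset 32, 12 bytes + 4 bytes trailing padding
};             // sizeof == 48 for 40 bytes of data, whichever way you order them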

I just found something similar to this, but a lot worse, because even sizeof() in the .cu file didn't catch it.

It’s CUDA 2.0 on linux64 (CentOS).

I have these structs (listed here with dependencies first):

typedef float3 SVec3f;
typedef float4 SVec4f;

struct SMat4f {
  SVec4f row[4];
};

struct SMat3f {
  SVec3f row[3];
};

struct STransformNode {
  SMat4f xform;
  SMat4f inverse;
};

struct SMaterial {
  SVec3f ke;  // emissive
  SVec3f ka;  // ambient
  SVec3f ks;  // specular
  SVec3f kd;  // diffuse
  SVec3f kr;  // reflective
  SVec3f kt;  // transmissive
  float shininess;
};

struct SSphere {
  STransformNode xform;
  SMaterial m;
};

I start by generating an array of SSphere on the host. Memory is allocated with something like

SSphere* h_spheres = new SSphere[numSpheres];

and each instance is accessed and populated using h_spheres[0], h_spheres[1], etc. Pretty straightforward.

I allocate device memory and copy the data, like the following

SSphere* d_spheres;
const size_t sphereBytes = sizeof(SSphere) * numSpheres;
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_spheres, sphereBytes ) );
CUDA_SAFE_CALL( cudaMemcpy( d_spheres, h_spheres, sphereBytes, cudaMemcpyHostToDevice ) );

d_spheres is passed to the kernel function as an argument, and each sphere is accessed using the usual array index notation.

__global__ void kernel_trace( SSphere* s, int numSpheres, ... ) {
  ...
  for (int i = 0; i < numSpheres; i++) {
    some_device_function(s[i]);
  }
  ...
}

The code worked as I expected in emulation mode, but not in debug/release mode. So I did a bunch of experiments with my code (obviously, the code snippet above is a simplified version, so there was a lot to suspect… :wacko: ) and the answer was…

When the kernel code accesses an element of the sphere array, say sphere[1], it actually reads from where &sphere[1] + 4 bytes points. The member values all come out wrong.

I found this by adding something like the following to the kernel…

// debug..
if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) {
  *sphere0 = scene.spheres[0];
  *sphere1 = scene.spheres[1];
}

Storage for sphere0 and sphere1 is cudaMalloc'ed from the host, and the result is cudaMemcpy'ed back to read out the values (obviously… the only and painful way to see them). The value of sphere1 was off by 4 bytes.
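The host side of that debug path looks roughly like this (a simplified sketch; d_sphere1 is a made-up name for the debug buffer handed to the kernel):

SSphere* d_sphere1;
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_sphere1, sizeof(SSphere) ) );
// ... launch the kernel so it writes scene.spheres[1] into d_sphere1 ...
SSphere h_sphere1;
CUDA_SAFE_CALL( cudaMemcpy( &h_sphere1, d_sphere1, sizeof(SSphere), cudaMemcpyDeviceToHost ) );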

(gdb) p h_spheres[1]
$5 = {xform = {xform = {row = {{x = 0.312566012, y = 0, z = 0, w = 1.39952004},
...
(gdb) p h_sphere1
$6 = {xform = {xform = {row = {{x = 0, y = 0, z = 1.39952004, w = 0}, {
...

I simply padded SSphere with an additional 4 bytes (one float), and then the device target started working.
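In code, the fix amounts to this (a sketch; _pad is a made-up name):

struct SSphere {
  STransformNode xform;  // 128 bytes
  SMaterial      m;      //  76 bytes
  float          _pad;   // 204 + 4 = 208, a multiple of 16, so the float4s
};                       // inside stay 16-byte aligned at every array index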

Without that pad, sizeof(SSphere) is reported as 204 in both the .cpp and the .cu code.
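(If anyone wants to double-check on their own setup: the device-side sizeof can be read back with a one-off kernel. A minimal sketch, with made-up names:)

__global__ void report_sizeof( unsigned int* out )
{
  *out = sizeof(SSphere);  // the size as the device compiler sees it
}

// host side:
unsigned int h_size, *d_size;
cudaMalloc( (void**) &d_size, sizeof(unsigned int) );
report_sizeof<<<1, 1>>>( d_size );
cudaMemcpy( &h_size, d_size, sizeof(unsigned int), cudaMemcpyDeviceToHost );
// compare h_size with the host's sizeof(SSphere)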

Should all structs in global memory be aligned to 8 bytes? I couldn't find any reference saying that. Am I missing something, or is this a compiler bug?

The same sizeof doesn't necessarily mean the same layout in memory. To avoid confusion you should specify the alignment explicitly. I think I remember something about the default being 16 for larger structs, but I could be wrong.

SMaterial will need padding, since float3 (your SVec3f) is 16-byte aligned. My advice would be to first specify the alignment explicitly, and then add padding bytes after the SVec3fs in SMaterial.
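Something along these lines, for example (a sketch of that advice, assuming SVec3f really is 16-byte aligned under nvcc; the _pad names are made up):

struct __align__(16) SMaterial {
  SVec3f ke;  float _pad0;  // pad each 12-byte vector out to 16
  SVec3f ka;  float _pad1;
  SVec3f ks;  float _pad2;
  SVec3f kd;  float _pad3;
  SVec3f kr;  float _pad4;
  SVec3f kt;  float _pad5;
  float  shininess;         // 4 bytes + 12 bytes tail padding from __align__(16)
};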

Hope that helps.

I am glad that I decided to use the 32-bit toolkit and SDK, even on 64-bit systems. With that choice I don't have any of these (mis)alignment issues.

Not sure if this is helpful to anybody, but I've tried to come up with an ALIGN macro that works in nvcc as well as gcc, on both Linux and Windows. You can apply it to structures and custom types as needed.

// Struct alignment is handled differently between the CUDA compiler and other
// compilers (e.g. GCC, MS Visual C++ .NET)
#ifdef __CUDACC__
  #define ALIGN(x) __align__(x)
#elif defined(_MSC_VER) && (_MSC_VER >= 1300)
  // Visual C++ .NET and later
  #define ALIGN(x) __declspec(align(x))
#elif defined(__GNUC__)
  // GCC
  #define ALIGN(x) __attribute__ ((aligned (x)))
#else
  // all other compilers
  #define ALIGN(x)
#endif

/**
 * this is how I typically use the macro
 */
typedef struct ALIGN(1) _SomeTypeStruct
{
  // struct members here
} SomeType;

Does passing -malign-double to gcc change anything? The release notes mention it is needed for structs containing long longs; maybe it applies to pointers too?

This sounds more like the actual cause.

How do I pass the argument '-malign-double' to gcc? I compile using a modified Makefile that includes common.mk. (Well, I couldn't find any documentation explaining a better way than that…) I tried adding -malign-double to CXXFLAGS and CFLAGS, which didn't work: cc1plus complained about a command-line argument it doesn't understand. So I changed CXX and CC to nvcc (formerly g++ and gcc, respectively), but the generated binary still works incorrectly.
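For what it's worth, when invoking nvcc directly, options for the host compiler can be forwarded with -Xcompiler, e.g. (file names are placeholders):

nvcc -Xcompiler -malign-double -c mykernel.cu -o mykernel.o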

The very first element of your structure is, ultimately, a float4, and that clearly must be aligned to 16 bytes (so it can be fetched with a single instruction). So your whole structure must be aligned to 16 bytes.

The question is why nvcc doesn't realize this by itself. __align__(16) does fix the situation.

Bug, bug bug bug.

P.S. __alignof(SSphere) does report 16.

P.P.S. I see float3 isn’t being aligned, unlike what the OP of this thread claimed. Is there a final word on that issue?

Vista64

CUDA 2.1

180.84