I just found something similar to this thing, but a lot worse because even sizeof() in .cu didn’t catch this.
It’s CUDA 2.0 on linux64 (CentOS).
I have these structs
struct SSphere {
STransformNode xform;
SMaterial m;
};
struct STransformNode {
SMat4f xform;
SMat4f inverse;
};
struct SMaterial {
SVec3f ke; // emissive
SVec3f ka; // ambient
SVec3f ks; // specular
SVec3f kd; // diffuse
SVec3f kr; // reflective
SVec3f kt; // transmissive
float shininess;
};
typedef float3 SVec3f;
typedef float4 SVec4f;
struct SMat4f {
SVec4f row[4];
};
struct SMat3f {
SVec3f row[3];
};
I start with generating an array of SSphere on the host. Memory is allocated something like
SSphere* h_spheres = new SSphere[numSpheres];
and each instance is accessed and populated using sphere[0], sphere[1] and etc. Pretty straightforward.
I allocate device memory and copy the data, like the following
SSphere* d_spheres;
const size_t sphereBytes = sizeof(SSphere) * numSpheres;
CUDA_SAFE_CALL( cudaMalloc( (void**) &(d_spheres), sphereBytes ) );
CUDA_SAFE_CALL( cudaMemcpy( d_spheres, h_spheres, sphereBytes, cudaMemcpyHostToDevice ));
d_spheres is passed to the kernel function as an argument, and each sphere is accessed using the usual array index notation.
__global__ void kernel_trace( SSphere* s, int numSpheres, ... ) {
...
for (int i=0, i < numSpheres, i++) {
some_device_function(s[i]);
}
...
}
The code worked as I expected in emulation mode, but not in debug/release mode. So I did a bunch of experiment with my code (obviously, the code snippet above is a simplified version, so there were a lot to suspect… :wacko: ) and the answer was…
when the kernel code access an element of the array sphere, say sphere[1], it points to what &sphere[1] + 4 is supposed to point to. The value of members are accessed all wrong.
I have found this by adding something like the following to the kernel…
// debug..
if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) {
*sphere0 = scene.spheres[0];
*sphere1 = scene.spheres[1];
}
storage for sphere0 and sphere1 are cudamallocated from host and the result is cudamemcopied back to read out the value (obviously… the only and painful way to see the value). The value of sphere1 was off by 4 bytes.
(gdb) p h_spheres[1]
$5 = {xform = {xform = {row = {{x = 0.312566012, y = 0, z = 0, w = 1.39952004},
...
(gdb) p h_sphere1
$6 = {xform = {xform = {row = {{x = 0, y = 0, z = 1.39952004, w = 0}, {
...
I simply padded the SSphere with additional 4 bytes (one float) then the device target started working.
In this case, sizeof(SSphere) is reported as 204 in both .cpp and .cu code.
Is all structs on the global memory should be aligned to 8 bytes? I couldn’t find any reference saying that. Am I missing something or is this a compiler bug?