Problem copying data from global to shared memory

I defined a float12 struct as follows:

typedef struct __builtin_align__(16) {
    float v1_x, v1_y, v1_z, v1_w, v2_x, v2_y, v2_z, v2_w, v3_x, v3_y, v3_z, v3_w;
} float12;

When I try to copy an item from an array in global memory to shared memory, I get a crash when running in real mode (emu mode works fine):

s_[threadIdx.x] = g_[startIndex+threadIdx.x];

I also tried different approaches (three float4 members and more) and keep running into the same problem: copying an item from an array in global memory to an item in shared memory works in emu mode but crashes in real mode. It works fine when I copy the data from global memory to a local variable. The only way I have found to copy the data into shared memory is field by field:

s_[threadIdx.x].v1_x = g_[startIndex+threadIdx.x].v1_x;
s_[threadIdx.x].v1_y = g_[startIndex+threadIdx.x].v1_y;
s_[threadIdx.x].v1_z = g_[startIndex+threadIdx.x].v1_z;
s_[threadIdx.x].v1_w = g_[startIndex+threadIdx.x].v1_w;

s_[threadIdx.x].v2_x = g_[startIndex+threadIdx.x].v2_x;
s_[threadIdx.x].v2_y = g_[startIndex+threadIdx.x].v2_y;
s_[threadIdx.x].v2_z = g_[startIndex+threadIdx.x].v2_z;
s_[threadIdx.x].v2_w = g_[startIndex+threadIdx.x].v2_w;

s_[threadIdx.x].v3_x = g_[startIndex+threadIdx.x].v3_x;
s_[threadIdx.x].v3_y = g_[startIndex+threadIdx.x].v3_y;
s_[threadIdx.x].v3_z = g_[startIndex+threadIdx.x].v3_z;
s_[threadIdx.x].v3_w = g_[startIndex+threadIdx.x].v3_w;

What’s going on?! :’(

Probably some silly mistake I’m making, right?

Use CUDA’s alignment macro rather than the GCC builtin so the alignment gets passed on properly to the G80 code backend.


As far as I know I am using the CUDA alignment (the same as in the definition of float4), the one from host_defines.h. What I really don’t get is why the copy works for a local variable but not for one in shared memory. I just tried copying the data from global to a local variable first and then moving it to shared memory. It works, but I don’t get any performance improvement from it (same results as copying each float from global to shared).


I believe prkipfer is correct, and the code is not using NVIDIA’s recommended alignment syntax.

typedef struct __builtin_align__(16)

should be

typedef struct __align__(16)

I do agree that the __builtin_align__(a) syntax is in the host_defines.h header, but there are two definitions:

#define __builtin_align__(a) \



#define __builtin_align__(a)

So it may be that the build uses the __builtin_align__(a) definition with an empty body, in which case the alignment specifier simply disappears.
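To illustrate what an empty macro body would do, here is a small host-side C++ sketch. The macro names ALIGN_ACTIVE and ALIGN_EMPTY and the two struct names are hypothetical stand-ins, not the actual definitions in host_defines.h; the first expands to the standard alignas, the second to nothing, mimicking the suspected empty definition.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins, NOT the real host_defines.h macros.
#define ALIGN_ACTIVE(a) alignas(a)  // alignment request survives preprocessing
#define ALIGN_EMPTY(a)              // empty body: the request silently vanishes

struct ALIGN_ACTIVE(16) Float12Aligned {
    float v1_x, v1_y, v1_z, v1_w;
    float v2_x, v2_y, v2_z, v2_w;
    float v3_x, v3_y, v3_z, v3_w;
};

struct ALIGN_EMPTY(16) Float12Plain {
    float v1_x, v1_y, v1_z, v1_w;
    float v2_x, v2_y, v2_z, v2_w;
    float v3_x, v3_y, v3_z, v3_w;
};

// Alignment the compiler actually applies to each variant.
constexpr std::size_t aligned_alignment() { return alignof(Float12Aligned); }
constexpr std::size_t plain_alignment()   { return alignof(Float12Plain); }
```

Both structs have the same size, but only the first is guaranteed to start on a 16-byte boundary; the second falls back to the natural alignment of float.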

NVIDIA has a document about vector loading here: …vectorLoads.pdf



PS: This might not fix things, but at least using NVIDIA’s syntax and discovering a bug may get their attention :)

Exactly. That’s why it is in host_defines.h. If you compile for G80, the code gets split up differently before being passed on to nvopencc and ptxas. They don’t understand the GCC builtin, as they are not GCC. Use the macro __align__(16). It gets expanded according to the compiler stage you are at.

If you are not familiar with how the code gets processed in the different modes, add -v to the nvcc command line. It will show the actual compiler sequence issued.


Thanks guys, I’ll try it and let you know.

:( It didn’t do the trick … now I’m trying to do a test with just float4 from the SDK …

It seems there is another thread about a problem copying data from global to shared directly. Has anybody managed to copy data that has several members in it (like float4) from global to shared directly?

This is only partly related to the topic: I think it’s not a good idea to keep an array of floatN’s in shared memory because of N-way bank conflicts. I’d suggest keeping N separate float arrays: first read from global memory into a floatN register, then copy each element of that register to the same index in the N separate shared arrays.


float4 temp = g_array[offset + tid];

shared_x[tid] = temp.x;

shared_y[tid] = temp.y;

shared_z[tid] = temp.z;

shared_w[tid] = temp.w;
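The bank-conflict argument can be checked with a little arithmetic. The host-side C++ sketch below assumes the G80 shared-memory layout of 16 banks of 32-bit words, serviced one half-warp (16 threads) at a time; those figures come from the CUDA documentation for that hardware generation, not from this thread, and conflict_degree is a hypothetical helper.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed G80 figures: 16 banks, requests issued per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;

// Thread t accesses the 32-bit word at index t * strideWords.
// Returns the largest number of threads in a half-warp that land in one bank
// (1 means conflict-free).
int conflict_degree(int strideWords) {
    std::array<int, kBanks> hits{};  // zero-initialised hit counter per bank
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[(t * strideWords) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, conflict_degree(1) is 1 (the separate float arrays above), while conflict_degree(4) and conflict_degree(12) are both 4, which corresponds to reading one member out of an array of float4 or float12 structs.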


Thanks, I’ll give that a try later and see if I get better results. Anyway, I found the problem: I was allocating the memory on the host using new. When I switched to malloc it suddenly worked. I assume it has something to do with the alignment issues you mentioned earlier. Anyway, this should be put somewhere public, like the FAQ.
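One way to test the alignment guess is to print or assert the alignment of the host pointer before copying it to the device. This is only a sketch in plain C++; is_aligned is a hypothetical helper, and whether a misaligned host buffer was really the cause here is an assumption, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if pointer p sits on a multiple of `alignment` bytes.
// `alignment` must be non-zero.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Calling is_aligned(hostPtr, 16) on the buffer returned by new and on the one returned by malloc (hostPtr being whatever host pointer you allocate) would show whether the two allocators actually differ on your platform.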