bug (?) with GCC __attribute__ ((aligned (16))); memory alignment corrupts data

Hello,

I could have find a bug using CUDA with an aligned memory (because I use the SSE instructions on CPU and alignment is good for performance)

In fact if I construct a nested array based on an aligned 16 bytes block, I could corrupt the memory with data exchange between the CPU <-> GPU.

(on mac my and a linux-pc)

I write a small program to put in view the “bug ?”

1 #include <iostream>

  2 #include <cuda.h>

  3 #include <cuda_runtime_api.h>

  4 

  5 #define NUM 3

  6 

  7 template<int N>

  8 struct container

  9 {

 10     int & operator[](int i){return a[i];} //just for left value init

 11 

 12     //int a[N];

 13     int a[N]  __attribute__ ((aligned (16))); //bug if N is not init to a multiple of 8

 14 };

 15 

 16 int main(int argc, char* argv[])

 17 {

 18    container<NUM> a[10],arecv[10]; // set up array

 19 

 20    for(int i=0; i<10;i++){ //init ...

 21        for(int j=0; j<NUM;j++)

 22            a[i][j] = i;

 23    }

 24 

 25    int* pa; //pointer for cuda

 26 

 27    cudaMalloc((void**)&(pa),10* NUM*sizeof(int)); // alloc

 28    cudaMemset((void*)pa,0, 10*NUM*sizeof(int)); // set to 0

 29    cudaMemcpy((void*)pa,(void*)a,10*NUM*sizeof(int),cudaMemcpyHostToDevice);

 30    cudaMemcpy((void*)arecv,(void*)pa,10*NUM*sizeof(int),cudaMemcpyDeviceToHost); 

 31 

 32    for(int i=0; i<NUM*10;i++)

 33        printf(" %i  \n ", *(&arecv[0][0]+i)); // print, should be 000,111,222,333 ....

 34 

 35     return 0;

 36 }

I get with the GCC allocator, for N=3

0  

  0  

  0  

  32767  

  1  

  1  

  1  

  32767  

  2  

  2  

  2  

  32767  

  3  

  3  

  3  

  32767  

  4  

  4  

  4  

  1  

  5  

  5  

  5  

  32767  

  6  

  6  

  6  

  32767  

  7  

  7

instead of

0  

  0  

  0  

  1  

  1  

  1  

  2  

  2  

  2  

  3  

  3  

  3  

  4  

  4  

  4  

  5  

  5  

  5  

  6  

  6  

  6  

  7  

  7  

  7  

  8  

  8  

  8  

  9  

  9  

  9

If I remove the aligned instruction, everything is fine.

my config :

Cuda compilation tools, release 4.0, V0.2.1221

GPU Driver Version: 7.4.10 270.05.05f01

MacOS 10.7

A few personal comments and questions :

[list=1]

[*]Is it a bug ?

[*]The align command should only affect the cpu men not inside the GPU ?

[*]If I align the pointer pa, it does not change anything to the bug, and anyway does it have any importance ?

[*]Tha align command just assure the first address of the array is aligned on 16 bytes, what is the incompatibility with CUDA ?

[*]CUDA API do make tricky stuff incompatible with alignment ?

Thank you, for your help !
main.cpp (937 Bytes)

Why do you think you should get 0,0,0, 1,1,1, 2,2,2,…? You are printing contiguous integers from memory, and have asked for alignment. The gaps caused by aligning of the structure show up as random values.

Oh shame on me ! I forgot this point ! So no bug, and thank you for your constructive remarks !