My CUDA device has compute capability 2.1, and I read that this version has 32 shared memory banks and a warp size of 32, but I can't find anywhere whether it executes simultaneously in half-warps or full warps.
My second question: suppose I have one block of 32 x 32 threads, and in the kernel I declare shared memory as:
__shared__ int As[32][32];
Now, in what order will the threads execute the kernel? I'm asking because I want to avoid bank conflicts, so will they run like this (imagine 32 at the same time):
First: ty = 0, tx: 0 - 31
Second: ty = 1, tx: 0 - 31 (thread IDs 32 - 63)
or maybe in reverse order, or completely at random? I want to know this because if I do:
As[tx][ty] - and the warps execute as shown above, then all 32 threads of a warp access the same bank!
Thanks for any help!!!