Compute capability 2.1 - no. of banks

Hey guys.
My GPU's compute capability is 2.1, and I read that this version has 32 shared memory banks and a warp size of 32, but I can't find anywhere whether shared memory accesses are serviced per half-warp or per full warp.

My second question is: suppose I have one block of 32 x 32 threads, and I declare shared memory in the kernel as:

__shared__ int As[32][32];

Now, in what order are the threads going to execute the kernel? I want to avoid bank conflicts, so are they going to do (imagine 32 at the same time):

First: ty = 0, tx = 0-31 (thread IDs 0-31)
Second: ty = 1, tx = 0-31 (thread IDs 32-63)

etc

or maybe in reverse order, or completely at random? I want to know this because I want to avoid shared memory bank conflicts, and if I do:

As[tx][ty] and they execute as shown first above, then all 32 threads in the warp access the same bank!

Thanks for any help!

The CUDA Programming Guide Section 2.2 gives the mapping from 2D or 3D thread indexing to thread ID. For the 2D case, it is threadIdx.x + blockDim.x * threadIdx.y. Threads are then grouped into warps by consecutive thread ID.

OK, so if I write out the thread IDs on paper, converting from 2D to 1D, and divide them into chunks of 32 (warps), each chunk will be executed together in that order. Is that correct?