float3-array versus 3 float-arrays in shared memory?

Hi,
I’m just wondering which of the following two methods would be better for consecutively storing and accessing 256 three-dimensional float vectors in shared memory:

  1. Using a float3-array: thread ‘i’ writes to shared[i].x, shared[i].y and shared[i].z.
    This allows e.g. shared[i] = make_float3(…), which is very convenient. But aren’t the 3 float values then mapped to 3 consecutive banks? And aren’t the three writes of the statement shared[i] = make_float3(…) performed sequentially? If so, when performed by several threads the access pattern would be as follows (a code sketch of this approach follows after the list):

1st step:

  • Bank 0 is accessed by Thread 0, 5, 10, … to write shared[i].x
  • Bank 1 is accessed by no Thread
  • Bank 2 is accessed by no Thread
  • Bank 3 is accessed by Thread 1, 6, 11, … to write shared[i].x
  • Bank 4 is accessed by no Thread
  • Bank 5 is accessed by no Thread

2nd step:

  • Bank 0 is accessed by no Thread
  • Bank 1 is accessed by Thread 0, 5, 10… to write shared[i].y
  • Bank 2 is accessed by no Thread
  • Bank 3 is accessed by no Thread
  • Bank 4 is accessed by Thread 1, 6, 11, … to write shared[i].y
  • Bank 5 is accessed by no Thread

3rd step:

  • Bank 0 is accessed by no Thread
  • Bank 1 is accessed by no Thread
  • Bank 2 is accessed by Thread 0, 5, 10…, to write shared[i].z
  • Bank 3 is accessed by no Thread
  • Bank 4 is accessed by no Thread
  • Bank 5 is accessed by Thread 1, 6, 11, … to write shared[i].z

Hence in each step only every third bank would be accessed by the threads of a warp?
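For reference, here is a minimal sketch of what approach 1 could look like (the kernel name, the single 256-thread block and the dummy global input/output arrays are my own assumptions, not taken from the post):

#include <cuda_runtime.h>

#define N 256  // one vector per thread, one block of 256 threads assumed

// Approach 1: a float3 array in shared memory (array-of-structures layout).
// Thread i stores all three components of its vector in shared[i].
__global__ void storeFloat3(const float *gx, const float *gy,
                            const float *gz, float *out)
{
    __shared__ float3 shared[N];

    int i = threadIdx.x;

    // One convenient statement writes x, y and z of vector i; the compiler
    // emits separate 32-bit stores for the three components.
    shared[i] = make_float3(gx[i], gy[i], gz[i]);

    __syncthreads();

    // Dummy read-back so the stores are not optimised away.
    out[i] = shared[i].x + shared[i].y + shared[i].z;
}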

  2. Using three float-arrays, the first for the x-values, the second for the y-values and the third for the z-values. Thread ‘i’ then writes shared[i] = valX, shared[256+i] = valY and shared[512+i] = valZ, which is more cumbersome than make_float3(…) but should lead to the following access pattern (a code sketch of this approach follows after the list):

1st step:

  • Bank 0 is accessed by Thread 0, 16, … to write valX
  • Bank 1 is accessed by Thread 1, 17, … to write valX
  • Bank 2 is accessed by Thread 2, 18, … to write valX
  • Bank 3 is accessed by Thread 3, 19, … to write valX
  • Bank 4 is accessed by Thread 4, 20, … to write valX
  • Bank 5 is accessed by Thread 5, 21, … to write valX

2nd step:

  • Bank 0 is accessed by Thread 0, 16, … to write valY
  • Bank 1 is accessed by Thread 1, 17, … to write valY
  • Bank 2 is accessed by Thread 2, 18, … to write valY
  • Bank 3 is accessed by Thread 3, 19, … to write valY
  • Bank 4 is accessed by Thread 4, 20, … to write valY
  • Bank 5 is accessed by Thread 5, 21, … to write valY

3rd step:

  • Bank 0 is accessed by Thread 0, 16, … to write valZ
  • Bank 1 is accessed by Thread 1, 17, … to write valZ
  • Bank 2 is accessed by Thread 2, 18, … to write valZ
  • Bank 3 is accessed by Thread 3, 19, … to write valZ
  • Bank 4 is accessed by Thread 4, 20, … to write valZ
  • Bank 5 is accessed by Thread 5, 21, … to write valZ

Hence in each step all of the banks are accessed, perfectly consecutively, by each half-warp?
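And the corresponding sketch of approach 2, with one flat float array holding the x block, the y block and the z block (again, names and the dummy global arrays are assumptions of mine):

#include <cuda_runtime.h>

#define N 256  // one vector per thread

// Approach 2: structure-of-arrays layout in one flat shared array:
// [0, N) holds the x-values, [N, 2N) the y-values, [2N, 3N) the z-values.
// In each of the three steps, consecutive threads hit consecutive banks.
__global__ void storeFloatSoA(const float *gx, const float *gy,
                              const float *gz, float *out)
{
    __shared__ float shared[3 * N];

    int i = threadIdx.x;

    shared[i]         = gx[i];  // valX
    shared[N + i]     = gy[i];  // valY
    shared[2 * N + i] = gz[i];  // valZ

    __syncthreads();

    // Dummy read-back so the stores are not optimised away.
    out[i] = shared[i] + shared[N + i] + shared[2 * N + i];
}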

So the question again: should I prefer the first or the second approach with respect to performance? And am I right in assuming that in the first approach the writes of a float3 vector are performed sequentially as shown?

Thanks for any clarification!
wagalaweia

What about allocating more shared memory but keeping the same number of threads? That’s the best of both worlds, because padded global reads are automatically performed on float3s, and you perform fewer jumps just to allocate arrays.

In any case, why cross warps so much? Why not just use every third element to store z and so on?
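If I read the “every third element” suggestion correctly (my interpretation, not necessarily what was meant), it would be a flat float array addressed with stride 3, i.e. the same interleaved layout a float3 array has, just indexed by hand:

#include <cuda_runtime.h>

#define N 256  // one vector per thread

// Interleaved flat layout: thread i uses elements 3*i, 3*i+1 and 3*i+2
// for x, y and z, which is layout-equivalent to a float3 array.
__global__ void storeInterleaved(const float *gx, const float *gy,
                                 const float *gz, float *out)
{
    __shared__ float shared[3 * N];

    int i = threadIdx.x;

    shared[3 * i + 0] = gx[i];  // x
    shared[3 * i + 1] = gy[i];  // y
    shared[3 * i + 2] = gz[i];  // z

    __syncthreads();

    out[i] = shared[3 * i] + shared[3 * i + 1] + shared[3 * i + 2];
}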

Hi, while I think you got the numbers a bit incorrect in your first example, I am quite sure it is correct that using float3 instead of 3 float arrays will lead to bank conflicts and hence reduce the overall performance.
Best regards
ceearem

Thanks for both your replies. In the first example I really did count wrong! For float3, thread 4 writes its x-value to bank 12 and thread 5 writes its x-value to bank 15 (and thread 6 to bank 2). Hence there are no bank conflicts after all. Only for float4 would thread 4 write its x-value to bank 0.
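For anyone following along, the corrected counting can be checked with a few lines of host code (this assumes the 16 banks and half-warps of compute-capability 1.x devices, where 32-bit word w of shared memory lives in bank w % 16):

#include <stdio.h>

int main(void)
{
    const int BANKS = 16;  // compute capability 1.x: 16 banks, half-warp of 16 threads

    // float3: the x-component of thread i is 32-bit word 3*i -> bank (3*i) % 16.
    // Since gcd(3, 16) == 1, the 16 threads of a half-warp hit 16 distinct banks.
    printf("float3 .x banks: ");
    for (int i = 0; i < 16; ++i)
        printf("%d ", (3 * i) % BANKS);   // 0 3 6 9 12 15 2 5 ...
    printf("\n");

    // float4: the x-component of thread i is word 4*i -> bank (4*i) % 16,
    // so thread 4 lands on bank 0 again, as noted above.
    printf("float4 .x banks: ");
    for (int i = 0; i < 16; ++i)
        printf("%d ", (4 * i) % BANKS);   // 0 4 8 12 0 4 8 12 ...
    printf("\n");

    return 0;
}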

But anyway: I just ran several tests of this for float3 and float4 in practice. The result: there is nearly no difference in performance, and if anything, it goes the other way: using float3 and float4 even seems to be a bit faster than using 3 (resp. 4) float arrays. I don’t know why; perhaps a look at the assembler code would clarify this, but I won’t dig deeper into it.

Hence using float3 and float4 seems to be the better choice, since it is much more convenient, and at least as fast as the other approach.
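For what it’s worth, a micro-benchmark along these lines could look roughly as follows; this is only a sketch of how such a test might be set up (the kernel names, the single 256-thread block, the iteration count and the cudaEvent timing are all my assumptions, not the poster’s actual test code):

#include <cuda_runtime.h>
#include <stdio.h>

#define N     256   // one block of 256 threads
#define ITERS 10000 // repeat the shared-memory traffic to get measurable times

// AoS variant: float3 array in shared memory.
__global__ void benchFloat3(const float *in, float *out)
{
    __shared__ float3 s[N];
    int i = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < ITERS; ++k) {
        s[i] = make_float3(in[i], in[i] + 1.0f, in[i] + 2.0f);
        __syncthreads();
        acc += s[i].x + s[i].y + s[i].z;
        __syncthreads();
    }
    out[i] = acc;
}

// SoA variant: three blocks of N floats in one shared array.
__global__ void benchSoA(const float *in, float *out)
{
    __shared__ float s[3 * N];
    int i = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < ITERS; ++k) {
        s[i]         = in[i];
        s[N + i]     = in[i] + 1.0f;
        s[2 * N + i] = in[i] + 2.0f;
        __syncthreads();
        acc += s[i] + s[N + i] + s[2 * N + i];
        __syncthreads();
    }
    out[i] = acc;
}

int main(void)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));

    // Warm-up launches so context creation does not distort the timing.
    benchFloat3<<<1, N>>>(d_in, d_out);
    benchSoA<<<1, N>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msAoS = 0.0f, msSoA = 0.0f;

    cudaEventRecord(start);
    benchFloat3<<<1, N>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msAoS, start, stop);

    cudaEventRecord(start);
    benchSoA<<<1, N>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msSoA, start, stop);

    printf("float3 (AoS): %.3f ms   3 float arrays (SoA): %.3f ms\n", msAoS, msSoA);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}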

Actually, nvcc automatically uses different shared memory banks for each float in a float3; therefore, when you access a float3 in shared memory, since each value resides in a different bank, there are no bank conflicts.

If I’m not mistaken, you can check it out in the Programming guide.

Bye…