Hi,
I’m wondering which of the following two methods is better for storing and accessing 256 consecutive three-dimensional float vectors in shared memory:
- Using a float3 array: thread ‘i’ then writes to shared[i].x, shared[i].y and shared[i].z.
This makes it possible to write shared[i] = make_float3(…), which is very convenient. But aren’t the 3 float values then mapped to 3 consecutive banks? And aren’t the three writes behind the statement shared[i] = make_float3(…) performed sequentially? If so, when several threads execute it, the access pattern would be as follows:
1st step:
- Bank 0 is accessed by Thread 0, 5, 10, … to write shared[i].x
- Bank 1 is accessed by no Thread
- Bank 2 is accessed by no Thread
- Bank 3 is accessed by Thread 1, 6, 11, … to write shared[i].x
- Bank 4 is accessed by no Thread
- Bank 5 is accessed by no Thread
- …
2nd step:
- Bank 0 is accessed by no Thread
- Bank 1 is accessed by Thread 0, 5, 10, … to write shared[i].y
- Bank 2 is accessed by no Thread
- Bank 3 is accessed by no Thread
- Bank 4 is accessed by Thread 1, 6, 11, … to write shared[i].y
- Bank 5 is accessed by no Thread
- …
3rd step:
- Bank 0 is accessed by no Thread
- Bank 1 is accessed by no Thread
- Bank 2 is accessed by Thread 0, 5, 10, … to write shared[i].z
- Bank 3 is accessed by no Thread
- Bank 4 is accessed by no Thread
- Bank 5 is accessed by Thread 1, 6, 11, … to write shared[i].z
- …
So in each step only one out of every three banks is accessed by the threads of a warp?
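One way to check this assumed pattern is to compute the bank index of every write directly. The sketch below is plain host-side C++ (it should also compile with nvcc) and rests on assumptions that are not confirmed anywhere in this post: 16 banks of one 32-bit word each, conflicts evaluated per half-warp of 16 threads, and the float3 store being split into three separate 32-bit writes. The names `float3_bank`, `float3_conflict_degree` and `print_float3_pattern` are made up for illustration.

```cpp
#include <array>
#include <cstdio>

// Assumed hardware model (not confirmed by the post): 16 banks, each one
// 32-bit word wide, with conflicts evaluated per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;

// Bank hit by thread `tid` when it writes component `c` (0 = x, 1 = y, 2 = z)
// of shared[tid] in a float3 array: the 32-bit word index is 3 * tid + c.
constexpr int float3_bank(int tid, int c) {
    return (3 * tid + c) % kBanks;
}

// Largest number of threads of the first half-warp that land on the same
// bank while writing component `c`; 1 means that step is conflict-free.
int float3_conflict_degree(int c) {
    std::array<int, kBanks> hits{};
    int worst = 0;
    for (int tid = 0; tid < kHalfWarp; ++tid) {
        int n = ++hits[float3_bank(tid, c)];
        if (n > worst) worst = n;
    }
    return worst;
}

// Prints, for each of the three steps, the bank each thread of the first
// half-warp writes to.
void print_float3_pattern() {
    for (int c = 0; c < 3; ++c) {
        std::printf("step %d (component %c):\n", c + 1, "xyz"[c]);
        for (int tid = 0; tid < kHalfWarp; ++tid)
            std::printf("  thread %2d -> bank %2d\n", tid, float3_bank(tid, c));
    }
}
```

Under these assumptions, thread 5 writes shared[5].x to bank 15 rather than bank 0, and the 16 threads of a half-warp land on 16 different banks in every step, because the stride of 3 words shares no common factor with 16. If the model holds, the idle banks in the lists above would not actually occur.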
- Using three float arrays: the first for the x values, the second for the y values and the third for the z values. Thread ‘i’ then writes shared[i] = valX, shared[256+i] = valY and shared[512+i] = valZ, which is more cumbersome than make_float3(…) but should lead to this access pattern:
1st step:
- Bank 0 is accessed by Thread 0, 16, … to write valX
- Bank 1 is accessed by Thread 1, 17, … to write valX
- Bank 2 is accessed by Thread 2, 18, … to write valX
- Bank 3 is accessed by Thread 3, 19, … to write valX
- Bank 4 is accessed by Thread 4, 20, … to write valX
- Bank 5 is accessed by Thread 5, 21, … to write valX
- …
2nd step:
- Bank 0 is accessed by Thread 0, 16, … to write valY
- Bank 1 is accessed by Thread 1, 17, … to write valY
- Bank 2 is accessed by Thread 2, 18, … to write valY
- Bank 3 is accessed by Thread 3, 19, … to write valY
- Bank 4 is accessed by Thread 4, 20, … to write valY
- Bank 5 is accessed by Thread 5, 21, … to write valY
- …
3rd step:
- Bank 0 is accessed by Thread 0, 16, … to write valZ
- Bank 1 is accessed by Thread 1, 17, … to write valZ
- Bank 2 is accessed by Thread 2, 18, … to write valZ
- Bank 3 is accessed by Thread 3, 19, … to write valZ
- Bank 4 is accessed by Thread 4, 20, … to write valZ
- Bank 5 is accessed by Thread 5, 21, … to write valZ
- …
So in each step all of the banks are accessed, with each half-warp hitting perfectly consecutive banks?
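The same kind of check can be sketched for the three-array layout. Again this is plain host-side C++ under an assumed model that the post does not confirm (16 banks of 32-bit words, conflicts counted per half-warp of 16 threads). The helper names `soa_word`, `soa_bank` and `soa_conflict_degree` are hypothetical.

```cpp
#include <array>

// Assumed hardware model (not confirmed by the post): 16 banks, each one
// 32-bit word wide, with conflicts evaluated per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;
constexpr int kVectors = 256;  // number of vectors, as stated in the post

// Word index written by thread `tid` for component `c` (0 = x, 1 = y, 2 = z)
// when the layout is shared[c * 256 + tid].
constexpr int soa_word(int tid, int c) {
    return c * kVectors + tid;
}

// Bank that this word falls into.
constexpr int soa_bank(int tid, int c) {
    return soa_word(tid, c) % kBanks;
}

// Largest number of threads in the half-warp starting at `first_tid` that
// hit the same bank while writing component `c`; 1 means conflict-free.
int soa_conflict_degree(int first_tid, int c) {
    std::array<int, kBanks> hits{};
    int worst = 0;
    for (int tid = first_tid; tid < first_tid + kHalfWarp; ++tid) {
        int n = ++hits[soa_bank(tid, c)];
        if (n > worst) worst = n;
    }
    return worst;
}
```

Because the offsets 256 and 512 are multiples of 16, they do not change the bank: in this model thread i always lands on bank i mod 16, whichever component it writes. That matches the lists above.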
So, the question again: which of the two approaches should I prefer in terms of performance? And am I right in assuming that, in the first approach, the write of a float3 vector is performed sequentially as shown?
Thanks for any clarification!
wagalaweia