Hi selah,
my understanding of the code in the case you mention (size=32, len=25) was that
- at line 3, “mask” will have only bits 0-24 set (the threads for which threadIdx.x < len)
- at line 5, only the threads 0-24 will enter the “if”
- at line 6, “val” will contain the corresponding element
- inside the for loop at line 7, each lane adds to its own “val” the “val” of the lane that is “offset” positions after it
IMHO the mask and the execution are consistent: both select exactly the threads 0-24 and no others.
So, after each iteration, the value of “val” in each thread will accumulate the values from the threads that are 16, 8, 4, 2, 1 lanes after it.
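Before the loop (here I write “N” to mean “input[N]” and “##” for unused values; in the later tables, each “+N” row lists the terms added to the lanes, column by column starting from lane 0):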
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
After the first iteration (offset = 16):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
+16 +17 +18 +19 +20 +21 +22 +23 +24
After the second iteration (offset = 8):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
+8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+16 +17 +18 +19 +20 +21 +22 +23 +24
+24
After the third iteration (offset = 4):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
+4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+16 +17 +18 +19 +20 +21 +22 +23 +24
+20 +21 +22 +23 +24
+24
After the fourth iteration (offset = 2):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
+2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+16 +17 +18 +19 +20 +21 +22 +23 +24
+18 +19 +20 +21 +22 +23 +24
+20 +21 +22 +23 +24
+22 +23 +24
+24
After the last iteration (offset = 1):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ## ## ## ## ## ## ##
+1 +2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+15 +16 +17 +18 +19 +20 +21 +22 +23 +24
+16 +17 +18 +19 +20 +21 +22 +23 +24
+17 +18 +19 +20 +21 +22 +23 +24
+18 +19 +20 +21 +22 +23 +24
+19 +20 +21 +22 +23 +24
+20 +21 +22 +23 +24
+21 +22 +23 +24
+22 +23 +24
+23 +24
+24
So, after the last iteration, “val” for thread 0 contains the sum of all 25 elements, “val” for thread 1 contains the sum of elements 1-24, and so on.
selah, Robert, what did I get wrong?
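To make the tables above easier to check, here is a minimal host-side sketch in plain C++ that replays the same steps. It only models my reading of the loop, under the assumption that a lane adds nothing when the lane “offset” positions after it is outside 0-24. It does not call any CUDA API, so it says nothing about what the shuffle really returns for lanes outside the mask. The names model_reduce and kWarpSize are made up for this illustration.

```cpp
#include <array>

// Hypothetical model of the walkthrough, not a CUDA API.
constexpr int kWarpSize = 32;

// Replays the loop for one warp on the host.
// Assumption (mine, from the tables): a lane only adds the value of lane
// (lane + offset) when that source lane is below len; otherwise it keeps
// its own value unchanged.
std::array<long, kWarpSize> model_reduce(const std::array<long, kWarpSize>& input,
                                         int len) {
    std::array<long, kWarpSize> val = input;
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        // Every lane reads the values as they were before this step.
        const std::array<long, kWarpSize> prev = val;
        for (int lane = 0; lane < len; ++lane) {
            const int src = lane + offset;
            if (src < len) {
                val[lane] = prev[lane] + prev[src];
            }
        }
    }
    return val;
}
```

With input[N] = N for N in 0-24 and len = 25, this model leaves 300 (0+1+...+24) in lane 0 and 180 (16+17+...+24) in lane 16, which matches the last table.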