Part of the code is as follows. I call __syncthreads() near the bottom and then print the result when threadIdx.x==0. At that point, all threads in the block should have completed their work, so first_sum[0] and first_sum[1] should both hold a value, but the printed output shows first_sum[1] as 0.
lane_id is 0 for threadIdx.x==0, so this thread only writes to first_sum[0]. first_sum[1] is uninitialized, and any read of it is undefined behavior (UB), which formally applies to the whole program.
Did you expect first_sum to be in shared memory?
Did you want to output first_sum[1] from threadIdx.x==32?
Did you want to write to first_sum[1 - lane_id] at some point?
BTW, a nicer variant of *reinterpret_cast<T*>(&i) is reinterpret_cast<T&>(i). Both are potentially UB, but typically work in CUDA code. Since C++20 they can usually be replaced with std::bit_cast<T>(i), which returns the converted value by value rather than as a reference.
BTW, the word lane typically refers to the thread index 0…31 within a CUDA warp. You use it differently (which can make sense in some domains/contexts).
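To illustrate the reinterpret_cast remark: where std::bit_cast is not available, copying the bytes with memcpy is a common way to reinterpret a value without the aliasing problem. This is only a sketch; bit_copy is a hypothetical helper name, not something from the original code, and it assumes both types are trivially copyable and of equal size.

```cuda
#include <cstring>

// Hypothetical helper (not from the original code): reinterpret the bytes
// of `src` as type To by copying them, instead of casting pointers.
// Assumes To and From are trivially copyable and have the same size.
template <typename To, typename From>
__host__ __device__ inline To bit_copy(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "bit_copy: sizes must match");
    To dst;
    memcpy(&dst, &src, sizeof(To));  // byte-wise copy, no aliasing violation
    return dst;
}
```

For example, bit_copy<unsigned int>(1.0f) yields the IEEE 754 bit pattern 0x3F800000. Compilers usually turn the memcpy into a plain register move, so this typically costs nothing compared with the cast.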
When threadIdx.x==0 and threadIdx.x==32, first_sum[0] and first_sum[1] are written. I used __syncthreads(), so when the results are printed, first_sum[0] and first_sum[1] should both have a value. Am I misunderstanding something?
You only print out when threadIdx.x is equal to zero. Your variable:
half first_sum[2];
is a thread-local variable. That means that thread 0 has its own separate local array, and thread 1 has its own separate local array, etc.
Therefore, when you print out, you are printing out the values from the threadIdx.x == 0 array, and for threadIdx.x == 0, the value in first_sum[1] is never written.
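If the intent was for both warps to write into one array that thread 0 then reads, the array could be declared in shared memory. The following is a minimal sketch under assumptions, since the full kernel is not shown: the names first_sum_shared, run_first_sum_shared and warp_id are hypothetical, lane_id in the question is assumed to mean threadIdx.x / 32, the stored values are placeholders, and the block is assumed to have 64 threads.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical kernel: the first thread of each warp stores one value,
// then thread 0 reads both values back after the barrier.
__global__ void first_sum_shared(float* out)
{
    // Shared: one array per block, visible to every thread in the block.
    __shared__ half first_sum[2];

    const int warp_id = threadIdx.x / 32;  // assumed meaning of lane_id in the question

    if (threadIdx.x % 32 == 0) {
        // Placeholder values (1 for warp 0, 2 for warp 1), not the real sums.
        first_sum[warp_id] = __float2half(static_cast<float>(warp_id + 1));
    }

    __syncthreads();  // both writes are now visible to the whole block

    if (threadIdx.x == 0) {
        out[0] = __half2float(first_sum[0]);
        out[1] = __half2float(first_sum[1]);
    }
}

// Hypothetical host helper: launches one block of 64 threads (two warps).
cudaError_t run_first_sum_shared(float host_out[2])
{
    float* dev_out = nullptr;
    cudaError_t err = cudaMalloc(&dev_out, 2 * sizeof(float));
    if (err != cudaSuccess) return err;

    first_sum_shared<<<1, 64>>>(dev_out);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(host_out, dev_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_out);
    return err;
}
```

With __shared__, the write from threadIdx.x == 32 lands in the same array that threadIdx.x == 0 reads after __syncthreads(). Without it, each thread has its own private first_sum, and the barrier does not make one thread's local array visible to another.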