Some questions about thread synchronization

Part of the code is as follows. I call __syncthreads() near the bottom and then print the result when threadIdx.x == 0. At that point all threads in the block should have finished the computation, so first_sum[0] and first_sum[1] should both hold a value, but the printed output shows first_sum[1] as 0.

        int lane_id = threadIdx.x / 32;
        half first_sum[2];
        first_sum[lane_id] = *reinterpret_cast<half *>(&fragC[lane_id * 2 + lane_id]);
        uint32_t temp = fragC[lane_id * 2 + lane_id] >> 16;
        half second_sum = *reinterpret_cast<half *>(&temp);
        first_sum[lane_id] += __shfl_down_sync(0xffffffff, first_sum[lane_id], 9);
        first_sum[lane_id] += __shfl_down_sync(0xffffffff, first_sum[lane_id], 18);
        second_sum += __shfl_down_sync(0xffffffff, second_sum, 9);
        second_sum += __shfl_down_sync(0xffffffff, second_sum, 18);
        first_sum[lane_id] += __shfl_down_sync(0xffffffff, second_sum, 4);

        __syncthreads();
        if (threadIdx.x == 0 && blockIdx.x == 0 && blockIdx.y == 0 && ii == 0)
            printf("t_value : %.2f   %.2f \n", __half2float(first_sum[0]), __half2float(first_sum[1]));

lane_id is 0 for threadIdx.x == 0, so this thread only writes to first_sum[0]. first_sum[1] is uninitialized, and any read access to it is actually undefined behavior (UB) for the whole program.

Did you expect first_sum to be in shared memory?
Did you want to output first_sum[1] from threadIdx.x==32?
Did you want to write to first_sum[1 - lane_id] at some point?

BTW a nicer variant of *reinterpret_cast<T*>(&i) is reinterpret_cast<T&>(i). Both are potentially UB, but typically work in CUDA code. Since C++20 they should typically be replaced with std::bit_cast<T>(i), which returns a value rather than a reference and requires T and i to have the same size.
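For the specific case in the question (pulling two half values out of one 32-bit word), a cast-free alternative is the __ushort_as_half intrinsic from cuda_fp16.h. The following is only a sketch: it assumes each 32-bit word packs two half values, as the question's code suggests, and the helper names low_half and high_half are hypothetical.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical helpers: reinterpret the lower / upper 16 bits of a packed
// 32-bit word as a half, without any pointer or reference casts.
__device__ half low_half(uint32_t packed)
{
    return __ushort_as_half(static_cast<unsigned short>(packed & 0xFFFFu));
}

__device__ half high_half(uint32_t packed)
{
    return __ushort_as_half(static_cast<unsigned short>(packed >> 16));
}

__global__ void unpack_demo()
{
    // 0x3C00 is 1.0 and 0x4000 is 2.0 in IEEE half precision.
    uint32_t packed = 0x40003C00u;
    printf("%.2f %.2f\n",
           __half2float(low_half(packed)),
           __half2float(high_half(packed)));   // prints "1.00 2.00"
}

int main()
{
    unpack_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Because the source is narrowed to 16 bits before the conversion, the sizes match, which is the same constraint std::bit_cast would impose.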

BTW the word lane typically refers to the thread number 0…31 within a CUDA warp. You use it differently: threadIdx.x / 32 is the index of the warp within the block. (Which can make sense in some domains/contexts.)
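To illustrate the usual naming, here is a minimal sketch assuming a one-dimensional block of 64 threads and a warp size of 32. The kernel name and the variable names lane and warp_id are only illustrative.

```cuda
#include <cstdio>

__global__ void show_indices()
{
    int lane    = threadIdx.x % 32;   // position within the warp: 0..31
    int warp_id = threadIdx.x / 32;   // which warp of the block: 0 or 1 here

    // Only the first thread of each warp prints, to keep the output short.
    // The order of the two lines is not guaranteed.
    if (lane == 0)
        printf("threadIdx.x=%d  warp_id=%d  lane=%d\n", threadIdx.x, warp_id, lane);
}

int main()
{
    show_indices<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that __shfl_down_sync only exchanges data between lanes of the same warp, so it cannot move a value from threadIdx.x == 32 to threadIdx.x == 0.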

When threadIdx.x == 0 and threadIdx.x == 32, first_sum[0] and first_sum[1] will be written. I used __syncthreads(), so when outputting the results, first_sum[0] and first_sum[1] should both have a value. Am I misunderstanding something?

You only print out when threadIdx.x is equal to zero. Your variable:

    half first_sum[2];

is a thread-local variable. That means that thread 0 has its own separate local array, and thread 1 has its own separate local array, etc.

Therefore, when you print out, you are printing out the values from the threadIdx.x == 0 array, and for threadIdx.x == 0, the value in first_sum[1] is never written.
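If the intent is for thread 0 to see a value produced by thread 32, one possible approach is to place the array in shared memory, which is shared by all threads of a block. The sketch below is not the original kernel: it assumes a block of 64 threads, uses a hypothetical kernel name, and substitutes a stand-in value for the real per-warp result.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical kernel: each of the two warps in a 64-thread block
// publishes one value through shared memory.
__global__ void publish_per_warp()
{
    __shared__ half first_sum[2];      // one copy per block, visible to all its threads

    int warp_id = threadIdx.x / 32;    // 0 or 1 for a 64-thread block
    int lane    = threadIdx.x % 32;

    // Stand-in for the real per-warp result (assumption for illustration).
    half my_value = __float2half(static_cast<float>(warp_id + 1));

    if (lane == 0)
        first_sum[warp_id] = my_value; // threads 0 and 32 each write one slot

    __syncthreads();                   // both writes are visible after the barrier

    if (threadIdx.x == 0)
        printf("t_value : %.2f   %.2f \n",
               __half2float(first_sum[0]), __half2float(first_sum[1]));
}

int main()
{
    publish_per_warp<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Here __syncthreads() does what the question expected, because the data it orders lives in memory that every thread of the block can actually read.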
