Hello,
I am looking at the code of CUB, specifically block_scan_raking.cuh.
From the following Upsweep function, it looks like the values in shared memory are copied into a thread's registers before the reduction is done. Is there a benefit to this, as opposed to reading each value from shared memory and adding it directly to a register accumulator?
/// Performs upsweep raking reduction, returning the aggregate
template <typename ScanOp>
__device__ __forceinline__ T Upsweep(
    ScanOp scan_op)
{
    T *smem_raking_ptr = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
    // Read data into registers
    CopySegment(cached_segment, smem_raking_ptr, Int2Type<0>());
    T raking_partial = cached_segment[0];
    return GuardedReduce(cached_segment, scan_op, raking_partial, Int2Type<1>());
}
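
To make the question concrete, here is a simplified sketch of the two variants I am comparing. It is not CUB's actual code: SEGMENT_LENGTH, DEVICE_INLINE, ReduceViaRegisters and ReduceFromSharedMemory are names I made up for illustration, the loops stand in for CopySegment and GuardedReduce, and I left out the guarding of out-of-range elements. The macro is only there so the sketch also compiles as plain C++ on the host.

```cpp
// Let the sketch compile both with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define DEVICE_INLINE __device__ __forceinline__
#else
#define DEVICE_INLINE inline
#endif

// Hypothetical segment length; in CUB this comes from the raking layout.
constexpr int SEGMENT_LENGTH = 4;

// Variant A (what Upsweep appears to do): copy the whole segment into a
// per-thread array first, then reduce that local copy.
template <typename T, typename ScanOp>
DEVICE_INLINE T ReduceViaRegisters(const T *smem_raking_ptr,
                                   ScanOp scan_op,
                                   T *cached_segment)
{
    for (int i = 0; i < SEGMENT_LENGTH; ++i)
        cached_segment[i] = smem_raking_ptr[i];   // copy step

    T raking_partial = cached_segment[0];
    for (int i = 1; i < SEGMENT_LENGTH; ++i)
        raking_partial = scan_op(raking_partial, cached_segment[i]);
    return raking_partial;
}

// Variant B (the alternative I am asking about): read each element from
// shared memory and fold it straight into a single accumulator.
template <typename T, typename ScanOp>
DEVICE_INLINE T ReduceFromSharedMemory(const T *smem_raking_ptr,
                                       ScanOp scan_op)
{
    T raking_partial = smem_raking_ptr[0];
    for (int i = 1; i < SEGMENT_LENGTH; ++i)
        raking_partial = scan_op(raking_partial, smem_raking_ptr[i]);
    return raking_partial;
}
```

Both variants return the same aggregate; the only difference is that variant A leaves a per-thread copy of the segment in cached_segment after the reduction.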