From the original post https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf, the author reduces the number of blocks and performs a reduction from global memory before using shared memory. With this optimization, the kernel performance improves significantly. Profiling with Nsight Compute shows that the L2 cache throughput and DRAM throughput also improve significantly. Can anyone help me figure this out? I do not believe that the performance improvement comes solely from the computation saving.
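For reference, here is a minimal sketch of the change I am asking about, along the lines of the "first add during global load" kernel (reduce3 in the CUDA samples). The bounds checks and exact names are my own additions, not the original code:

```
// Sketch (not the original code): each thread sums two elements from global
// memory while loading, then a block-wide sweep runs in shared memory.
__global__ void reduce3(const int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First add during global load: 2x the data per block, half the blocks.
    int x = (i < n) ? g_idata[i] : 0;
    if (i + blockDim.x < n) x += g_idata[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // Shared memory sweep (sequential addressing), unchanged from reduce2.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
// Launch example: reduce3<<<numBlocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
```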
There are at least 2 related reasons.
1. For memory-bound algorithms like this, it's desirable to get all global memory loads done as early as possible. This pattern promotes that (it loads 2x the data from global memory before beginning the sweep).
2. For a reduction of a large array like this, we can talk about 2 phases of the algorithm: the global load phase and the shared memory sweep phase. We know what the best case is for the global load phase: one global load per element to be reduced. But what is the best case for the shared sweep phase? We would like to minimize it, relative to the work done in the global load phase, as much as possible. The change you have depicted doesn't change the amount of work in the global load phase, but it cuts the amount of work done in the shared sweep phase in half, relative to the number of items loaded in the global load phase.
Item 2 above is actually best considered against other competing objectives. We also want to engage as much of the GPU as possible, and of course we want to promote coalesced global loads as much as possible. So there is a point below which reducing the shared sweep phase would negatively impact these other objectives. However, we can find a nearly "optimal" spot where each SM is full of shared sweep activity (i.e. threadblocks) and there is no extra work beyond that. To make that work, rather than loading one element per thread, or 2 elements per thread, we give each thread a loading loop that creates a per-thread partial sum. These partial sums are then reduced using the sweep phase. This general idea/extension is explored later in the series of reduction optimizations by Mark Harris that you linked.
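To make that concrete, here is a rough sketch of the loading-loop idea (my own illustration with hypothetical names, not the exact kernel from that series):

```
// Sketch: a fixed-size grid where each thread accumulates a private partial
// sum over a grid-stride loop (global load phase), then the block reduces
// those blockDim.x partials in shared memory (shared sweep phase).
__global__ void reduce_partial_sums(const float *g_idata, float *g_odata, unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;

    // Global load phase: still one coalesced load per element, regardless of grid size.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += g_idata[i];

    // Shared sweep phase: only one partial per thread remains to be reduced.
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
// Size the grid to roughly fill the GPU (e.g. a few blocks per SM) rather than n/blockDim.x blocks.
```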
Thank you so much for your detailed explanation. I fully agree with your second point that we can reduce the shared memory workload by doing it this way. However, I still have two questions that I hope you can give me some suggestions on.
1. As for your first point, I think the GPU does not use any out-of-order execution, which means that a memory operation will always stall the following operations in the same warp. What's the benefit of getting all global memory loads done as early as possible?
2. How should I understand the improvement in DRAM throughput and L2 cache throughput? Both approaches mentioned above use coalesced memory access, so I assume both have a good global memory access pattern. Is it because reduce2 launches more thread blocks, which cannot be executed in a single wave? Is it because reduce3 has less kernel launch overhead?
Thank you so much!
The goal is to schedule as much independent work as possible between the loads and any instructions that directly depend on those loads. In general, this increases latency tolerance. How much this helps is highly context dependent: sometimes the benefit is minimal, other times the impact is of moderate size. By GPU architecture design, most of the latency tolerance is supposed to come from hardware. The compiler must also consider trade-offs with register pressure: scheduling the loads early increases live ranges, which tends to increase register pressure.
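As an illustration only (a hypothetical helper, not code from the reduction series): unrolling the per-thread loading loop with independent accumulators is one way to get several loads in flight before any dependent add must complete, at the cost of more live registers.

```
// Sketch: four independent accumulators let the compiler issue four loads
// back-to-back before any of the dependent adds has to wait on them.
// The extra accumulators and addresses increase register pressure.
__device__ float partial_sum_unrolled(const float *in, unsigned int start,
                                      unsigned int stride, unsigned int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    unsigned int i = start;
    for (; i + 3 * stride < n; i += 4 * stride) {
        s0 += in[i];
        s1 += in[i + stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < n; i += stride)   // handle the remainder
        s0 += in[i];
    return (s0 + s1) + (s2 + s3);
}
```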
Thank you so much! Do you have any idea about the significant improvement in DRAM throughput and L2 cache throughput?
A throughput number represents a quantity divided by time. It is a rate measurement. If any of the statements I made have any actual bearing on the case, and as a result the code runs faster, then the same amount of data will be transferred in a shorter time (the time here, for the sake of argument, is the kernel duration). Same data in less time results in higher throughput.
The rate measurement provided by nsight compute here should not be viewed as some sort of “instantaneous” measurement. There is no meter inside the GPU that gives an instantaneous rate measurement, like the speedometer on an automobile. That is not how nsight compute works. It must calculate throughput for data transfer operations the way I described (bytes transferred divided by time over which the observation is made). Given that nsight compute generally reports metrics and data at the kernel (launch/duration) level, there is no reason to assume this measurement is any different.
Therefore we don’t need to posit an improvement in memory transfer efficiency or any other such notion in order to explain the increase in throughput here. We only need to observe that the kernel duration did in fact decrease for some reason, and therefore the calculation I described results in a higher observed throughput.
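For a concrete, made-up example: if a kernel moves 4 GB of data and its duration drops from 20 ms to 13 ms, the reported throughput rises from 4 GB / 20 ms = 200 GB/s to 4 GB / 13 ms ≈ 308 GB/s, even though nothing about how the bytes were moved has changed.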
If the kernel duration decreased for either of the two reasons I mentioned (which you did not dispute), then it's entirely logical that the measured throughputs should increase. We don't need to look for or propose some other mechanism, other than what is already evident in the calculation itself: bytes divided by time.
Thank you so much! If the time duration is measured at the kernel level, I fully understand this result. Thank you again!