From the original post https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf, the author reduces the number of blocks and performs a reduction from global memory before using shared memory. With this optimization, the kernel performance improves significantly. Profiling with Nsight Compute shows that the L2 cache throughput and DRAM throughput also improve significantly. Can anyone help me figure this out? I do not believe that the performance improvement comes solely from the computation saving.
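For reference, here is a minimal sketch of the change I am asking about, along the lines of the "first add during global load" kernel (reduce3 in the CUDA samples). The bounds checks and exact names are my own additions, not the original code:

```
// Sketch (not the original code): each thread sums two elements from global
// memory while loading, then a block-wide sweep runs in shared memory.
__global__ void reduce3(const int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First add during global load: 2x the data per block, half the blocks.
    int x = (i < n) ? g_idata[i] : 0;
    if (i + blockDim.x < n) x += g_idata[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // Shared memory sweep (sequential addressing), unchanged from reduce2.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
// Launch example: reduce3<<<numBlocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
```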
There are at least 2 related reasons.
1. For memory-bound algorithms like this, it's desirable to get all global memory loads done as early as possible. This pattern promotes that (it loads 2x the data from global memory before beginning the sweep).
2. For a reduction of a large array like this, we can talk about 2 phases of the algorithm: the global load phase and the shared memory sweep phase. We know what the best case is for the global load phase: one global load per element to be reduced. But what is the best case for the shared sweep phase? We would like to minimize it, relative to the work done in the global load phase, as much as possible. The change you have depicted doesn't change the amount of work in the global load phase, but it cuts the amount of work done in the shared sweep phase in half, relative to the number of items loaded in the global load phase.
Item 2 above is actually best considered against other competing objectives. We also want to engage as much of the GPU as possible, and of course we want to promote coalesced global loads as much as possible. So there is a point below which reducing the shared sweep phase would negatively impact these other objectives. However, we can find a nearly "optimal" spot where each SM is full of shared sweep activity (i.e. threadblocks) and there is no extra work beyond that. To make that work, rather than loading one element per thread, or 2 elements per thread, we give each thread a loading loop that creates a per-thread partial sum. These partial sums are then reduced using the sweep phase. This general idea/extension is explored later in the series of reduction optimizations by Mark Harris that you linked.
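To make that concrete, here is a rough sketch of the loading-loop idea (my own illustration with hypothetical names, not the exact kernel from that series):

```
// Sketch: a fixed-size grid where each thread accumulates a private partial
// sum over a grid-stride loop (global load phase), then the block reduces
// those blockDim.x partials in shared memory (shared sweep phase).
__global__ void reduce_partial_sums(const float *g_idata, float *g_odata, unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;

    // Global load phase: still one coalesced load per element, regardless of grid size.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += g_idata[i];

    // Shared sweep phase: only one partial per thread remains to be reduced.
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
// Size the grid to roughly fill the GPU (e.g. a few blocks per SM) rather than n/blockDim.x blocks.
```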
Thank you so much for your detailed explanation. I fully agree with your second point that we can reduce the shared memory workload by doing it this way. However, I still have two questions that I hope you can give me some suggestions on.
1. As for your first point, I think the GPU does not use any out-of-order execution, which means that a memory operation will always stall the following operations in the same warp. What's the benefit of getting all global memory loads done as early as possible?
2. How should I understand the improvement in DRAM throughput and L2 cache throughput? Both approaches mentioned above use coalesced memory access, so I assume both have a good global memory access pattern. Is it because reduce2 launches more thread blocks, which cannot be executed in a single wave? Is it because reduce3 has less kernel launch overhead?
Thank you so much!
The goal is to schedule as much independent work as possible between the loads and any instructions that directly depend on those loads. In general, this increases latency tolerance. How much this helps is highly context dependent: sometimes the benefit is minimal, other times the impact is of moderate size. By GPU architecture design, most of the latency tolerance is supposed to come from hardware. The compiler must also consider trade-offs with register pressure: scheduling the loads early increases live ranges, which tends to increase register pressure.
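As an illustration only (a hypothetical helper, not code from the reduction series): unrolling the per-thread loading loop with independent accumulators is one way to get several loads in flight before any dependent add must complete, at the cost of more live registers.

```
// Sketch: four independent accumulators let the compiler issue four loads
// back-to-back before any of the dependent adds has to wait on them.
// The extra accumulators and addresses increase register pressure.
__device__ float partial_sum_unrolled(const float *in, unsigned int start,
                                      unsigned int stride, unsigned int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    unsigned int i = start;
    for (; i + 3 * stride < n; i += 4 * stride) {
        s0 += in[i];
        s1 += in[i + stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < n; i += stride)   // handle the remainder
        s0 += in[i];
    return (s0 + s1) + (s2 + s3);
}
```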
Thank you so much! Do you have any idea about the significant improvement in DRAM throughput and L2 cache throughput?
A throughput number represents a quantity divided by time. It is a rate measurement. If any of the statements I made have any actual bearing on the case, and as a result the code runs faster, then the same amount of data will be transferred in a shorter time (the time here, for the sake of argument, is the kernel duration). Same data in less time results in higher throughput.
The rate measurement provided by nsight compute here should not be viewed as some sort of “instantaneous” measurement. There is no meter inside the GPU that gives an instantaneous rate measurement, like the speedometer on an automobile. That is not how nsight compute works. It must calculate throughput for data transfer operations the way I described (bytes transferred divided by time over which the observation is made). Given that nsight compute generally reports metrics and data at the kernel (launch/duration) level, there is no reason to assume this measurement is any different.
Therefore we don’t need to posit an improvement in memory transfer efficiency or any other such notion in order to explain the increase in throughput here. We only need to observe that the kernel duration did in fact decrease for some reason, and therefore the calculation I described results in a higher observed throughput.
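For a concrete, made-up example: if a kernel moves 4 GB of data and its duration drops from 20 ms to 13 ms, the reported throughput rises from 4 GB / 20 ms = 200 GB/s to 4 GB / 13 ms ≈ 308 GB/s, even though nothing about how the bytes were moved has changed.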
If the kernel duration decreased for either of the two reasons I mentioned (which you did not dispute), then it's entirely logical that the measured throughputs should increase. We don't need to look for or propose some other mechanism, other than what is already evident in the calculation itself: bytes divided by time.
Thank you so much! If the time duration is measured at the kernel level, I fully understand this result. Thank you again!