Advice for multiple variable reduction and understanding NVVP results

Hi folks, for the following example,

float s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0;
#pragma acc parallel loop independent reduction(+:s1,s2,s3,s4,s5,s6)
for (int i = 0; i < N; ++i) {
  s1 += SIMPLE_CODE;
  s2 += SIMPLE_CODE;
  s3 += SIMPLE_CODE;
  s4 += SIMPLE_CODE;
  s5 += SIMPLE_CODE;
  s6 += SIMPLE_CODE;
}

I noticed from NVVP that there were six async memset calls before the reduction and six Memcpy DtoH calls after the reduction. The “size” property for each call is 4 B. The “gaps” between these calls are much longer than these calls themselves.

I understand this can never be the bottleneck for my project, but I think it is a great example for correcting any concepts I may have misunderstood.

First question: I am not even sure I was focusing on the correct duration. For this simple example (one card, one context), NVVP reports times and durations on the “OpenACC”, “Runtime API”, “Driver API”, “Compute”, “Stream”, etc. rows. The durations on “Compute” and “Stream” matched, so I figured those should be the “actual” time spent in calculation or the “actual” time the GPU was waiting. Please correct me if I am wrong.

The second question is, is there any way to minimize the “gaps” between these calls? I am curious if there was anything I missed here.

Thanks for the help.

Hi stw,

I’m not quite grasping your first question. The “Compute” row details the actual work done by the GPU. The blank (white) sections on the “Compute” timeline generally mean that the GPU isn’t doing anything during that span of time.

Because this is a scalar reduction, there’s no HtoD memory transfer, but the results are copied back (so you see six little slivers on the DtoH timeline). The “Driver API” and “Runtime API” rows detail the time spent in all the extra overhead of handling the GPU. Notice that for this example, far more time is spent in the pre- and post-compute phases handling GPU initialization than in the computation itself. There’s a certain “spin-up” cost to using the GPU, and this cost is amortized over time with larger and larger workloads.

For the second question, I don’t think there is a clear way to minimize the gaps in the runtime and driver API. In the general sense, the best approach is to minimize data transfers between the GPU and CPU and to batch them as much as possible. There’s a cost to initiating a data transfer each time, and it adds up if you liberally move data back and forth. Async calls are also a possibility for moving memory around while keeping the GPU compute busy.

Thanks for your reply, aglobus.

(To clarify my first question) In the screenshot, a “memset” call on the “Compute” row was highlighted, along with two more blocks (a brown block on the “Driver API” row and a blue block on the “OpenACC” row). These three blocks may report different durations in the right-hand property panel, and I assume the same happens for other kernels in the project. Your reply mostly answered my question: I should mainly use the duration from the “Compute” row for benchmarking and optimization, while the durations reported by “Driver API” and “OpenACC” are useful for measuring overhead.

Since I know nothing about compilers, I was told to trust them because they are almost always smarter than me. For the second question, I had assumed there might be a clear way to tell the compiler to batch the data initialization (six “memset” calls) and the transfer (six “DtoH” calls), at least for this well-defined problem.

Thank you again.