Understanding overhead from "libacchost::check_present" visible in Nsight Systems trace

Hello,

I am attempting to optimize a portion of my code that launches many small device loops. The work can be logically separated into independent queues and thus seems like a good candidate for asynchronous execution. All data is fully present on the device and there is no data transfer over PCIe, only kernel launches.

All of the targeted loops are quite simple, with references to the necessary data listed in a present clause as shown below. Async execution is specified with the async clause, with an acc wait occurring later after multiple kernels have been launched.

#pragma acc parallel loop independent if (target == niopen::LOCATION::DEVICE) \
present(dens, totalenergy, ener, pres, temp, sspd, enth, uvel, vvel, wvel) async(asyncQueue)
for (T_Index i = 0; i < N; ++i)
{
    <thread safe math>
}
[-Minfo output]
nifr::IdealGasArraySolutionConverter::calculation_CompletePrimSolution(PrimitiveVariables &, int) const:
    129, Generating present(totalenergy[:1],temp[:1],dens[:1],enth[:1],sspd[:1],pres[:1],ener[:1],uvel[:],wvel[:],vvel[:])
         Generating Tesla code
        133, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Tracing this code I see that the device kernels are correctly assigned to separate queues. There is no obvious asynchronous execution, but perhaps I need to try with smaller loop sizes to trigger true async launches. However, I would at least expect to see some overlap of the kernel launch and compute for multiple kernels piled up in the same queue.

While digging into the trace data I noticed that many of the CPU samples have a similar stack at libacchost::check_present, as shown in the attached screenshot. Am I correct to assume that this is time spent verifying that data in the present clause is actually present on the device? This leads me to my questions:

  1. Is it possible the time spent in check_present is getting in the way of true asynchronous execution?
  2. Is it possible to fully bypass the present check? I am confident that data I specify in the present clause are indeed present, and in the case of failure I would be ok with a generic CUDA error rather than the more verbose OpenACC present check error if it yields better performance.
  3. Can I influence this behavior with a default(present) clause or perhaps by passing in device pointers (deviceptr clause) instead of host pointers?

Thanks for your insight,

-David
[Attachment: nsys_check_present]

I forgot to add my system information:

System: CUDA 11.3, driver version 465.19.01, Linux CentOS 7.
GPU: GV100, PCIe
Compiler: NVIDIA HPC SDK 21.2

Hi David,

  • Is it possible the time spent in check_present is getting in the way of true asynchronous execution?

There's probably a slight bit of overhead, but the issue is more likely due to kernel launch overhead. The exact amount of overhead will vary, but ~10-40us is not uncommon. The gaps look to be just above 10us, so they correspond with the kernel launch overhead.

  • Is it possible to fully bypass the present check? I am confident that data I specify in the present clause are indeed present, and in the case of failure I would be ok with a generic CUDA error rather than the more verbose OpenACC present check error if it yields better performance.

Are the routines all within the same scoping unit? If so, you can create a structured data region spanning multiple compute regions. That way the present lookup only needs to be done once for everything within the region.
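As a minimal sketch of that idea (the array names and loop bodies below are illustrative, not from David's code), a structured data region wrapping several compute regions means the present lookup happens once for the whole region rather than once per kernel launch:

```cpp
#include <vector>

// Illustrative sketch: one structured data region covering multiple
// compute regions, so the present lookup is amortized over all of them.
void update_all(std::vector<double>& dens, std::vector<double>& pres, int N)
{
    double* d = dens.data();
    double* p = pres.data();

    // One present lookup here...
    #pragma acc data present(d[0:N], p[0:N])
    {
        // ...covers every compute region inside the braces.
        #pragma acc parallel loop async(1)
        for (int i = 0; i < N; ++i) d[i] *= 2.0;

        #pragma acc parallel loop async(2)
        for (int i = 0; i < N; ++i) p[i] += 1.0;

        #pragma acc wait
    }
}
```

Without OpenACC enabled the pragmas are ignored and the loops simply run on the host, which makes the pattern easy to test incrementally.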

  • Can I influence this behavior with a default(present) clause or perhaps by passing in device pointers (deviceptr clause) instead of host pointers?

“default(present)” wouldn’t help since that’s no different than listing all the pointers in a present clause yourself. Using deviceptr might help since the compiler would no longer need to do the host-to-device address translation. It may be more work on your part in having to manually manage device pointers, but if you use a higher-level “host_data” region and make sure you don’t access the (now device) pointers on the host within that region, it may not be too difficult.
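A hedged sketch of that host_data/deviceptr combination (the function and variable names here are hypothetical): the compute region receives a device address directly via deviceptr, so no present lookup or address translation is needed at launch time.

```cpp
// Takes a device address; no present lookup needed at launch.
void halve_on_device(double* d_dens, int N, int queue)
{
    #pragma acc parallel loop deviceptr(d_dens) async(queue)
    for (int i = 0; i < N; ++i)
        d_dens[i] *= 0.5;
}

// Wrapper: inside host_data, 'dens' names the device copy of the array.
void halve(double* dens, int N, int queue)
{
    #pragma acc host_data use_device(dens)
    {
        halve_on_device(dens, N, queue);
    }
}
```

The host_data region keeps the device-pointer handling localized, so the rest of the host code never sees a device address.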

but perhaps I need to try with smaller loop sizes to trigger true async launches.

I’d say the opposite: the kernels themselves are only running for ~8us. In order to hide the launch latency, the kernels need to run at least as long as the latency. I’d try giving each loop more work so it gets into the 20-40us range, or maybe even longer, if that also reduces the number of necessary kernel launches.
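One way to give each launch more work (a hypothetical illustration; the field names and arithmetic are made up) is to fuse several formerly separate tiny kernels into a single loop, so one launch carries the combined runtime:

```cpp
// Illustrative fusion: three formerly separate small kernels collapsed
// into one launch so the kernel runtime can exceed the launch latency.
void update_fused(double* dens, double* pres, double* temp, int N, int q)
{
    #pragma acc parallel loop present(dens[0:N], pres[0:N], temp[0:N]) async(q)
    for (int i = 0; i < N; ++i)
    {
        dens[i] *= 2.0;               // was kernel 1
        pres[i] += dens[i];           // was kernel 2
        temp[i]  = pres[i] / dens[i]; // was kernel 3
    }
}
```

Fusing also cuts the number of launches, so both the per-launch overhead and the total overhead shrink.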

Now, you don’t actually need separate queues (streams) for this since the asynchronicity here is between the CPU and GPU control flows. Streams are used for asynchronous execution between GPU control flows, such as overlapping data transfer with compute. Most kernels fully utilize the GPU, so it’s rare to see more than a tiny amount of compute overlap as one kernel begins to release resources. Only if the kernels use a limited number of gangs (blocks) can there be any significant concurrent execution on the device.

If you don’t mind introducing CUDA to your code, another option is use CUDA graphs: Getting Started with CUDA Graphs | NVIDIA Technical Blog

Note that we now have some limited support for CUDA directly in nvc++, so you shouldn’t need to use nvcc if you add CUDA graphs.

Hope this helps,
Mat

Hi Mat,

Thanks for the quick response.

Unfortunately this project is a big C++ program with the routines scattered throughout different scopes. I have used this structured data region trick in some other parts of the code with success, so indeed it is a good idea.

Using the deviceptr directly is something I could easily implement and test. Seems like it may be worth a shot.

I think the long term solution is to reorganize this code so it can operate with fewer loops / kernels. I just need to convince myself to take this big jump…

Thanks again for your input, for posterity I’ll let you know how it turns out.

-David

Hi Mat,

A quick followup to my last post. I have some evidence that my original hunch about the check_present may have been relevant. In our code we have a data container class that stores all of the fields in a single contiguous array. Inside this class we have member “views” that are aliased to a subarray within the contiguous array. The full class is offloaded to the device (this, contiguous array, and the views).

In the code snippet from Tuesday I was operating on the views, all of which were in the present clause (temp, dens, pres, etc). I modified the code to access each view indirectly with accessor methods in the container class. Thus, only one entry (the container) is needed in the present clause. The underlying storage is the same, and the actual work being done is the same. This modified implementation is shown in the snippet below:

#pragma acc parallel loop independent if (target == niopen::LOCATION::DEVICE) \
present(prim) async(asyncQueue)
for (T_Index i = 0; i < N; ++i)
{
    <thread safe math>
}
[-Minfo output]
nifr::IdealGasArraySolutionConverter::calculation_CompletePrimSolution(PrimitiveVariables &, int) const:
    127, Generating present(prim[:1])
         Generating Tesla code
        131, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
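A sketch of the container pattern David describes (the class and member names below are hypothetical, not his actual code): one contiguous backing array with views exposed through accessor methods, so only the container itself needs to appear in the present clause.

```cpp
#include <vector>

// Hypothetical container: one contiguous (device-resident) array, with
// views reached through accessors instead of separate present entries.
struct PrimContainer
{
    double* storage;  // single contiguous backing array
    int     N;

    #pragma acc routine seq
    double* dens() const { return storage; }      // view: first N entries
    #pragma acc routine seq
    double* pres() const { return storage + N; }  // view: next N entries
};

void compute(const PrimContainer& prim, int queue)
{
    // Only 'prim' needs a present lookup; the views are reached
    // through its device copy.
    #pragma acc parallel loop present(prim) async(queue)
    for (int i = 0; i < prim.N; ++i)
        prim.pres()[i] = prim.dens()[i] * 2.0;
}
```

The trade-off is one level of indirection inside the kernel in exchange for a single present lookup per launch, which is the effect David measured.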

The impact on performance is significant, with the overall runtime for this portion of our code reduced by 50%. A trace with Nsight confirms the behavior, with the small kernels piled up cleanly (I’m showing one queue in this screenshot, but it works well with multiple queues as well).

This kind of thing only seems significant for very small kernels, which we happen to have a lot of in this part of our code. If confirmed, I wouldn’t be surprised if the present check accounts for 5-10% of our total runtime. If you like, I could put together a small reproducer that could be shared with the compiler engineers. If there were an option for “release” code where all present checks are skipped, I think it would be beneficial.

Thanks,

-David

Sure, that would be great… I’m not sure how much improvement can be made, but it would be good for them to take a look.

-Mat

Sounds good. I’ll throw something together when I have a free afternoon.

-David