Hello,
I am attempting to optimize a portion of my code that launches many small device loops. The work can be logically separated into independent queues and thus seems like a good candidate for asynchronous execution. All data is fully present on the device and there is no data transfer over PCIe, only kernel launches.
All of the targeted loops are quite simple, with references to the necessary data listed in a present clause as shown below. Async execution is specified with the async clause, with an acc wait occurring later after multiple kernels have been launched.
#pragma acc parallel loop independent if (target == niopen::LOCATION::DEVICE) \
present(dens, totalenergy, ener, pres, temp, sspd, enth, uvel, vvel, wvel) async(asyncQueue)
for (T_Index i = 0; i < N; ++i)
{
<thread safe math>
}
[-Minfo ouput]
nifr::IdealGasArraySolutionConverter::calculation_CompletePrimSolution(PrimitiveVariables &, int) const:
129, Generating present(totalenergy[:1],temp[:1],dens[:1],enth[:1],sspd[:1],pres[:1],ener[:1],uvel[:],wvel[:],vvel[:])
Generating Tesla code
133, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Tracing this code I see that the device kernels are correctly assigned to separate queues. There is no obvious asynchronous execution, but perhaps I need to try with smaller loop sizes to trigger true async launches. However, I would at least expect to see some overlap of the kernel launch and compute for multiple kernels piled up in the same queue.
While digging into the trace data I noticed that many of the CPU samples have a similar stack at libacchost::check_present, as shown in the attached screenshot. Am I correct to assume that this is time spent verifying that data in the present clause is actually present on the device? This leads me to my questions:
- Is it possible the time spent in check_present is getting in the way of true asynchronous execution?
- Is it possible to fully bypass the present check? I am confident that data I specify in the present clause are indeed present, and in the case of failure I would be ok with a generic CUDA error rather than the more verbose OpenACC present check error if it yields better performance.
- Can I influence this behavior with a default(present) clause or perhaps by passing in device pointers (deviceptr clause) instead of host pointers?
Thanks for you insight,
-David