Interpreting output generated by setting PGI_ACC_TIME=1

Hi,

When I set PGI_ACC_TIME=1 in my MPI OpenACC application, I get a dump like this:

time(us): 44,660,964
22: compute region reached 148500 times
    22: data copyin transfers: 148500
         device time(us): total=16,976,751 max=1,409 min=4 avg=114
    22: kernel launched 148500 times
         grid: [92] block: [256]
         device time(us): total=21,528,393 max=152 min=144 avg=144
         elapsed time(us): total=90,369,207 max=67,390 min=168 avg=608
    22: data copyout transfers: 148500
         device time(us): total=3,537,741 max=1,012 min=12 avg=23
22: data region reached 297000 times
    22: data copyin transfers: 148500
         device time(us): total=2,164,391 max=1,038 min=5 avg=14

The parallel region is
22: #pragma acc parallel loop present(…) copyin(…)

Could you please let me know what the data copyin and data copyout dumped within the compute region are (shown in bold font above)? I do not do any copyout in the parallel region, and any copyin I do in the parallel construct is, I believe, shown under the data region's data copyin (the last three lines of the dump).

thanks,
Naga

Hi Naga,

What do the compiler feedback messages (-Minfo=accel) show?

My guess is that the compiler is having to implicitly copy in/out some data. Look for the message “implicit copy” in the Minfo messages.

If you're still not sure where to look, please post the output from the compilation with -Minfo. Also, if you can, please post the code, since it may help determine where the extra copies are coming from.
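
For example, something like the following (a minimal hypothetical sketch, not your code) would make the compiler generate an implicit copy of "a", and that transfer then shows up under the compute region in the PGI_ACC_TIME output:

#include <cstdio>

int main()
{
    static double a[1000];
    double s = 0.0;
    for (int i = 0; i < 1000; ++i) a[i] = i;

    // "a" appears in the compute region with no data clause, so the
    // compiler has to move it to the device implicitly; -Minfo=accel
    // reports this decision.
    #pragma acc parallel loop reduction(+:s)
    for (int i = 0; i < 1000; ++i)
        s += a[i] * a[i];

    printf("s = %f\n", s);
    return 0;
}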

-Mat

Hi Mat,

This is the compiler output:
PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:],eneGrid[:][:])
         Generating copyin(coords[:totDockAtoms])
     22, Accelerator kernel generated
         Generating Tesla code
         22, Generating reduction(+:ene)
         24, #pragma acc loop gang /* blockIdx.x */
         31, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minDis)
         45, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minIdx)
     31, Loop is parallelizable
     45, Loop is parallelizable

There is a copyin and no implicit copies. There is a nested reduction; is that the reason?

Code snippet below:
#pragma acc parallel loop vector_length(256) reduction(+:ene) present(…) copyin(…) firstprivate(…) private(minDis, minIdx)
for (unsigned atomCount = 0; atomCount < totDockAtoms; ++atomCount)
{
    minDis = 5000;
    minIdx = maxElements;

    #pragma acc loop reduction(min:minDis)
    for (unsigned counter = 0; counter < maxElements; ++counter)
    {
        float distance = …
        if (minDis > distance)
        {
            minDis = distance;
        }
    }

    #pragma acc loop reduction(min:minIdx)
    for (unsigned counter = 0; counter < maxElements; ++counter)
    {
        float distance = …
        if ((float)distance == (float)minDis)
        {
            if (minIdx > counter) minIdx = counter;
        }
    }

    ene += eneGrid[minIdx][atomCount];
}

Hi Naga,

The copyin of “coords” is what’s causing the copyin time, and the reduction of “ene” is causing the copyout, since “ene” needs to be copied back to the host. You can prevent the copyout of ene by putting ene into an OpenACC data region, but of course you won’t be able to access it on the host unless you copy it back as part of the data region or use an “update self(ene)” directive.
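
If it helps, here is a minimal sketch of what I mean, using hypothetical names rather than your routine: the reduction scalar lives in an enclosing data region, so the compute region no longer copies it back on every launch, and the host fetches it explicitly:

#include <cstdio>

int main()
{
    const int n = 1000;
    double ene = 0.0;

    #pragma acc data copy(ene)   // device copy of ene exists for the whole region
    {
        #pragma acc parallel loop reduction(+:ene)
        for (int i = 0; i < n; ++i)
            ene += 1.0;

        // The reduction result stays on the device; bring it back only
        // when the host actually needs it (or rely on the copy-back at
        // the end of the data region).
        #pragma acc update self(ene)
        printf("ene = %f\n", ene);
    }
    return 0;
}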

-Mat

Hi Mat,

That's what I thought too initially, but please revisit this dump:
22: compute region reached 148500 times
    22: data copyin transfers: 148500
         device time(us): total=16,976,751 max=1,409 min=4 avg=114
    22: kernel launched 148500 times
         grid: [92] block: [256]
         device time(us): total=21,528,393 max=152 min=144 avg=144
         elapsed time(us): total=90,369,207 max=67,390 min=168 avg=608
    22: data copyout transfers: 148500
         device time(us): total=3,537,741 max=1,012 min=12 avg=23
22: data region reached 297000 times
    22: data copyin transfers: 148500
         device time(us): total=2,164,391 max=1,038 min=5 avg=14

data copyin transfers appears twice above: once within the compute region and once within the associated data region (shown in bold). When I do not do the copyin of coords, I get the following dump, where the last "data copyin transfers" line no longer appears, yet the copyin within the compute region is still there (shown in bold below).

_ZN3PDB16EnergyCalculatorERPPfRKSt6vectorI6pointsSaIS4_EES8_PS4_RKjR6energyi NVIDIA devicenum=0
time(us): 58,689,553
22: compute region reached 148500 times
    22: data copyin transfers: 148500
         device time(us): total=32,299,442 max=1,095 min=6 avg=217
    22: kernel launched 148500 times
         grid: [92] block: [256]
         device time(us): total=21,394,327 max=153 min=144 avg=144
         elapsed time(us): total=104,203,523 max=69,333 min=175 avg=701
    22: reduction kernel launched 148500 times
         grid: [1] block: [256]
         device time(us): total=481,142 max=17 min=3 avg=3
         elapsed time(us): total=46,019,386 max=70,315 min=28 avg=309
    22: data copyout transfers: 148500
         device time(us): total=4,514,642 max=1,152 min=13 avg=30
22: data region reached 297000 times

So, I am not too sure if it is the copyin of coords.

I agree with you regarding the copyout being there due to the reduction.

regards,
Naga

Is the variable in the firstprivate clause an array? If so, that might account for it, since the initial value from the host needs to be copied in in order to set the value of the private copies. Given the high overhead cost of each of your kernels (the difference between elapsed and device time), this may be it, since allocating and setting firstprivate arrays can be costly.

If that’s not it, I’m not sure and would need a reproducing example to tell. You can try reviewing the output from PGI_ACC_DEBUG, but I’d highly recommend stopping the program after a few calls to the kernel: it can be quite verbose, printing every OpenACC runtime call made.

Also, you might be able to save some overhead by not privatizing scalars. By default, scalars are passed as arguments to the kernel. By putting them in a “private” clause, you instead create an array of scalars in global memory, and a pointer to that global array is passed in. That extra overhead can add up when calling the kernel ~150,000 times.
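
As a small standalone illustration (again hypothetical code, not yours): declaring the work scalar inside the loop body makes it private by ordinary scoping rules, so no “private” clause, and no backing array in global memory, is needed:

#include <cstdio>

int main()
{
    const int n = 1024;
    double total = 0.0;

    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
    {
        // Loop-local scalar: each iteration gets its own copy without a
        // "private" clause or a global array behind it.
        double scratch = i * 0.5;
        total += scratch;
    }

    printf("total = %f\n", total);
    return 0;
}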

Hi Mat,

I extracted the region of code that was causing this behavior and profiled it with nvprof, with the API trace, GPU trace, and OpenACC trace switched on.

The trace is below:
938.05ms 28.053us - - - - - - - - - Tesla P100-PCIE 1 14 acc_enqueue_launch@test.cpp:62 (_Z15testfunc_62_gpuPPfP6pointsiiS2_)
938.05ms 24.699us - - - - - - - - - - - - cuLaunchKernel (testfunc_62_gpu(float**, points*, int, int, points*) [73226])
938.08ms 12.511us - - - - - - - - - Tesla P100-PCIE 1 14 acc_enqueue_launch@test.cpp:62 (_Z20testfunc_62_gpu__redPPfP6pointsiiS2_)
938.08ms 108.26us (92 1 1) (256 1 1) 30 32B 2.0000KB - - - - Tesla P100-PCIE 1 14 testfunc_62_gpu(float**, points*, int, int, points*) [73226]
938.08ms 10.594us - - - - - - - - - - - - cuLaunchKernel (testfunc_62_gpu__red(float**, points*, int, int, points*) [73227])
938.10ms 106.34us - - - - - - - - - Tesla P100-PCIE 1 14 acc_enqueue_download@test.cpp:62
938.10ms 104.82us - - - - - - - - - - - - cuMemcpyDtoHAsync
938.19ms 3.3600us (1 1 1) (256 1 1) 12 0B 2.0000KB - - - - Tesla P100-PCIE 1 14 testfunc_62_gpu__red(float**, points*, int, int, points*) [73227]
938.19ms 1.0560us - - - - - 8B 7.2248MB/s Device Pageable Tesla P100-PCIE 1 14 [CUDA memcpy DtoH]

The cuMemcpyDtoHAsync() (I believe this is an implicit call to copy the reduced value back to the host) takes ~100 usec. Any idea why the cuMemcpyDtoHAsync() is so expensive, when the actual 8-byte transfer (the CUDA memcpy DtoH line) is only ~1 usec?


The relevant code is below:
double testfunc(float **egrid, points_t *coords1, int ncoords1, int natoms, points_t *coords2)
{
    float minDis;
    unsigned int minIdx;
    double ene = 0.0;

    #pragma acc parallel loop vector_length(256) reduction(+:ene) present(egrid,coords1) copyin(coords2[0:natoms]) firstprivate(natoms, ncoords1) private(minDis, minIdx)
    for (unsigned atomCount = 0; atomCount < natoms; ++atomCount)
    {
        minDis = 5000000;
        minIdx = ncoords1;

        #pragma acc loop reduction(min:minDis)
        for (unsigned counter = 0; counter < ncoords1; ++counter)
        {
            float distance = sqrtf((coords2[atomCount].xco - coords1[counter].xco) * (coords2[atomCount].xco - coords1[counter].xco) +
                                   (coords2[atomCount].yco - coords1[counter].yco) * (coords2[atomCount].yco - coords1[counter].yco) +
                                   (coords2[atomCount].zco - coords1[counter].zco) * (coords2[atomCount].zco - coords1[counter].zco));
            if (minDis > distance)
            {
                minDis = distance;
            }
        }

        #pragma acc loop reduction(min:minIdx)
        for (unsigned counter = 0; counter < ncoords1; ++counter)
        {
            float distance = sqrtf((coords2[atomCount].xco - coords1[counter].xco) * (coords2[atomCount].xco - coords1[counter].xco) +
                                   (coords2[atomCount].yco - coords1[counter].yco) * (coords2[atomCount].yco - coords1[counter].yco) +
                                   (coords2[atomCount].zco - coords1[counter].zco) * (coords2[atomCount].zco - coords1[counter].zco));
            if ((float)distance == (float)minDis)
            {
                if (minIdx > counter) minIdx = counter;
            }
        }

        ene += egrid[minIdx][atomCount];
    }
    return ene;
}

regards,
Naga

Any idea why the cuMemcpyDtoHAsync() is so expensive, when the actual 8-byte transfer (the CUDA memcpy DtoH line) is only ~1 usec?

There is host-side overhead associated with this call, so this time isn’t too unusual: things like the time to launch the call and the time to copy the data from pinned to virtual host memory.

Also keep in mind that it’s being run asynchronously, so some of its time is concurrent with other work. Though here it looks like it may be blocked on a CUDA stream waiting for the reduction kernel to finish, so some of the time is spent waiting.

-Mat