Performance of OpenACC data create

I’ve noticed that the data create(…) clause is very slow; its timing is nearly identical to data copyin(…), and in some cases it is even slower. I tested this separately on a GeForce 2000M and on a GeForce Titan, both under Windows 7 with PGI 13.6 (64-bit) using the CUDA 5 runtime.

I’m a bit surprised by this because, as I understand it, the create clause should just allocate memory on the GPU without any data copies (per page 17 of the OpenACC specification), whereas the copyin clause both allocates memory and copies data from the CPU to the GPU. The arrays I’m allocating are around 50-100MB each.
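For concreteness, the kind of region I’m talking about looks roughly like this (array names and sizes are made up, but each array falls in that 50-100MB range):

real, allocatable :: a(:), b(:)
allocate(a(20000000), b(20000000))   ! ~80 MB each as default reals

!$acc data create(a) copyin(b)
!   ... kernels that write a and read b ...
!$acc end data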

Why does the create clause take so long and is it possible to get around this somehow? Right now, this is the major bottleneck of my application.

Thanks.
~David

Hi David,

Create shouldn’t take longer than copyin, so I’d need an example to understand what’s going on. Is this code you can share, or can you create a reproducing example? If so, please either post it here or send it to PGI Customer Support (trs@pgroup.com) and ask them to forward it to me.

Thanks,
Mat

At some point, I’ll try to come up with a single example that I can send.

I dug into this a little further with the Nsight profiler/trace tool, and it looks like the create clause ultimately calls these two CUDA driver routines for each array:

cuMemAlloc_v2

and

cuMemHostRegister

cuMemAlloc_v2 is very fast, but cuMemHostRegister is taking over 30x longer than cuMemAlloc_v2.

So it appears this has to do with page-locked (pinned) memory? Since I am not copying the data to or from the CPU, the pinning seems unnecessary.

~David

Copyin calls

cuMemAlloc_v2
cuMemHostRegister
and
cuMemcpyHtoDAsync_v2

So copyin should take longer than create. It’s possible that the cases I saw where create took a bit longer than copyin were a profiler/timing artifact. In most cases the two are nearly identical, indicating that cuMemHostRegister takes up the vast majority of the time and overshadows the cost of cuMemAlloc and cuMemcpyHtoDAsync.
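To double-check outside the profiler, the two (otherwise empty) regions can be timed directly with standard Fortran timers. Here is a rough sketch (array name and size are made up, and the timings may be order-dependent if the runtime caches pinned registrations, so each variant is safest run in a separate executable):

real, allocatable :: a(:)
integer(kind=8) :: t0, t1, rate
allocate(a(25000000))                ! ~100 MB of default reals

call system_clock(t0, rate)
!$acc data create(a)
!$acc end data
call system_clock(t1)
print *, 'create: ', real(t1 - t0) / real(rate), ' s'

call system_clock(t0)
!$acc data copyin(a)
!$acc end data
call system_clock(t1)
print *, 'copyin: ', real(t1 - t0) / real(rate), ' s'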

Hi Dave,

Try setting the environment variable “PGI_ACC_SYNCHRONOUS=1”.

In the 13.x compilers we updated the runtime to use pinned memory by default (which in turn helps asynchronous data movement). However, in some cases it can slow down code because the NVIDIA routines that handle pinned memory can be quite a bit slower. Though, I mostly see the slowdown when freeing the memory, not when allocating it, hence I’m not positive it’s the same issue.

We are in the process of revamping how we handle asynchronous data movement, so if this is the same issue, hopefully we’ll have it improved shortly.

  • Mat

Thanks Mat,

I actually just finished a modification that works around this issue (not in the way I wanted, since I was hoping for a 100% OpenACC solution that avoids CUDA) by declaring the temporary variables via

real, device :: array( … )

and then using

!$acc data deviceptr(…)

This sped up the allocation by about a factor of 3.
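For anyone else who hits this, the full pattern looks roughly like the sketch below (names are illustrative; since it mixes CUDA Fortran with OpenACC, it needs -Mcuda in addition to -acc when compiling with pgfortran):

real, device, allocatable :: tmp(:)   ! lives directly in device memory
allocate(tmp(20000000))               ! device-side allocation, no host pinning

!$acc data deviceptr(tmp)
!$acc kernels
!   ... use tmp as device-resident scratch here ...
!$acc end kernels
!$acc end data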

~David

Is there a way to turn on this program-wide PGI_ACC_SYNCHRONOUS=1 behavior from within a Fortran program via some OpenACC routine, rather than having to set an environment variable externally? I apologize if this has an obvious answer; I have searched through the documentation and couldn’t find anything.
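The closest thing I’ve come up with on my own is calling the C library’s setenv() through ISO_C_BINDING before the first OpenACC construct, along the lines of the untested sketch below (I don’t know whether the PGI runtime reads the variable early enough for this to be reliable, and on Windows the C routine would be _putenv() rather than setenv()):

program main
  use iso_c_binding
  implicit none
  interface
    integer(c_int) function setenv(name, val, overwrite) bind(c, name="setenv")
      import :: c_char, c_int
      character(kind=c_char), dimension(*), intent(in) :: name, val
      integer(c_int), value :: overwrite
    end function
  end interface
  integer(c_int) :: rc
  ! Must run before the OpenACC runtime initializes, i.e. before the
  ! first data/compute construct or any acc_init call.
  rc = setenv('PGI_ACC_SYNCHRONOUS'//c_null_char, '1'//c_null_char, 1_c_int)
  ! ... rest of the program, OpenACC regions included, goes here ...
end program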

It looks like all of the inefficiency issues I am seeing could be fixed with the OpenACC device_resident clause. Is there a plan to implement this soon?

Hi David,

I’m not sure where we’re at on device_resident. I’m away at a conference and having challenges with email, so I can’t ask at the moment.

If you’re using Fortran, you can use the PGI Accelerator Model’s “!$acc mirror” directive which is essentially the same.
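As a rough sketch from memory (double-check the syntax against the PGI Accelerator reference), mirror is a declarative directive on an allocatable array, and the device copy is allocated and deallocated along with the host array:

real, allocatable :: tmp(:)
!$acc mirror(tmp)

! later: this allocate also creates the device copy
allocate(tmp(20000000))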

  • Mat