I’ve noticed that the data create(…) clause is very slow: it takes nearly as long as data copyin(…), and in some cases it is even slower. I tested this separately on a GeForce 2000M and on a GeForce Titan (Windows 7), with 64-bit PGI 13.6 and the CUDA 5 runtime.
I’m a bit surprised by this because, from what I understand, the create clause should only allocate memory on the GPU without copying any data (per page 17 of the OpenACC specification), whereas the copyin clause both allocates memory and copies data from the CPU to the GPU. The arrays I’m allocating are around 50-100 MB in size.
Why does the create clause take so long, and is there a way to get around this? Right now, this is the major bottleneck of my application.
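
For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not my actual application code. The function names (`time_create`, `time_copyin`, `compare_create_copyin`), the use of `float` arrays, the C11 `timespec_get` timer and the untimed warm-up call are all hypothetical choices made for the example. Each timing covers the whole data region, so it includes the deallocation at region exit as well as the allocation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds. Assumes C11 timespec_get is available;
   substitute another timer if the compiler does not provide it. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time an empty data region that only allocates n floats on the device. */
double time_create(float *a, size_t n)
{
    double t0 = wall_seconds();
    #pragma acc data create(a[0:n])
    {
        /* intentionally empty: allocation and deallocation only */
    }
    return wall_seconds() - t0;
}

/* Time an empty data region that allocates n floats and copies them in. */
double time_copyin(float *a, size_t n)
{
    double t0 = wall_seconds();
    #pragma acc data copyin(a[0:n])
    {
        /* intentionally empty: allocation, host-to-device copy, deallocation */
    }
    return wall_seconds() - t0;
}

/* Hypothetical driver: prints both timings for an array of `mb` megabytes.
   Returns 0 on success, -1 if the host allocation fails. */
int compare_create_copyin(size_t mb)
{
    size_t n = mb * 1024u * 1024u / sizeof(float);
    float *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1;

    for (size_t i = 0; i < n; ++i)
        a[i] = 1.0f;                /* touch the host pages before timing */

    /* Untimed warm-up (an assumption of this sketch): keeps any one-time
       device startup cost out of both measurements. */
    (void)time_create(a, 1);

    double t_create = time_create(a, n);
    double t_copyin = time_copyin(a, n);
    printf("%lu MB: create %.6f s, copyin %.6f s\n",
           (unsigned long)mb, t_create, t_copyin);

    free(a);
    return 0;
}
```

Calling `compare_create_copyin(100)` from `main` runs the comparison at the upper end of the array sizes mentioned above. If the file is built without OpenACC enabled, the pragmas are ignored and both timings only measure the empty regions on the host.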