GPU memory allocation and deallocation speed

Hi, all:

I’m using the PGI visual profiler to profile my application on a Tesla K40 GPU on Windows 10. My code is compiled with the PGI Fortran compiler 19.0 (Community Edition). Please see the attached screenshot from the PGI visual profiler. To my surprise, GPU memory allocation and deallocation take about one third of the application’s total running time. I have several questions:

  1. In general, how fast is GPU memory allocation and deallocation?
  2. Are all dynamically allocated variables automatically initialized to zero?
  3. If memory allocation and deallocation is slow, how can I overlap these operations with other computing operations?
  4. Strangely, I don’t deallocate any memory in the beginning part of my application, so why are there still two long cudaFree calls?

Thanks in advance.

John

Hi John,

In general, how fast is GPU memory allocation and deallocation?

In general, it shouldn’t take a significant amount of time, at least no longer than calling malloc. That said, it’s been many years since I used a K40, so I don’t remember offhand whether the long cudaFree time should be expected or not.

Are all dynamically allocated variables automatically initialized to zero?

No. Though if you’re using OpenACC, you can set the environment variable “PGI_ACC_FILL=1” and the compiler will initialize allocated variables to zero, or to whatever value the user defines with “PGI_ACC_FILL_VALUE=”.

If memory allocation and deallocation is slow, how can I overlap these operations with other computing operations?

cudaMalloc/cudaFree are blocking calls, though memcpy calls do have asynchronous versions.
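
As a rough illustration of the asynchronous copy mentioned above, here is a minimal CUDA Fortran sketch. The program name, array names, and size are made up for illustration. It assumes the host array is declared with the pinned attribute, since copies from pageable host memory generally don't overlap with host work.

```fortran
program async_copy_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1048576          ! hypothetical array size
  real, pinned, allocatable :: hA(:)         ! pinned (page-locked) host array
  real, device, allocatable :: dA(:)         ! device array
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(hA(n))                            ! host allocation
  allocate(dA(n))                            ! device allocation (blocking)
  hA = 1.0

  istat = cudaStreamCreate(stream)

  ! Queue the host-to-device copy on the stream; count is in elements.
  ! The call returns to the host without waiting for the copy to finish.
  istat = cudaMemcpyAsync(dA, hA, n, stream)

  ! ... independent host work could go here while the copy is in flight ...

  ! Wait for the copy to complete before hA is modified or dA is used
  ! from another stream.
  istat = cudaStreamSynchronize(stream)

  istat = cudaStreamDestroy(stream)
  deallocate(dA)
  deallocate(hA)
end program async_copy_sketch
```

The allocate and deallocate statements here still block; only the copy itself is issued asynchronously.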

Strangely, I don’t deallocate any memory in the beginning part of my application, so why are there still two long cudaFree calls?

Agreed, it is strange. Without further details about your code, though, I can’t really offer any ideas as to why they’re there. Do you have a reproducing example?

Hi, Mat:

Thanks for your prompt reply. I allocated all arrays with the CUDA Fortran allocate statement, not with the OpenACC create clause. Does PGI_ACC_FILL=1 work in this case? If not, what could be set to automatically initialize the arrays to zero?

It’s too bad that cudaMalloc and cudaFree are blocking calls. That means I have to find creative ways to hide or minimize the GPU memory allocation and deallocation time.

Another weird thing I noticed is that deallocating all the arrays at the end of my application doesn’t take any significant time.

Thanks,

John

No, but it’s easy to use array syntax to initialize the array just after you allocate, i.e. “dArr=0.0”. I would assert that explicitly initializing arrays is a better programming practice, rather than relying on compiler extensions and features. Things like “PGI_ACC_FILL” are really only useful for debugging anyway.
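
A minimal sketch of that pattern in CUDA Fortran might look like the following. The program name, the array size, the step count, and the host array hArr are all hypothetical. The allocation is done once, outside the loop, so the blocking allocate and deallocate each happen a single time, and the array is reset with array syntax wherever a clean buffer is needed.

```fortran
program init_device_array
  use cudafor
  implicit none
  integer, parameter :: n = 1024             ! hypothetical array size
  integer, parameter :: nsteps = 10          ! hypothetical number of iterations
  real, device, allocatable :: dArr(:)       ! device-resident array
  real :: hArr(n)                            ! host copy, used only for checking
  integer :: istat, step

  ! Allocate once, up front. The contents are undefined at this point.
  allocate(dArr(n), stat=istat)
  if (istat /= 0) stop 'device allocation failed'

  do step = 1, nsteps
     dArr = 0.0                              ! explicit zero fill on the device
     ! ... kernels that use dArr as scratch space would be launched here ...
  end do

  hArr = dArr                                ! device-to-host copy via assignment
  print *, 'largest magnitude after fill:', maxval(abs(hArr))

  deallocate(dArr)                           ! free once, at the end
end program init_device_array
```

Reusing one buffer this way is only one option for reducing allocation overhead; whether it fits depends on how the array sizes vary across the application.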