I’m using the PGI Visual Profiler to profile my application on a Tesla K40 GPU on Windows 10. My code is compiled with the PGI Fortran compiler 19.0 Community Edition. Please see the attached image of the screen captured from the PGI Visual Profiler. To my surprise, GPU memory allocation and deallocation take about one third of the total running time of the application. I have several questions:
- In general, how fast are GPU memory allocation and deallocation?
- Are all dynamically allocated variables automatically initialized to zero?
- If memory allocation and deallocation are slow, how can I overlap these operations with other computation?
- It is strange that I don’t deallocate any memory in the beginning part of my application, so why are there still two long cudaFree calls there?
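
In case it helps, here is a simplified sketch of the allocation pattern I’m profiling (array names and sizes are placeholders, not my actual code). I’m wondering whether hoisting the allocate/deallocate out of the stepping loop is the expected fix:

```fortran
! Simplified sketch of my current pattern (names and sizes are illustrative).
! Device allocatables are allocated and freed inside the loop, which I
! believe is what shows up as cudaMalloc/cudaFree time in the profiler.
program alloc_pattern
  use cudafor
  implicit none
  real, device, allocatable :: a_d(:), b_d(:)
  integer :: istep
  integer, parameter :: n = 1000000

  do istep = 1, 100
     allocate(a_d(n), b_d(n))   ! triggers cudaMalloc each iteration
     ! ... kernel launches / computation on a_d, b_d ...
     deallocate(a_d, b_d)       ! triggers cudaFree each iteration
  end do
end program alloc_pattern
```

Is allocating the device arrays once before the loop and reusing them the standard approach, or is there a way to make these operations asynchronous?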
Thanks in advance.