cuDNN crashes when a custom memory manager allocates a high fraction of CUDA memory

cuDNN keeps crashing with an error when I make a very large CUDA allocation.
I do my own memory management for DNN workloads: I allocate one very large chunk from CUDA and manage it myself.
When I round the allocation down to the nearest GB, everything works.
When I try to squeeze more memory out of CUDA, cuDNN fails with the error below.
The root cause is not yet clear, but it increasingly looks like cuDNN needs extra memory beyond what I allocate and fails when not enough is left.
Could you please suggest how to debug this kind of issue?
How can I find the root cause of the failure?
Does cuDNN allocate more memory on its own? If so, how much, and how can I find an upper limit?

terminate called after throwing an instance of 'std::runtime_error'
what(): CUDNN_STATUS_INTERNAL_ERROR
*** Aborted at 1586477971 (unix time) try "date -d @1586477971" if you are using GNU date ***
PC: @ 0x7ff620634e97 gsignal
*** SIGABRT (@0x46a6422c0000e5af) received by PID 58799 (TID 0x7ff65972a380) from PID 58799; stack trace: ***
@ 0x7ff658623890 (unknown)
@ 0x7ff620634e97 gsignal
@ 0x7ff620636801 abort
@ 0x7ff62125857d __gnu_cxx::__verbose_terminate_handler()
@ 0x7ff621256556 __cxxabiv1::__terminate()
@ 0x7ff6212565a1 std::terminate()
@ 0x7ff621256806 __cxa_rethrow

Thank you,
Gilad

Hi,

Could you please check how much GPU memory you have and how much of it is being allocated?
I believe some percentage of GPU memory (say, 5%) is usually left unused so that the driver has enough memory to work with.
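
As a rough sketch of that headroom idea, assuming the pool is sized from host-side C++: the helper below is hypothetical (it is not part of CUDA or cuDNN), and the 5% figure is only the guess mentioned above, not a documented limit. The free-byte count would come from cudaMemGetInfo, which reports free and total device memory. The helper only does the arithmetic: reserve a percentage, then round the remainder down to a chosen granularity.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a CUDA or cuDNN API.
// free_bytes:       free device memory, e.g. as reported by cudaMemGetInfo.
// headroom_percent: share of free memory to leave unallocated (0..100).
//                   The right value is an assumption to tune, not a known limit.
// granularity:      round the result down to a multiple of this many bytes
//                   (0 means no rounding).
std::size_t pool_size_with_headroom(std::size_t free_bytes,
                                    unsigned headroom_percent,
                                    std::size_t granularity) {
    assert(headroom_percent <= 100);

    // floor(free_bytes * headroom_percent / 100), split to avoid overflow.
    std::size_t reserve = (free_bytes / 100) * headroom_percent
                        + ((free_bytes % 100) * headroom_percent) / 100;

    std::size_t usable = free_bytes - reserve;
    if (granularity == 0) {
        return usable;
    }
    // Round down so the pool never exceeds the usable amount.
    return (usable / granularity) * granularity;
}
```

For example, with 16 GiB free, 5% headroom, and 1 GiB granularity, this yields a 15 GiB pool. If cuDNN still fails at that size, raising the headroom step by step gives an empirical estimate of how much extra memory it needs for a given workload.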

Also, you can use cuDNN API logging and the NVIDIA profiler to debug the issue further:
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging

Thanks