I'm working with the Darknet/YOLO framework. With cuDNN v8.4.1 or older, everything is fine. With cuDNN v8.5.0 or newer, Darknet crashes during training while calculating the mAP%.
I'm looking through someone else's code trying to understand what is happening. At the point where it fails, it logs this:
CUDA status Error: file: ./src/network_kernels.cu: func: network_predict_gpu() line: 735
CUDA Error: an illegal memory access was encountered
Is anyone familiar with the changes in cuDNN between 8.4.1 and 8.5.0 that might shed some light on this?
Line 735 of the file in question is a call to cudaStreamSynchronize(): darknet/network_kernels.cu at master · AlexeyAB/darknet · GitHub
(CUDNN error at training iteration 1000 when calculating mAP% · Issue #8669 · AlexeyAB/darknet · GitHub)
One of the things I’m seeing in the 8.5.0 release notes is this text:
A buffer was shared between threads and caused segmentation faults. There was previously no way to have a per-thread buffer to avoid these segmentation faults. The buffer has been moved to the cuDNN handle. Ensure you have a cuDNN handle for each thread because the buffer in the cuDNN handle is only for the use of one thread and cannot be shared between two threads.
This looks promising. If the different threads didn't have unique cuDNN handles and previously shared one, would this cause the error we're seeing, "an illegal memory access was encountered"?
Which API should I search for to find where threads obtain a cuDNN handle?
Hi @stephanecharette,
Apologies for the delayed response.
Can you please try running your workload under compute-sanitizer to identify the kernel causing the illegal memory access (IMA)?
If it's a cuDNN kernel, please create a cuDNN API log (Developer Guide - NVIDIA Docs) and share it with us.
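For reference, those two steps might look like the following on the command line. The darknet arguments are placeholders for the actual training command; CUDNN_LOGINFO_DBG and CUDNN_LOGDEST_DBG are the documented cuDNN environment variables for API logging:

```shell
# Run the training workload under compute-sanitizer's memcheck tool to
# find the kernel performing the illegal memory access.
# (the darknet arguments below are placeholders for your real command)
compute-sanitizer --tool memcheck ./darknet detector train data/obj.data cfg/yolov4.cfg yolov4.conv.137

# If the offending kernel is a cuDNN one, enable cuDNN API logging via
# environment variables and re-run; the log is written to cudnn_api.log.
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=cudnn_api.log
./darknet detector train data/obj.data cfg/yolov4.cfg yolov4.conv.137
```

Note that compute-sanitizer slows training considerably, so it may help to reproduce the crash on a reduced workload first.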