CUDA libraries hang from time to time in our NN modules. It happens rarely, but it keeps happening. Unfortunately, we have been unable to determine the specific conditions under which it occurs.
The good news is that we have a system that makes our applications crash with a minidump if they hang. Since you have access to the sources/PDBs of your drivers and CUDA libraries, could you please take a look at the example dump attached below to see what actually happened there?
The version of CUDA is 10.2
The driver version is 451.67
Caffe2Server_minimal1.zip (17.3 KB)