Our application loads multiple models, some Caffe and some ONNX (exported from PyTorch). We recently added an ONNX model based on ResNeXt-101. Adding this model causes the application to fail with CUDA error 719 (cudaErrorLaunchFailure) anywhere from 10 minutes to 1 hour after the start of the run.
The application uses TensorRT 188.8.131.52, CUDA 10.1, and cuDNN 184.108.40.206. The issue appears on Windows only; Linux is unaffected.
We have observed it on multiple GPUs (Turing and Pascal) and across multiple driver versions.
The application uses a single thread to launch all the models. Each model is enqueued asynchronously, and all models share the same CUDA stream.
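For reference, the launch loop looks roughly like this (a simplified sketch, not our exact code; `contexts`, `bindings`, and `sharedStream` are placeholder names, and `enqueueV2` assumes an explicit-batch engine, older engines use `enqueue`):

```cpp
// Simplified sketch: one host thread, one shared stream.
// contexts[i] is the IExecutionContext* for model i; bindings[i] holds
// that model's device buffer pointers. Names are illustrative only.
for (size_t i = 0; i < contexts.size(); ++i) {
    // All models are enqueued asynchronously on the same CUDA stream.
    contexts[i]->enqueueV2(bindings[i].data(), sharedStream, nullptr);
}
// We only synchronize once, after everything has been enqueued.
cudaStreamSynchronize(sharedStream);
```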
How can I determine what the issue is? This model alone runs fine in trtexec, but it seems to misbehave when coupled with one of our other models.
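Since a launch failure is only reported at the next synchronizing call, one thing we could try is temporarily synchronizing and checking the error state after each model's enqueue, so the first failing launch is identified rather than blamed on whatever call happens to synchronize later. A debug-only sketch (`contexts`, `bindings`, and `stream` are placeholder names for our per-model execution contexts, device buffers, and the shared stream):

```cpp
// Debug-only: serialize the launches to find which model faults.
for (size_t i = 0; i < contexts.size(); ++i) {
    if (!contexts[i]->enqueueV2(bindings[i].data(), stream, nullptr)) {
        std::cerr << "enqueue failed for model " << i << "\n";
        break;
    }
    // Force the kernels to run now; error 719 (cudaErrorLaunchFailure)
    // surfaces here instead of at some later, unrelated call.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::cerr << "model " << i << " failed: "
                  << cudaGetErrorString(err) << "\n";
        break;
    }
}
```

Running the application under cuda-memcheck might also pinpoint the offending kernel, since an out-of-bounds access is a common cause of error 719, though the intermittent nature (10 minutes to 1 hour) could make that slow.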