On the server side it converts an ONNX model to a TensorRT engine using the Builder, then loads the engine and waits for requests. The code for that is below; it hasn't changed much in the new version, and the issue is the same when this code is run directly.
The difference in the new project is that requests come from a Flask server, and the Flask endpoint calls the predict function. That is where the error occurs.
The server code is as follows. All of our projects use a standard way of communicating so that the features of our models are easily interchangeable.
I didn't see the link you shared; did you forget to include it?
My suspicion is that, since Flask/werkzeug uses a pool of worker threads to service requests, the TensorRT inference is actually being run from different threads (or at least a thread different from the one it was initialized in), and those worker threads may not have a CUDA context active in them. I believe that pycuda.autoinit sets up the context only for the current thread.
To begin debugging this, I would print the current thread ID in your request handler (with threading.get_ident() or similar). Try running your Flask app with threaded=False and see if that changes the behavior (it defaults to True). Then experiment with PyCUDA to check the current CUDA context from the worker threads and activate/retain it if necessary.
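To illustrate the suspected failure mode without needing a GPU, here is a plain-Python sketch: per-thread state set up on the main thread (analogous to the CUDA context that pycuda.autoinit makes current only on the importing thread) is simply not there when a handler runs on a werkzeug-style worker thread:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage, standing in for the CUDA context that pycuda.autoinit
# makes current only on the thread that imported it (the main thread here).
state = threading.local()
state.ctx = "cuda-context"  # initialized on the main thread only

def handler():
    # Simulates a Flask request handler running on a worker thread.
    tid = threading.get_ident()
    has_ctx = hasattr(state, "ctx")  # False: this thread never set it up
    return tid, has_ctx

main_tid = threading.get_ident()
with ThreadPoolExecutor(max_workers=2) as pool:
    worker_tid, worker_has_ctx = pool.submit(handler).result()

print(main_tid != worker_tid)  # True: the handler ran on a different thread
print(worker_has_ctx)          # False: no "context" on the worker thread
```

With real PyCUDA, the usual remedy is to create the context explicitly at startup and push/pop it (ctx.push() / ctx.pop()) around the inference call inside the handler, so whichever worker thread services the request has it active.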
As mentioned, it runs the streaming/inference in a thread separate from the request handlers. I believe that approach is also more portable when it comes to deploying Flask applications behind a production WSGI server (like gunicorn), which spawns multiple server processes (not just threads); the inference would eventually end up in its own process anyway, if you ever need to scale up your number of simultaneous users.
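A minimal sketch of that separate-process layout (the names and the doubling "model" are hypothetical stand-ins, not code from jetson-inference): the inference process owns its own queues, so CUDA initialization and inference always happen on that one process's thread, regardless of how many worker threads or processes the webserver spawns:

```python
import multiprocessing as mp

def inference_worker(requests_q, results_q):
    # Runs in its own process: this is where the CUDA context and TensorRT
    # engine would be created, once, on this process's own thread.
    for item in iter(requests_q.get, None):  # None is the shutdown sentinel
        results_q.put(item * 2)              # stand-in for real inference

def run_demo():
    requests_q, results_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=inference_worker, args=(requests_q, results_q))
    worker.start()
    # A Flask request handler would only enqueue its input and wait for the
    # result, so no CUDA work ever runs on a werkzeug worker thread.
    requests_q.put(21)
    result = results_q.get()
    requests_q.put(None)  # shut the worker down
    worker.join()
    return result

if __name__ == "__main__":
    print(run_demo())  # 42
```

The same queue pattern works unchanged whether the webserver uses threads or gunicorn-style worker processes, since the queues are the only shared surface.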
An example where the camera streaming/inference is in its own process can be found here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/webrtc-dash.md
In that example, the webserver request handlers communicate with the backend streaming/inference server via REST. In this case it is over localhost, but the webserver could just as well be running on a different machine than the streaming/inference process, which is another nice thing for scalability.
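Sketched with only the standard library (the endpoint path, JSON fields, and the canned "cat" result are made up for illustration, not the jetson-inference protocol), the frontend-to-backend REST hop looks like this; because it is plain HTTP, swapping 127.0.0.1 for another host's address is all it takes to move the backend off-box:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class InferenceHandler(BaseHTTPRequestHandler):
    # Stand-in for the backend streaming/inference server's REST endpoint.
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": "cat", "input": payload["image"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Backend "inference server" on localhost; port 0 picks a free port.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# What the webserver's request handler does: forward the work over REST.
url = f"http://127.0.0.1:{server.server_port}/infer"
req = Request(url, data=json.dumps({"image": "frame0"}).encode(),
              headers={"Content-Type": "application/json"})
reply = json.loads(urlopen(req).read())
print(reply["label"])  # cat

server.shutdown()
```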