TensorRT issue with inference from Flask server (Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED))

Hi all,

I have a nearly complete project that is the follow-on to my previous TensorRT on Jetson Nano project (GitHub - AMLResearchProject/all-jetson-nano-classifier: An Acute Lymphoblastic Leukemia classifier developed for the NVIDIA Jetson Nano. Jetson AI Certification project by Adam Milton-Barker.) The new project is essentially the same, but adds MQTT integration and a Flask server that can run inference. Classification locally using TensorRT works fine, and remote inference via the Flask server also works fine for plain GPU inference and TFRT. However, TensorRT inference triggered by a request through Flask gives me Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED) on the server side. Is there any reason why this would happen, or anything I need to account for in code to allow HTTP inference to work?

This is the full error:

ERROR: ../rtSafe/cuda/cudaPoolingRunner.cpp (211) - Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

Thanks in advance.

@AdamMiltonBarkerOfficial it’s hard to diagnose up-front given the potential complexities of such an application, but is the application threaded? Are you passing in CUDA memory to TensorRT?

Here is an example of a Flask server that does WebRTC, REST, and inference with TensorRT. The inferencing runs in a different thread though.

Hi Dusty,

On the server side it converts an ONNX model to a TensorRT engine using the Builder, then loads the engine and waits for requests. This is the code for that; it is not modified much in the new version, but the issue is the same when using this code directly.

The difference in the new project is that the request comes from a Flask server, and the Flask endpoint calls the predict function. That is where the error happens.

The code for the server is as follows, all of our projects use a standard way of communicating so that features of our models are easily interchangeable.

I didn’t see the link you shared; did you miss it out?

My suspicion is that since Flask/Werkzeug uses a pool of worker threads to service requests, the TensorRT inference is actually being run from different threads (or a thread different from the one it was initialized in), and those worker thread(s) may not have a CUDA context initialized in them. I believe that pycuda.autoinit does it only for the current thread.

To begin debugging this, I would print out the current thread ID in your request handler (with threading.get_ident() or similar). Try running your Flask app with threaded=False and see if that changes the behavior (it defaults to True). Then experiment with PyCUDA to check the current CUDA context from the worker threads and activate/retain it if necessary.
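A minimal sketch of those first two steps — the route name and app layout here are illustrative, not taken from your project:

```python
from flask import Flask
import threading

app = Flask(__name__)

# Thread that would have imported pycuda.autoinit / built the TensorRT engine.
INIT_THREAD = threading.get_ident()

@app.route("/predict")
def predict():
    # Step 1: log which thread is servicing the request and compare it
    # with the thread the CUDA context was created on.
    handler_thread = threading.get_ident()
    print(f"init thread: {INIT_THREAD}, handler thread: {handler_thread}")
    return {"same_thread": handler_thread == INIT_THREAD}

if __name__ == "__main__":
    # Step 2: disable the worker-thread pool so every request runs on
    # the same thread the CUDA context lives in.
    app.run(host="0.0.0.0", port=5000, threaded=False)
```

If `same_thread` comes back false under the default threaded server but the error disappears with threaded=False, that points squarely at the missing per-thread CUDA context.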

This may be a helpful example: https://github.com/inducer/pycuda/blob/ad519613d8930dff319bf30c78c0fbf8bd9d424c/examples/from-wiki/multiple_threads.py

Sorry about that, here it is: https://github.com/dusty-nv/jetson-inference/blob/master/docs/webrtc-flask.md

As mentioned, it runs the streaming/inference in a separate thread from the request handlers. I believe that approach may be more portable when it comes to actually deploying Flask applications with a production WSGI webserver (like gunicorn), which spawns multiple server processes (not just threads); the inferencing would eventually end up in its own process too, if you ever need to scale up your number of simultaneous users.
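The single-inference-thread pattern can be sketched in plain Python — all names here are illustrative, and the worker's "prediction" is a stand-in for the real thing:

```python
import queue
import threading

jobs = queue.Queue()

def inference_worker():
    # In the real app this thread would import pycuda.autoinit and own the
    # TensorRT engine/execution context, so inference always runs on the
    # thread that holds the CUDA context.
    while True:
        item = jobs.get()
        if item is None:              # shutdown sentinel
            break
        payload, done = item
        done["result"] = f"prediction for {payload}"
        done["event"].set()

worker = threading.Thread(target=inference_worker, daemon=True)
worker.start()

def handle_request(payload):
    # What a Flask view would do: enqueue the job and wait for the worker,
    # instead of calling TensorRT directly from the request thread.
    done = {"event": threading.Event()}
    jobs.put((payload, done))
    done["event"].wait(timeout=5)
    return done["result"]

result = handle_request("image.png")
print(result)  # prediction for image.png
jobs.put(None)  # stop the worker
```

Because every request goes through the queue, it no longer matters which worker thread Flask/Werkzeug picks to service it.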

An example where the camera streaming/inference runs in its own process can be found here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/webrtc-dash.md
In that example, the webserver request handlers communicate with the backend streaming/inference server via REST (in this case over localhost, but the webserver could be running on a different machine than the streaming/inference process, which is another nice thing for scalability).
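The split-process approach can be sketched like this — the backend below is a stand-in stub (the real one would run TensorRT in its own process), and the host/port/route/label are all illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubInferenceHandler(BaseHTTPRequestHandler):
    # Stand-in for the backend inference server process.
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"label": "negative", "input": body["image"]})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply.encode())

    def log_message(self, *args):  # keep the demo quiet
        pass

# Start the "backend" on localhost (port 0 = pick any free port).
server = HTTPServer(("127.0.0.1", 0), StubInferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
backend_url = f"http://127.0.0.1:{server.server_port}/predict"

def forward_to_backend(payload):
    # What the webserver's request handler would do: POST the job on to
    # the backend over REST rather than touching CUDA itself.
    req = urllib.request.Request(
        backend_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())

result = forward_to_backend({"image": "sample.png"})
print(result["label"])  # negative
server.shutdown()
```

Swapping the localhost address for another machine's is all it takes to move the inference process off the webserver host.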


Perfect thanks Dusty.