Subject: Inconsistent Errors with DeepStream 6.2 on dGPU: cuInit failed: 999
Hardware Platform: dGPU
DeepStream Version: 6.2
TensorRT Version: 8.5.2-1
NVIDIA GPU Driver Version: 535.54.03
Issue Description:
I am running a DeepStream-based camera analytics application on a dGPU and encountering intermittent failures. Initially the application runs smoothly, but after a prolonged period it fails with the following error:
nvbufsurftransform:cuInit failed : 999
When this happens, I check the CUDA status with torch.cuda.is_available(), and it returns False.
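For reference, here is the minimal probe I use to reproduce the check without pulling in PyTorch. It calls cuInit directly through the driver library, mirroring the call that fails inside nvbufsurftransform (a sketch; it assumes libcuda.so.1 is on the loader path):

```python
import ctypes

def cuda_init_status() -> str:
    """Call cuInit(0) straight from the CUDA driver library.

    This is the same initialization call that nvbufsurftransform makes,
    so on a broken server it returns the same raw error code
    (999 = CUDA_ERROR_UNKNOWN).
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "libcuda.so.1 not found"
    result = libcuda.cuInit(0)  # CUresult; 0 means CUDA_SUCCESS
    return "ok" if result == 0 else f"cuInit failed: {result}"

print(cuda_init_status())
```

On a healthy server this prints `ok`; once the failure hits, it reports the same 999 code that appears in the DeepStream log.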
I attempted to resolve the issue by running the following commands, as suggested in other discussions:
sudo modprobe --remove nvidia-uvm # same as `rmmod`
sudo modprobe nvidia-uvm
While this temporarily allows the application to run again, the issue recurs shortly thereafter.
Additionally, at times, the application fails with a completely different set of errors, without any changes to the source code:
WARNING: [TRT]: Unable to determine GPU memory usage
WARNING: [TRT]: Unable to determine GPU memory usage
WARNING: [TRT]: CUDA initialization failure with error: 214. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
python3: ../nvdsinfer/nvdsinfer_model_builder.cpp:618: nvdsinfer::TrtModelBuilder::TrtModelBuilder(int, nvinfer1::ILogger&, const std::shared_ptr<nvdsinfer::DlLibHandle>&): Assertion `m_Builder' failed.
What confuses me is that other servers with identical configurations (hardware, software versions, and application setup) do not encounter this problem and continue to run without issues.
Could you give me some advice? Thank you for your help!