ncu fails with "Found no NVIDIA driver on your system"

Hi,
I am having a hard time running ncu on my system:

==PROF== Connected to process 43447 (/home/zeyu.chen/.cache/bazel/_bazel_zeyu.chen/6038ac9fefa83b1b010a60cd411239e6/external/python3_x86_64/bin/python3.9)
Traceback (most recent call last):
  File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/__main__.py", line 126, in <module>
    main()
  File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/__main__.py", line 122, in main
    exec(ast, clean_globals)
  File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.py", line 15, in <module>
    input_data = torch.randn(batch_size, seq_len, hidden_dim).to("cuda")
  File "/home/zeyu.chen/.../torch/cuda/__init__.py", line 314, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
==PROF== Disconnected from process 43447
==ERROR== The application returned an error code (1).

However, I am able to run nsys with the same program. How can I debug this?

Hi, @zeyu-chen

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Can you make sure your app works properly without ncu?

Yeah, I am pretty sure it can. I can even run nsys with my program.

That’s strange, because this error is not actually reported by ncu.

Can you please check whether this is specific to your application? I mean, can you try a simple CUDA sample to see if ncu works with it?
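Since the traceback shows the error being raised from the PyTorch CUDA initialization rather than from ncu itself, one way to take PyTorch out of the picture is to load the driver library directly and call cuInit. The following is only a minimal sketch: the function name probe_cuda_driver is made up for illustration, and it assumes a Linux system where the driver library is named libcuda.so.1.

```python
import ctypes


def probe_cuda_driver(libname="libcuda.so.1"):
    """Try to load the CUDA driver library and call cuInit(0).

    Returns (loaded, cu_init_result). cu_init_result is None when the
    library could not be loaded, and 0 (CUDA_SUCCESS) when the driver
    initialized correctly. probe_cuda_driver is a hypothetical helper,
    not part of ncu or PyTorch.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError as exc:
        print(f"could not load {libname}: {exc}")
        return False, None
    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    result = lib.cuInit(0)
    print(f"loaded {libname}, cuInit(0) returned {result}")
    return True, result
```

Calling probe_cuda_driver() at the top of the benchmark script, once in a plain run and once under ncu, should show whether the driver library itself loads and initializes differently in the two cases. A non-zero result under ncu only would point at the environment the process is launched with rather than at PyTorch.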

My program is not a CUDA sample. I tried cloning the CUDA samples from git, and ncu does work with them.

My program is a shell script generated by Bazel to run Python (PyTorch). I actually tried injecting ncu directly into the execution command:

exec "env" "${env_vars[@]}" "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH" "${gdb_command[@]}" ncu "${python_command[@]}" "$@"

but it still failed.

One thing I observed is that if I remove /usr/lib/x86_64-linux-gnu from the command above, ncu itself can’t detect the driver. Maybe I am linking the library the wrong way? I guess the issue is related to some env var setup.
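Because the behavior changes with the contents of LD_LIBRARY_PATH, it may help to list every libcuda file that the search path exposes, in search order. The sketch below is an assumption-laden illustration: find_libcuda_candidates is a hypothetical helper, it only looks at the directories passed in (not at the loader cache or default system directories), and it treats any directory with a path component named "stubs" as a likely stub location.

```python
import glob
import os


def find_libcuda_candidates(ld_library_path):
    """Return (directory, filename, looks_like_stub) for each libcuda file,
    in the order the directories appear in the given search path.

    Hypothetical helper for illustration; "looks_like_stub" is only a
    heuristic based on a path component named "stubs".
    """
    hits = []
    for directory in ld_library_path.split(":"):
        if not directory:
            continue  # skip empty entries
        looks_like_stub = "stubs" in os.path.normpath(directory).split(os.sep)
        for path in sorted(glob.glob(os.path.join(directory, "libcuda.so*"))):
            hits.append((directory, os.path.basename(path), looks_like_stub))
    return hits


# Print what the current environment exposes, earliest directory first.
for hit in find_libcuda_candidates(os.environ.get("LD_LIBRARY_PATH", "")):
    print(hit)
```

Running this inside the Bazel-generated wrapper, with the same environment the Python process receives, would show whether some directory ahead of /usr/lib/x86_64-linux-gnu provides its own libcuda.so or libcuda.so.1.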

Any recommendation on how to debug this? Do you need more logging to further narrow down the issue?

Hi, @zeyu-chen

There is no specific env setting required before running ncu.
If ncu ./sample (CUDA sample) works, then it means there is no issue with ncu.
Would you please check whether some specific ENV setting causes the failure?

Would you please check whether some specific ENV setting causes the failure?

Just by looking at $ENV? I am not familiar with the build system, and most of my coworkers are out.
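Rather than reading through the environment by eye, one option is to dump it from inside the Python process in a run that works (for example under nsys) and in the run that fails under ncu, then compare the two dumps. This is a rough sketch; dump_env, load_env, and diff_env are hypothetical helper names, and the file paths in the usage note are placeholders.

```python
import json
import os


def dump_env(path):
    """Write the current process environment to a JSON file."""
    with open(path, "w") as f:
        json.dump(dict(os.environ), f, indent=2, sort_keys=True)


def load_env(path):
    """Read an environment previously written by dump_env."""
    with open(path) as f:
        return json.load(f)


def diff_env(reference, observed):
    """Compare two environment mappings.

    Returns (only_in_reference, only_in_observed, changed), where changed
    maps a variable name to its (reference_value, observed_value) pair.
    """
    only_in_reference = {k: v for k, v in reference.items() if k not in observed}
    only_in_observed = {k: v for k, v in observed.items() if k not in reference}
    changed = {
        k: (reference[k], observed[k])
        for k in reference
        if k in observed and reference[k] != observed[k]
    }
    return only_in_reference, only_in_observed, changed
```

For instance, calling dump_env("/tmp/env_nsys.json") as the first statement of the benchmark in the working run and dump_env("/tmp/env_ncu.json") in the failing run, then passing both loaded dictionaries to diff_env, narrows the search to the variables that actually differ. Variables such as LD_LIBRARY_PATH, LD_PRELOAD, and CUDA_VISIBLE_DEVICES are the ones I would look at first, though that is a guess and not something confirmed in this thread.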

I am curious how ncu looks for the CUDA driver when initializing. It seems like my env at least disturbs the ncu startup. Should I collect some nsys logs for you to take a look at?

I just noticed there is an option called injection-path-64. Should I use it to launch my app?

Hi, @zeyu-chen

This still seems to be an ENV setup issue. Please check the details in the thread "Nsight Compute failed to connect to the CUDA driver (stub libcuda.so[.1] on path?)". It looks like a similar issue.
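If a stub libcuda on the path is the suspect, one way to check is to look at which libcuda file the process has actually mapped, using /proc/<pid>/maps on Linux. The sketch below is illustrative only: mapped_libcuda_paths and mapped_libcuda_in_current_process are hypothetical helpers, and the approach assumes the driver library has already been loaded by the time the maps file is read.

```python
def mapped_libcuda_paths(maps_text):
    """Return the distinct libcuda paths named in /proc/<pid>/maps content."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if name.startswith("libcuda.so") and path not in paths:
            paths.append(path)
    return paths


def mapped_libcuda_in_current_process():
    """Linux only: inspect the mappings of the running Python process."""
    with open("/proc/self/maps") as f:
        return mapped_libcuda_paths(f.read())
```

In the benchmark script, the failing CUDA call could be wrapped in a try/except block that prints mapped_libcuda_in_current_process() before re-raising. A path under a directory named "stubs", or anything other than the driver installed under /usr/lib/x86_64-linux-gnu, would be consistent with the stub-library explanation in the linked thread.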