==PROF== Connected to process 43447 (/home/zeyu.chen/.cache/bazel/_bazel_zeyu.chen/6038ac9fefa83b1b010a60cd411239e6/external/python3_x86_64/bin/python3.9)
Traceback (most recent call last):
File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/__main__.py", line 126, in <module>
main()
File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/__main__.py", line 122, in main
exec(ast, clean_globals)
File "/home/zeyu.chen/development/github.robot.car/cruise/cruise/develop/build/bin/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.runfiles/cruise_ws/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels_exedir/cruise/mlp/robotorch2/experimental/benchmark_liger_kernels.py", line 15, in <module>
input_data = torch.randn(batch_size, seq_len, hidden_dim).to("cuda")
File "/home/zeyu.chen/.../torch/cuda/__init__.py", line 314, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
==PROF== Disconnected from process 43447
==ERROR== The application returned an error code (1).
However, I am able to run nsys with the program. How can I debug this?
One thing I observed: if I remove /usr/lib/x86_64-linux-gnu above, then it is ncu that can't detect the driver. Maybe I am linking the library the wrong way? I guess the issue is related to some env var setup.
There is no specific env setting required before running ncu.
If ncu ./sample (CUDA sample) works, then it means there is no issue with ncu.
Could you please check whether some specific environment setting is causing the failure?
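One way to narrow this down is to print only the variables that tend to affect library and GPU lookup, once when the program runs directly and once when it runs under ncu, and then compare the two outputs. The following is a minimal sketch; the helper name `driver_related_env` and the choice of variable names and prefixes are illustrative assumptions, not a list that ncu documents.

```python
import os

# Variables that commonly influence which shared libraries and GPUs a process sees.
# This selection is an illustrative assumption, not something ncu documents.
EXACT_NAMES = {"LD_LIBRARY_PATH", "LD_PRELOAD", "PATH", "PYTHONPATH"}
PREFIXES = ("CUDA_", "NVIDIA_", "NV_")


def driver_related_env(environ):
    """Return the subset of `environ` likely to matter for driver lookup, sorted by name."""
    return {
        name: value
        for name, value in sorted(environ.items())
        if name in EXACT_NAMES or name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    # Print one NAME=value pair per line so two runs can be diffed.
    for name, value in driver_related_env(os.environ).items():
        print(f"{name}={value}")
```

Placing the same few lines at the top of the benchmark script, before the first `torch` call, would show the environment as the failing process actually sees it.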
Just by looking at $ENV? I am not familiar with the build system, and most of my co-workers are out.
I am curious how ncu looks for the CUDA driver when initializing; it seems like my env is at least disturbing the ncu startup. Should I collect some nsys logs for you to take a look at?
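A small probe can at least show whether the dynamic loader in a given environment is able to find the driver library at all. The sketch below does not reproduce how ncu itself locates the driver; it only tries to load `libcuda.so.1` with `ctypes` and call the driver API function `cuInit(0)`. The function name `probe_cuda_driver` is hypothetical, and the assumption is that a failure here under ncu but not under nsys would point at the environment rather than at the tool.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Try to load the CUDA driver library and call cuInit(0).

    Returns (ok, detail); ok is True only if the library loads and cuInit returns 0.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The dynamic loader could not find or open the library.
        return False, f"could not load {lib_name}: {exc}"
    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    status = lib.cuInit(0)  # 0 means CUDA_SUCCESS
    if status != 0:
        return False, f"cuInit returned error code {status}"
    return True, "libcuda loaded and cuInit succeeded"


if __name__ == "__main__":
    ok, detail = probe_cuda_driver()
    print("driver OK" if ok else "driver problem", "-", detail)
```

Running this script directly, under nsys, and under ncu with the same launcher should make it clear whether the library search path differs between the three cases.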