I am trying to use ncu (Nsight Compute) to check the GPU usage of a simple function, but I get an error. I am running inside a container built from:
ARG CUDA_VERSION=12.1.0
ARG CUDNN_VERSION=8
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${OS_VERSION}
My device is a single RTX 3090 GPU, and ncu -v reports version 2023.1.0.0 (build 32376155) (public-release).
The container is already started with these capabilities:
cap_add:
  - SYS_ADMIN
  - CAP_SYS_PTRACE
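To confirm that the cap_add settings actually reach the process, one option is to read the effective capability mask inside the container. This is a minimal sketch, assuming a Linux container: it parses the CapEff line of /proc/self/status and checks the standard Linux bit positions CAP_SYS_PTRACE = 19 and CAP_SYS_ADMIN = 21. The helper names parse_cap_eff and missing_caps are hypothetical.

```python
from __future__ import annotations

import os

# Standard Linux capability bit positions (as defined in linux/capability.h).
CAP_BITS = {"CAP_SYS_PTRACE": 19, "CAP_SYS_ADMIN": 21}


def parse_cap_eff(status_text: str) -> int:
    """Return the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, e.g. "00000000a80425fb".
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def missing_caps(cap_eff: int) -> list[str]:
    """List the capabilities from CAP_BITS whose bit is NOT set in cap_eff."""
    return [name for name, bit in CAP_BITS.items() if not (cap_eff >> bit) & 1]


if __name__ == "__main__":
    path = "/proc/self/status"
    if os.path.exists(path):
        with open(path) as f:
            mask = parse_cap_eff(f.read())
        print("missing capabilities:", missing_caps(mask) or "none")
    else:
        print("/proc/self/status is unavailable (not running on Linux)")
```

For reference, Docker's default mask is 00000000a80425fb, which has neither bit set; with both capabilities added it should read 00000000a82c25fb.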
$ nvidia-smi
NVIDIA-SMI 560.35.02 Driver Version: 560.94 CUDA Version: 12.6
My host also has CUDA 12.1 installed (the same version as the container).
I checked the documentation, and Ampere GPUs should be supported since the 2022 update; the container is also started with admin capabilities. My host OS is Windows 11 and I installed the Ubuntu version of ncu, so I don't know what else could cause this issue.
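A quick way to double-check two facts this paragraph relies on, that the GPU is an Ampere part and what kind of Linux kernel the container sees, is a small script like the one below. It is a sketch under stated assumptions: the compute-capability table covers only a few recent generations, the WSL check relies on the "microsoft" string that WSL kernels report in /proc/version (the differing NVIDIA-SMI and driver versions in the nvidia-smi output suggest a WSL 2 setup, but that is an assumption), and PyTorch is queried only if it is importable. The helpers arch_name and is_wsl are hypothetical.

```python
from __future__ import annotations

# Compute capability (major, minor) -> architecture name for common NVIDIA GPUs.
# An RTX 3090 (GA102) reports compute capability 8.6.
ARCH_BY_CC = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 7): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}


def arch_name(cc: tuple[int, int]) -> str:
    """Map a compute capability tuple to an architecture name, or 'unknown'."""
    return ARCH_BY_CC.get(tuple(cc), "unknown")


def is_wsl(proc_version: str) -> bool:
    """Return True if a /proc/version string looks like a WSL kernel."""
    return "microsoft" in proc_version.lower()


if __name__ == "__main__":
    try:
        with open("/proc/version") as f:
            print("WSL kernel:", is_wsl(f.read()))
    except OSError:
        print("/proc/version is unavailable (not running on Linux)")

    try:
        import torch  # optional: only used when installed
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cc = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), cc, arch_name(cc))
    else:
        print("PyTorch with CUDA is not available; skipping the device query")
```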
Test script (test_ncn.py):
import torch

def test_cuda():
    device = torch.device("cuda:0")
    x = torch.randn(1000, 1000, device=device)
    y = torch.matmul(x, x)
    torch.cuda.synchronize()  # Ensure GPU computation completes
    print(y)

if __name__ == "__main__":
    test_cuda()
Running it under ncu gives:
$ ncu python3 test_ncn.py
==PROF== Connected to process 7698 (/usr/bin/python3.10)
==ERROR== Unknown Error on device 0.
tensor([[ -1.6078, 3.2960, -25.2283, ..., -39.3344, 19.4053, 12.0484],
[-10.6675, -40.8973, -17.8383, ..., -44.4900, 55.2061, -15.0428],
[-13.4432, 18.6459, -2.8020, ..., 22.4246, 31.6181, 21.3463],
...,
[ -9.2275, -18.6414, -3.8433, ..., -36.3853, 19.7588, 44.5568],
[-14.8734, 46.8846, 21.3486, ..., -1.1413, 49.1632, 28.2038],
[-23.3245, -21.7096, 0.9777, ..., -40.5297, 30.1500, 20.3629]],
device='cuda:0')
==PROF== Disconnected from process 7698
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
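The last warning names the --target-processes all option. Below is a minimal sketch of rerunning the script with that option from Python; it only assembles and launches the ncu command line. --target-processes, --launch-count and -o are documented Nsight Compute CLI options, while the build_ncu_cmd helper, the report name and the launch count are hypothetical choices for illustration. Note that the option targets the child-process warning; the "Unknown Error on device 0" line is a separate message.

```python
from __future__ import annotations

import shutil
import subprocess
import sys


def build_ncu_cmd(script: str, report: str | None = None,
                  launch_count: int | None = None) -> list[str]:
    """Assemble an ncu command line that also profiles child processes."""
    cmd = ["ncu", "--target-processes", "all"]
    if launch_count is not None:
        # Stop after profiling this many kernel launches.
        cmd += ["--launch-count", str(launch_count)]
    if report is not None:
        # Write a report file (ncu adds the .ncu-rep extension if missing).
        cmd += ["-o", report]
    # Use the current interpreter to run the target script.
    cmd += [sys.executable, script]
    return cmd


if __name__ == "__main__":
    cmd = build_ncu_cmd("test_ncn.py", report="matmul_report", launch_count=5)
    print("running:", " ".join(cmd))
    if shutil.which("ncu") is None:
        print("ncu was not found on PATH")
    else:
        result = subprocess.run(cmd)
        print("ncu exited with code", result.returncode)
```

Limiting --launch-count keeps the run short, since a PyTorch script can launch many small kernels besides the matmul.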