I am trying to capture a trace of an operation with the Nsight Systems profiler on a server where some GPUs are partitioned with MIG. The command
user.name@server:~/home/user$ nsys profile --trace=cuda -o opfile ./app
results in the following error:
Importer error status: Importation succeeded with non-fatal errors.
**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/EventHandler/TraceEventHandler.cpp(626): Throw in function void QuadDAnalysis::EventHandler::TraceEventParser::operator()(const QuadDCommon::FlatComm::Cuda::Event&)\nDynamic exception type: boost::wrapexcept<QuadDCommon::InternalErrorException>\nstd::exception::what: InternalErrorException\n[QuadDCommon::tag_message*] = Unrecognized GPU UUID: 8d1ffbc7-1d0a-b681-faa3-ce4c66e6abf2\n"
      }
    }
  }
}
The device I selected was GPU 5 in the listing below:
user.name@server:~/home/user$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-41fc493b-8942-679f-e64a-a5e368ef05ae)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-a5b64c40-f8e1-41d1-f852-f241ee3d0e86)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-65baebe7-f531-77b5-4fb4-5f35437ee26b)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-6490c60f-677a-8101-ffc1-647365643059)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-45fe2b48-dd39-0bf6-b8ee-2b290c2befc5)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-bb4d25c5-dd1b-70cf-9f1f-ee713da12527)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-8d1ffbc7-1d0a-b681-faa3-ce4c66e6abf2)
  MIG 3g.40gb Device 0: (UUID: MIG-5bcbb609-6905-5a64-a742-13569e389a8e)
  MIG 3g.40gb Device 1: (UUID: MIG-7ec5d651-b30b-54bb-9e56-0535eb3a18ae)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1d3bc417-a301-0113-4ea2-6c8d581ed042)
  MIG 3g.40gb Device 0: (UUID: MIG-0e6c5df4-a22d-5b9f-b800-cc92dde3ce2c)
  MIG 3g.40gb Device 1: (UUID: MIG-29b38305-fb19-51a6-a41a-8fc139d30abc)
Even though the environment variable CUDA_VISIBLE_DEVICES was set to 5, the UUID in the error above (8d1ffbc7-1d0a-b681-faa3-ce4c66e6abf2) belongs to GPU 6, which is one of the two GPUs with MIG enabled.
user.name@server:~/home/user$ echo $CUDA_VISIBLE_DEVICES
5
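To double-check which device the UUID in the error belongs to, the lookup can be sketched in Python as below. This is only an illustration: the helper name find_gpu_by_uuid and the embedded sample text are my own (the text is abbreviated from the listing above), and the parsing assumes the `nvidia-smi -L` line format shown there.

```python
import re

# Sample text abbreviated from the `nvidia-smi -L` listing in this post.
SMI_OUTPUT = """\
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-bb4d25c5-dd1b-70cf-9f1f-ee713da12527)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-8d1ffbc7-1d0a-b681-faa3-ce4c66e6abf2)
  MIG 3g.40gb Device 0: (UUID: MIG-5bcbb609-6905-5a64-a742-13569e389a8e)
  MIG 3g.40gb Device 1: (UUID: MIG-7ec5d651-b30b-54bb-9e56-0535eb3a18ae)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1d3bc417-a301-0113-4ea2-6c8d581ed042)
  MIG 3g.40gb Device 0: (UUID: MIG-0e6c5df4-a22d-5b9f-b800-cc92dde3ce2c)
  MIG 3g.40gb Device 1: (UUID: MIG-29b38305-fb19-51a6-a41a-8fc139d30abc)
"""


def find_gpu_by_uuid(smi_output, uuid):
    """Hypothetical helper: return (gpu_index, mig_uuids) for the GPU whose
    UUID (without the "GPU-" prefix) equals `uuid`, or None if not found."""
    gpus = {}
    current = None
    for raw_line in smi_output.splitlines():
        line = raw_line.strip()
        # Physical GPU line, e.g. "GPU 6: <name> (UUID: GPU-...)"
        match = re.match(r"GPU (\d+): .*\(UUID: GPU-([0-9a-f-]+)\)", line)
        if match:
            current = int(match.group(1))
            gpus[current] = {"uuid": match.group(2), "mig": []}
            continue
        # MIG instance line; it belongs to the most recent physical GPU.
        match = re.match(r"MIG .*\(UUID: (MIG-[0-9a-f-]+)\)", line)
        if match and current is not None:
            gpus[current]["mig"].append(match.group(1))
    for index, info in gpus.items():
        if info["uuid"] == uuid:
            return index, info["mig"]
    return None


# UUID copied from the "Unrecognized GPU UUID" message in the error.
print(find_gpu_by_uuid(SMI_OUTPUT, "8d1ffbc7-1d0a-b681-faa3-ce4c66e6abf2"))
```

With the sample text above, the lookup resolves to GPU index 6 together with its two MIG instance UUIDs, not to the device 5 selected through CUDA_VISIBLE_DEVICES.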
I didn’t encounter this problem on a server without MIG. Any suggestions for resolving this issue?