Hi,
I am using a system with 4 A100s running Ubuntu 20.04 LTS, with CUDA 11.8 and driver version 525.89.02.
I intend to use DLProf to profile a MONAI training run. I am launching the training (so far on a single GPU) with:
dlprof --mode pytorch --reports summary --formats json --output_path ./outputs_base python3 Dense_UNet_Training_v0.0.py
The training completes OK, but DLProf generates errors and I cannot open the log files with NSYS Compute.
Specifically, the error is:
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/Modules/TraceProcessEvent.cpp(45): Throw in function const string& {anonymous}::GetCudaCallbackName(bool, uint32_t, const QuadDAnalysis::MoreInjection&)\nDynamic exception type: boost::exception_detail::clone_impl<QuadDCommon::InvalidArgumentException>\nstd::exception::what: InvalidArgumentException\n[QuadDCommon::tag_message*] = Unknown driver API function index: 673\n"
      }
    }
  }
}
The "Unknown driver API function index: 673" message would seem to indicate a mismatch between the installed driver's API and the one DLProf expects. Is that the case?
Do you have any suggestions?
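In case it helps to narrow down a possible version mismatch, here is a minimal sketch for collecting the driver and Nsight Systems versions in one place. It assumes `nvidia-smi` and `nsys` are on the PATH; the helper names `tool_output` and `collect_versions` are hypothetical and only for illustration.

```python
import shutil
import subprocess


def tool_output(cmd):
    """Return the stripped stdout of cmd, or None if the tool is missing or fails."""
    # Skip tools that are not installed instead of raising.
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def collect_versions():
    """Gather the versions that matter when checking for a driver/profiler mismatch."""
    return {
        # Prints one driver version line per GPU.
        "driver": tool_output(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # Assumption: nsys is the Nsight Systems CLI used underneath the profiler.
        "nsys": tool_output(["nsys", "--version"]),
    }


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value if value is not None else 'not found'}")
```

I can post the output of this from inside the same environment (or container) in which dlprof is run, if that is useful.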
Thanks for any help,
Andrea