Hello,
I am trying to obtain SM utilization traces, but I have not found an existing tool that provides them directly. As a workaround, I am attempting to combine nsys and ncu output into a single SM utilization and memory bandwidth utilization trace.
ncu provides per-kernel SM utilization, while nsys provides a GPU trace with the start time and duration of each kernel. I use the kernel names to map the ncu SM and memory bandwidth data onto the nsys start and end times.
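For concreteness, this is roughly the matching step I have in mind (a sketch with hypothetical field names, not my exact script): normalize the kernel names by stripping the `(grid)x(block), Context …` suffix that ncu appends and blanking out the lambda instance IDs, then pair entries in launch order within each normalized name:

```python
import re
from collections import defaultdict, deque

def normalize(name: str) -> str:
    """Normalize a kernel name so the ncu and nsys spellings can be compared."""
    # Drop the " (960, 1, 1)x(128, 1, 1), Context 1, ..." suffix that ncu appends.
    name = re.sub(r"\)\s*\(\d+, \d+, \d+\)x\(\d+, \d+, \d+\).*$", ")", name)
    # Lambda instance IDs differ between the tools, so blank them out.
    name = re.sub(r"\(instance \d+\)", "(instance _)", name)
    return name.strip()

def match_by_name(nsys_rows, ncu_rows):
    """Pair nsys trace rows with ncu metric rows by normalized name, in launch order.

    Each row is assumed to be a dict with at least a 'name' key (hypothetical
    schema; adapt to however you export the two reports).
    """
    pools = defaultdict(deque)
    for row in ncu_rows:
        pools[normalize(row["name"])].append(row)
    pairs, unmatched = [], []
    for row in nsys_rows:
        pool = pools.get(normalize(row["name"]))
        if pool:
            pairs.append((row, pool.popleft()))
        else:
            unmatched.append(row)
    return pairs, unmatched
```

Blanking the instance IDs rather than deleting them keeps the rest of the template signature intact, so genuinely different instantiations (e.g. a `float` copy vs a `c10::Half` copy) still compare as different kernels.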
However, this process leads to several issues:
- Instance ID Mapping: When a kernel name contains lambda instance IDs, a single nsys entry can have multiple potential matches in the ncu list, and I am unsure how to handle these.
NCU:
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6) (960, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(long) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6) (29, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
Nsys:
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 10)]::operator ()() const::[lambda(c10::Half) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6)
Note that the instance IDs differ between nsys and ncu.
- Kernel Name Variations: nsys and ncu sometimes report what appears to be the same kernel under different names.
NCU:
void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>, at::detail::Array<char *, (int)2>>(int, T2, T3) (2400, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>, at::detail::Array<char *, (int)2>>(int, T2, T3) (128, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
Nsys:
void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
- Unmapped Kernels: Some kernels, such as sinh, cosh, and exph, appear in nsys but not in ncu. They run for very short durations, and I do not understand why they are present in one tool but not the other. How should I map these?
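For the name-variation cases, where no amount of string normalization will reconcile the two spellings, I am considering a fallback (a sketch under assumed schemas, not a guaranteed fix): match on launch configuration and order instead. ncu embeds the grid and block dimensions in its kernel string, and the nsys GPU trace also reports grid/block per launch, so the i-th launch with a given (grid, block) in one list can be paired with the i-th such launch in the other:

```python
import re
from collections import defaultdict, deque

# Matches the "(2400, 1, 1)x(128, 1, 1)" launch-config suffix in ncu kernel names.
GRID_RE = re.compile(r"\((\d+), (\d+), (\d+)\)x\((\d+), (\d+), (\d+)\)")

def launch_config(ncu_name: str):
    """Extract (grid, block) tuples from the suffix ncu appends to kernel names."""
    m = GRID_RE.search(ncu_name)
    if not m:
        return None
    grid = tuple(int(x) for x in m.group(1, 2, 3))
    block = tuple(int(x) for x in m.group(4, 5, 6))
    return grid, block

def match_by_config(nsys_rows, ncu_rows):
    """Pair rows by (grid, block) in launch order.

    nsys rows are assumed to carry 'grid' and 'block' tuples (hypothetical
    schema for whatever columns the GPU trace export provides).
    """
    pools = defaultdict(deque)
    for row in ncu_rows:
        pools[launch_config(row["name"])].append(row)
    pairs = []
    for row in nsys_rows:
        pool = pools.get((row["grid"], row["block"]))
        if pool:
            pairs.append((row, pool.popleft()))
    return pairs
```

This is obviously ambiguous when many kernels share a launch configuration, so I would only apply it to entries left unmatched by the name-based pass.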
I am running the tools with the following configuration:
ncu --target-processes all --launch-count 1 --filter-mode per-launch-config --device 0 -f -o <output_file> python script.py …
nsys profile -x true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown -w true --output <output_file> --gpu-metrics-device=0 python script.py …
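To get both reports into a form the matching script can consume, I export them to CSV roughly like this (file names are placeholders; flags assume reasonably recent ncu/nsys versions and may need adjusting):

```shell
# Dump per-kernel metrics from the ncu report as CSV, one row per profiled launch.
ncu --import output.ncu-rep --csv --page raw > ncu_kernels.csv

# Dump the GPU kernel trace (start, duration, grid/block, name) from the nsys report.
nsys stats --report cuda_gpu_trace --format csv --output trace output.nsys-rep
```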
Is there a better way to solve my original problem of obtaining SM utilization curves, or can you suggest improvements to make this workaround more effective?