Unable to map kernel names between NSYS and NCU reports

Hello,

I am trying to obtain SM utilization traces. However, I have not found an existing tool that provides this functionality. As a workaround, I am attempting to map nsys and ncu together to get an SM utilization and memory bandwidth utilization trace.

ncu provides per-kernel SM utilization, while nsys gives a GPU trace with a start time and duration for each kernel. I use the kernel names to correlate ncu's SM and memory-bandwidth data with nsys's start and end times.
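
One way to make such names comparable is to normalize them before matching. Below is a minimal, heuristic sketch (the function name is mine, and the sample strings are shortened stand-ins for the real demangled names); it handles differing lambda instance IDs and ncu's trailing launch-configuration suffix, but not cases where the template names genuinely differ between tools:

```python
import re

def normalize_kernel_name(name: str) -> str:
    """Reduce a demangled kernel name to a form comparable across tools.

    Strips the trailing launch-configuration/context suffix that ncu
    appends (e.g. " (960, 1, 1)x(128, 1, 1), Context 1, ...") and erases
    lambda instance numbers, which can differ between nsys and ncu.
    """
    # Drop ncu's trailing "(grid)x(block), Context ..., Stream ..." suffix.
    name = re.sub(r"\s*\(\d+, \d+, \d+\)x\(\d+, \d+, \d+\).*$", "", name)
    # Erase lambda instance numbers so "(instance 3)" == "(instance 10)".
    name = re.sub(r"\(instance \d+\)", "(instance N)", name)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", name).strip()

# Shortened stand-ins for the ncu and nsys spellings of the same kernel:
ncu_name = ("void foo<[lambda() (instance 3)]>(int) "
            "(960, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0")
nsys_name = "void foo<[lambda() (instance 10)]>(int)"
assert normalize_kernel_name(ncu_name) == normalize_kernel_name(nsys_name)
```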

However, this process leads to several issues:

  1. Instance ID Mapping: When a kernel name contains a lambda instance ID, there can be multiple potential matches in the ncu list, and I am unsure how to handle these.

NCU:
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6) (960, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(long) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6) (29, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
Nsys:
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 10)]::operator ()() const::[lambda(c10::Half) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6)

Observe here that the instance IDs are different across NSYS and NCU.

  2. Kernel Name Variations: The same kernel is reported under slightly different names by nsys and ncu.

NCU:
void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>, at::detail::Array<char *, (int)2>>(int, T2, T3) (2400, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>, at::detail::Array<char *, (int)2>>(int, T2, T3) (128, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 1, CC 8.0
Nsys:
void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::native::AUnaryFunctor<c10::Half, c10::Half, c10::Half, at::native::binary_internal::MulFunctor>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)

  3. Unmapped Kernels: Some kernels, such as sinh, cosh, and exph, appear in nsys but not in ncu. These run for very short periods, and I do not understand why they are present in one but not the other. How do I map these?
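
After normalizing names, the remaining pairing could be done by occurrence order, with nsys launches that have no ncu counterpart mapped to None. This is a sketch under two assumptions of mine: rows are plain dicts with a "name" key, and ncu recorded launches in the same order as nsys (which does not hold with --launch-count 1, as discussed below):

```python
from collections import defaultdict

def pair_launches(nsys_rows, ncu_rows, normalize):
    """Pair launches by normalized name and occurrence order.

    nsys launches with no remaining ncu counterpart are paired with None,
    covering kernels that appear in the nsys trace but not in the ncu report.
    """
    # Group ncu launches by normalized name, preserving order.
    buckets = defaultdict(list)
    for row in ncu_rows:
        buckets[normalize(row["name"])].append(row)

    pairs = []
    counters = defaultdict(int)  # per-name occurrence index into buckets
    for row in nsys_rows:
        key = normalize(row["name"])
        i = counters[key]
        match = buckets[key][i] if i < len(buckets[key]) else None
        counters[key] += 1
        pairs.append((row, match))
    return pairs
```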

I am running the tools with the following configuration:

ncu --target-processes all --launch-count 1 --filter-mode per-launch-config --device 0 -f -o <output_file> python script.py …

nsys profile -x true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown -w true --output <output_file> --gpu-metrics-device=0 python script.py …

Is there a better way to solve my original problem of obtaining SM utilization curves, or can you suggest improvements to make this workaround more effective?

All recent versions of Nsight Compute and Nsight Systems use the same code for demangling kernel names. If you are seeing differences in the names, my assumption would be that you are comparing different kernels. From the data you provided, no conclusion to the contrary can be drawn, as it doesn't show the full list of kernels reported by each tool, the version of each tool, etc. In your command, you are instructing ncu to collect only one instance of each kernel/launch-configuration pair, so it's not clear why you expect this list to match the nsys output.

> Some kernels, such as sinh, cosh, and exph, appear in nsys but not in ncu. These run for very short periods, and I do not understand why they are present in one but not the other. How do I map these?

Where are these kernels coming from? ncu does not actively filter out any kernels beyond what is configured via its options.

To see the SM utilization over time for a single kernel or a range, you can use Nsight Compute’s PM Sampling feature, e.g. by collecting --set full or --section PmSampling. If you require the info for a range, you will need to select the appropriate replay mode and define the necessary range boundaries.
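
A sketch of what such an invocation might look like, assuming a recent Nsight Compute version that ships the PmSampling section (the output file names are placeholders of mine; range replay additionally requires range markers in the application, e.g. cudaProfilerStart/Stop):

```shell
# Per-kernel PM sampling (SM utilization over time within each launch):
ncu --section PmSampling -o pm_per_kernel python script.py

# Sampling across a range instead of a single kernel: switch the replay
# mode and define the range boundaries in the application.
ncu --section PmSampling --replay-mode range -o pm_range python script.py
```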

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.