What is the meaning of Operations in Nsight Systems?

I am profiling a CUDA application. When I check the trace in the NVIDIA Nsight Systems UI, the CUDA summary lists the CUDA operations that were called. One of them is ampere_sgemm_64x64_tn. I wanted to know what this call means, and where I can find documentation for operations like void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::<unnamed>::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 8)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3). Thanks.

I’m going to set you up with someone with more CUDA knowledge than I have, @jcohen

But I am pretty sure that is application code rather than NVIDIA code.

Thanks @hwilper . I forgot to mention, the application I am profiling is a PyTorch model inference.

Hi Puneeth,

From the symbol names you’ve mentioned, it looks like these are in the ATen component of PyTorch, documented here:

https://pytorch.org/cppdocs/

Unfortunately I don’t see any info there about elementwise_kernel or direct_copy_kernel_cuda. I did find the source code of gpu_kernel_impl here:

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/CUDALoops.cuh

In general, the “CUDA operations” described in Nsight Systems are CUDA “kernel launches”, which you can read more about in the CUDA documentation. In short, a CUDA kernel is like a function you call from the CPU that runs on the GPU. Kernels typically replace the kind of code you would otherwise run on the CPU in very tight multidimensional loops; the GPU runs those loop iterations in parallel on different GPU cores. Going by the kernel’s name, we can guess a few things: SGEMM is the classic linear-algebra-library name for single-precision general matrix multiply. “Ampere” is the name of an NVIDIA GPU architecture, so this kernel implementation is likely optimized for Ampere chips, 64x64 most likely refers to the tile size the kernel processes per thread block, and tn probably encodes the transpose modes of the two input matrices (transposed, non-transposed).
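To make “kernel launch” concrete, here is a minimal sketch (hypothetical names, not PyTorch code; error checking omitted) of a kernel whose launch would show up as a CUDA operation row in Nsight Systems:

```cuda
#include <cuda_runtime.h>

// A toy elementwise kernel: each GPU thread handles one array index,
// replacing what would be a tight loop on the CPU.
__global__ void scale_kernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // This launch is what Nsight Systems records as a "CUDA operation":
    // 4096 blocks of 256 threads each run scale_kernel on the GPU.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

If you compile this with nvcc and profile it with nsys, the name scale_kernel appears in the CUDA summary the same way ampere_sgemm_64x64_tn does in your trace.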

I’m not sure where this matrix multiply kernel is implemented – there are many layers here. PyTorch may have its own implementation, but I couldn’t find it by googling. It may also be that PyTorch calls into other libraries for its GEMM operations. Some examples of NVIDIA-provided GEMM libraries are:

cuBLAS - https://docs.nvidia.com/cuda/cublas/index.html
CUTLASS - https://github.com/NVIDIA/cutlass
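As an illustration of how a library GEMM gets invoked, here is a sketch of a cuBLAS single-precision GEMM call (assumes the CUDA toolkit and cuBLAS are installed; error handling omitted). The dimensions and transpose flags here are illustrative, not taken from your trace:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 64, n = 64, k = 64;
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, m * k * sizeof(float));
    cudaMalloc(&d_B, k * n * sizeof(float));
    cudaMalloc(&d_C, m * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Single-precision GEMM: C = alpha * op(A) * op(B) + beta * C.
    // With OP_T/OP_N, A is treated as transposed and B as not, matching
    // the "tn" suffix pattern. Internally, cuBLAS selects an
    // architecture-specific kernel, which is likely where names such as
    // ampere_sgemm_64x64_tn in the profile come from.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k, &alpha, d_A, k, d_B, k, &beta, d_C, m);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

The caller only ever sees the cublasSgemm API; the concrete kernel name you observe in Nsight Systems is an internal implementation detail chosen at runtime.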

I don’t know if I’ve answered your question, but hopefully this info sends you in a direction that will get you better answers.

Thanks for the insight @jasoncohen. I wanted to find documentation for such functions. Pardon my beginner knowledge, but I think the SGEMM kernels from NVIDIA are closed source, right? I was also unable to find a kernel with that name in the PyTorch repository. Thanks for the pointers; I will keep searching in the direction you suggested.