What is the meaning of Operations in Nsight Systems?

I am profiling a CUDA application. When I check the trace in the NVIDIA Nsight Systems UI, the CUDA summary lists the CUDA operations that were called. One of them is ampere_sgemm_64x64_tn. I wanted to know what this call means, and where I can find documentation for operations like void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::<unnamed>::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 8)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3). Thanks.

I’m going to set you up with someone with more CUDA knowledge than I have, @jcohen

But I am pretty sure that is application code rather than NVIDIA code.

Thanks @hwilper . I forgot to mention, the application I am profiling is a PyTorch model inference.

Hi Puneeth,

From the symbol names you’ve mentioned, it looks like these are in the ATen component of PyTorch, documented here:

https://pytorch.org/cppdocs/

Unfortunately I don’t see any info there about elementwise_kernel or direct_copy_kernel_cuda. I did find the source code of gpu_kernel_impl here:

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/CUDALoops.cuh

In general, the “CUDA operations” described in Nsight Systems are CUDA “kernel launches”, which you can read more about in the CUDA documentation. In short, a CUDA kernel is like a function you call from the CPU that runs on the GPU. Kernels typically replace the kind of code you would otherwise run on the CPU in very tight multidimensional loops; the GPU runs those loop iterations in parallel on different GPU cores. Going by the kernel’s name, we can guess a few things: SGEMM is the classic linear-algebra-library name for single-precision general matrix multiply. “Ampere” is the name of an NVIDIA GPU architecture, so this kernel implementation is likely optimized for Ampere chips, 64x64 most likely refers to the tile size the kernel processes per thread block, and tn probably encodes the transpose modes of the two input matrices (transposed, non-transposed).
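To make “kernel launch” concrete, here is a minimal sketch (hypothetical names, not PyTorch code; error checking omitted) of a kernel whose launch would show up as a CUDA operation row in Nsight Systems:

```cuda
#include <cuda_runtime.h>

// A toy elementwise kernel: each GPU thread handles one array index,
// replacing what would be a tight loop on the CPU.
__global__ void scale_kernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // This launch is what Nsight Systems records as a "CUDA operation":
    // 4096 blocks of 256 threads each run scale_kernel on the GPU.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

If you compile this with nvcc and profile it with nsys, the name scale_kernel appears in the CUDA summary the same way ampere_sgemm_64x64_tn does in your trace.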

I’m not sure where this matrix multiply kernel is implemented – there are many layers here. PyTorch may have its own implementation, but I couldn’t find it by googling. It may also be that PyTorch calls into other libraries for its GEMM operations. Some examples of NVIDIA-provided GEMM libraries are:

cuBLAS - https://docs.nvidia.com/cuda/cublas/index.html
CUTLASS - https://github.com/NVIDIA/cutlass
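As an illustration of how a library GEMM gets invoked, here is a sketch of a cuBLAS single-precision GEMM call (assumes the CUDA toolkit and cuBLAS are installed; error handling omitted). The dimensions and transpose flags here are illustrative, not taken from your trace:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 64, n = 64, k = 64;
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, m * k * sizeof(float));
    cudaMalloc(&d_B, k * n * sizeof(float));
    cudaMalloc(&d_C, m * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Single-precision GEMM: C = alpha * op(A) * op(B) + beta * C.
    // With OP_T/OP_N, A is treated as transposed and B as not, matching
    // the "tn" suffix pattern. Internally, cuBLAS selects an
    // architecture-specific kernel, which is likely where names such as
    // ampere_sgemm_64x64_tn in the profile come from.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k, &alpha, d_A, k, d_B, k, &beta, d_C, m);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

The caller only ever sees the cublasSgemm API; the concrete kernel name you observe in Nsight Systems is an internal implementation detail chosen at runtime.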

I don’t know if I’ve answered your question, but hopefully this info sends you in a direction that will get you better answers.

Thanks for the insight @jasoncohen. I wanted to find documentation for such functions. Pardon my beginner knowledge, but I think the SGEMM kernels from NVIDIA are closed source, right? I was also unable to find a kernel with that name in the PyTorch repository. Thanks for the pointers; I will keep searching in the direction you suggested.