To begin with, my main question is: when profiling a CUDA application, is there any way to differentiate CUDA core usage from Tensor Core usage?
To try to answer this, I decided to start with the samples from NVIDIA's cuTENSOR library, profiled with Nsight Compute's GUI. Enabling the "Compute Workload Analysis" section shows utilization for the various execution pipelines of the SMs. I was curious to see that the Tensor pipeline was not used at all. I understand that Tensor Cores are meant for mixed-precision operations, and I haven't checked the data types used by the profiled application. However, I assumed the cuTENSOR library was an API explicitly intended for leveraging Tensor Core processing.
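For reference, I believe the same section can also be collected with the Nsight Compute CLI; this is a sketch based on my reading of the ncu documentation, with `./my_app` standing in for the profiled sample:

```
ncu --section ComputeWorkloadAnalysis ./my_app
```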
I haven't studied it carefully, but at first glance it appears to be using the float data type, which is FP32. There isn't any Tensor Core support for FP32 on any current CUDA GPU. Tensor Core support on the latest GPUs includes FP64, FP16, FP8, INT8, and others (INT4, TF32, etc.).
There may be Tensor Core usage if the data types are appropriate. Which data types qualify depends on the GPU you are using, but FP32, for example, will not use Tensor Cores.
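As an illustration of the data-type point (using cuBLAS rather than cuTENSOR only because the call is compact; this is a sketch, not a statement about what the cuTENSOR samples do), a GEMM with FP16 inputs like the one below is Tensor Core eligible on Volta and newer, whereas the same call with CUDA_R_32F inputs generally is not. Assumes cuBLAS 11 or later; error checking omitted.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs with FP32 accumulation: eligible for Tensor Cores on Volta and newer.
// The same call with CUDA_R_32F input types would run on the regular FP32 pipes.
void gemm_fp16(cublasHandle_t handle, int n,
               const __half *A, const __half *B, float *C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 A, CUDA_R_16F, n,
                 B, CUDA_R_16F, n,
                 &beta,
                 C, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F,   // FP32 accumulation
                 CUBLAS_GEMM_DEFAULT);
}
```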
There are also metrics you can ask the profiler for that can indicate Tensor Core usage. There is a blog post covering this, as well as various questions on this topic on various forums.
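For example (a sketch; the exact metric names vary by GPU architecture and Nsight Compute version, and `ncu --query-metrics` will list what your GPU exposes), an invocation along these lines reports tensor-pipe activity per kernel, where non-zero values indicate Tensor Core (HMMA) instructions were executed:

```
ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.sum ./my_app
```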
Questions specifically about cuTENSOR should be posted on the libraries forum.
Thank you. This information was extremely helpful. As a follow-up question: are the Warp Matrix Functions the ONLY way to access Tensor Core operations?
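To clarify what I mean by that, here is roughly how I understand the wmma API (from mma.h) is used; a minimal single-tile sketch of a 16x16x16 half-precision multiply-accumulate launched with one warp (e.g. <<<1, 32>>>), not production code:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B (FP16 in, FP32 out).
__global__ void wmma_tile(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);    // issues Tensor Core MMA operations
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```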