Nsight Profile of NVIDIA/CUDALibrarySamples/cuTENSOR: Does it use Tensor cores?

To begin with, my main question is: when profiling a CUDA application, is there any way to differentiate CUDA core usage versus Tensor core usage?

To try to answer this, I decided to start with the samples from NVIDIA's cuTENSOR library, using Nsight Compute's GUI. Enabling the "Compute Workload Analysis" section shows utilization for the various execution pipelines of the SMs. I was curious to see that the Tensor pipeline was not used at all. I understand that Tensor cores are reserved for mixed-precision operations, and I haven't checked the data types used by the profiled application. However, I assumed the cuTENSOR library was an API explicitly for leveraging Tensor core processing.

A snapshot of Nsight's Compute Workload Analysis:

  1. Does the line for “Tensor” necessarily indicate the Tensor core usage?

  2. This profile is of the sample contraction.cu. Does it not require Tensor cores?

  3. Does the cuTENSOR library not necessarily leverage Tensor cores?

  4. Also, any tips on profiling CUDA core usage versus Tensor core usage?

Any insight or links to other resources to gain some insight would be much appreciated!

OS: CentOS 7
CUDA: 11.5
GPU: Tesla V100


I haven't studied it carefully, but at first glance it appears to be using `float` data types, i.e. FP32. There isn't any Tensor Core support for FP32 on any current CUDA GPUs. Tensor Core support on the latest GPUs includes FP64, FP16, FP8, INT8, and others (INT4, TF32, etc.).

There may be Tensor Core usage if the data types are appropriate. Which data types are appropriate depends on the GPU you are using, but, for example, FP32 will not use Tensor Cores.

There are also metrics you can ask the profiler for that can indicate Tensor Core usage. There is a blog post on this, as well as various questions on this topic on various forums.
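As a sketch of the metrics approach (exact metric names vary by GPU architecture and Nsight Compute version, so confirm them with `ncu --query-metrics` on your own system), collecting a tensor-pipe utilization metric might look like:

```shell
# List available metrics whose names mention "tensor"
# (available names depend on your GPU and ncu version)
ncu --query-metrics | grep -i tensor

# Profile a kernel and report tensor-pipe utilization; a nonzero
# value indicates Tensor Core activity during the kernel
ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active ./contraction
```

If the reported utilization is zero, the kernel did not exercise the Tensor pipeline, which matches what the "Compute Workload Analysis" section shows graphically.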

Questions specifically about cuTENSOR should be posted on the libraries forum.

For profiler-specific questions, there are profiler forums.


Thank you. This information was extremely helpful. As a follow-up question: are the Warp Matrix Functions the ONLY way to access Tensor unit operations?

You can access Tensor Core functionality via:

  • CUDA C++ intrinsics from the link you provided
  • PTX
  • Various CUDA libraries, including cuBLAS
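As a minimal sketch of the first route, the CUDA C++ warp matrix (`nvcuda::wmma`) API: this is illustrative code added here, not from the thread. One warp performs a 16x16x16 matrix multiply-accumulate with FP16 inputs and FP32 accumulation, which executes on the Tensor units (requires compute capability 7.0+, e.g. `nvcc -arch=sm_70`):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile: D = A * B (FP16 inputs, FP32 accumulator).
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

Profiling a kernel like this should show nonzero Tensor pipeline utilization, unlike the FP32 contraction sample.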
