Query the number of tensor cores on the GPU?

Is there any CUDA API available for this? It seems that cudaDeviceGetAttribute cannot return the tensor core count.

Haven’t got any replies yet

No API exists for that, at least not at the level of the CUDA runtime. See the contents of cudaDeviceProp: CUDA Runtime API :: CUDA Toolkit Documentation
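For reference, here is a minimal sketch (assuming a single GPU, device 0) of what the runtime does expose: cudaDeviceProp carries the device name, compute capability, and SM count, but nothing resembling a tensor core count.

```cpp
// Dump the device properties that are available from the CUDA runtime.
// Note there is no field for the number of tensor cores (per SM or total).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: sm_%d%d, %d SMs\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
```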

Why do you need to know? In my thinking, this should be an implementation artifact transparent to CUDA applications.

We want to calculate the maximum tensor core perf we could get on different GPUs.

And why is that? I sense a possible XY-problem here.

FWIW, the number of tensor cores does not necessarily translate straight into some theoretical speed-of-light performance number, as there could be multiple different kinds of tensor cores, each with different performance characteristics.

  1. We know exactly which data types we are using.
  2. We hard-code the number of tensor cores for each data type.
  3. We calculate the peak perf.

Now we want to use an API for step 2 instead of hard-coding it.
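For concreteness, step 3 is just a product of per-architecture constants. As an illustration using published A100 (sm_80) figures: 108 SMs × 4 tensor cores per SM × 256 FP16 FMAs per tensor core per clock × 2 FLOPs per FMA × 1.41 GHz ≈ 312 dense FP16 tensor TFLOP/s. The only input we cannot get from an API today is the per-SM tensor core count (and the per-core, per-data-type rate).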

I think I expressed myself poorly. Let me try again:

Why do you want/need to “calculate the peak perf”? What does the application do with that data?

In any event, if you would like to propose changes to CUDA APIs, you can always file an enhancement request with NVIDIA (via the bug reporting page).

We are not building applications. We are working on the Triton compiler, which targets optimal perf for kernels that use tensor cores, and we capture all the data types when compiling a kernel. So for testing purposes, we want to know the gap between the current perf and the theoretical peak perf.

The legacy solution, as I mentioned, reads the GPU arch (e.g., sm_70) and looks it up in a configuration table we put together.
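Roughly, that lookup amounts to the sketch below: query the compute capability with cudaDeviceGetAttribute and map it to hand-maintained per-architecture constants. The table entries here are illustrative; in practice we fill them in from the architecture whitepapers for each data type we support.

```cpp
// Sketch of the "arch -> configuration table" approach: the runtime gives us
// the compute capability, SM count, and clock; the tensor core constants are
// hard-coded per architecture (illustrative values below).
#include <cstdio>
#include <cuda_runtime.h>

struct TensorCoreInfo {
    int coresPerSM;        // tensor cores per SM for this architecture
    int fp16FlopsPerClk;   // dense FP16 FLOPs per tensor core per clock
};

int main() {
    int dev = 0, major = 0, minor = 0, sms = 0, clockKHz = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, dev);
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, dev);

    TensorCoreInfo info{};
    switch (major * 10 + minor) {          // e.g. 70 == sm_70, 80 == sm_80
        case 70: info = {8, 128}; break;   // Volta-class (illustrative)
        case 80: info = {4, 512}; break;   // Ampere-class (illustrative)
        default:
            std::printf("sm_%d%d is not in our table\n", major, minor);
            return 1;
    }

    double peakTflops = double(sms) * info.coresPerSM * info.fp16FlopsPerClk
                        * (clockKHz * 1e3) / 1e12;
    std::printf("sm_%d%d: ~%.0f FP16 tensor TFLOP/s theoretical peak\n",
                major, minor, peakTflops);
    return 0;
}
```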

Thanks for the explanation.

I am not sure that comparing against some theoretical never-to-be-exceeded speed-of-light value is the best way to track the performance of compiled kernels. There are various hardware-related factors, outside a compiler's control, that could cause any particular kernel to fall far short of that mark. An example would be setting HPCG performance in relation to theoretical FLOPS, which has no practical relevance in my book.

My preference would be to track test-kernel performance over time (across compiler versions) and make sure it is monotonically increasing; I have worked with compiler developers fairly extensively. But I accept that different philosophies exist.

Thanks for the reply!

To be clear, we are not comparing against some theoretical peak to test every GPU kernel we work on. We are also computing arithmetic intensity to build rooflines, but unlike Nsight Compute we want tensor core rooflines, not just fp32/fp64 rooflines. (I'm not sure whether ncu has tensor core rooflines now, but if it does, I suppose NVIDIA should expose the API.)
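The roofline part itself is just attainable = min(peak, arithmetic intensity × bandwidth). A minimal sketch with assumed, illustrative numbers (the tensor peak would come from the per-arch table above, the bandwidth from the device memory specs):

```cpp
// Minimal tensor core roofline sketch. All inputs are assumed/illustrative:
// the tensor peak comes from the per-architecture table, the bandwidth from
// the device's memory specs, and the arithmetic intensity from the kernel.
#include <algorithm>
#include <cstdio>

int main() {
    double peakTensorTflops = 312.0;   // assumed dense FP16 tensor peak, TFLOP/s
    double memBwTBps        = 1.555;   // assumed HBM bandwidth, TB/s
    double aiFlopsPerByte   = 64.0;    // arithmetic intensity of the kernel

    // A kernel is memory-bound below the ridge point and compute-bound above it.
    double bound = std::min(peakTensorTflops, aiFlopsPerByte * memBwTBps);
    std::printf("roofline bound at AI=%.0f FLOP/B: %.1f TFLOP/s\n",
                aiFlopsPerByte, bound);
    return 0;
}
```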

Creating a roofline model makes sense, of course. This is good background information you would want to include should you decide to file an RFE (enhancement request) with NVIDIA.

Got you. Thanks for letting me know.