Can I get the number of Tensor cores of my GPU?

Eric_Song · December 12, 2022, 4:19pm

Hi, guys,
I want to know if I can get the number of Tensor cores of my GPU through the CUDA API.

Your answer and guidance will be appreciated!

Robert_Crovella · December 12, 2022, 4:31pm

There isn’t any API to do that.

Eric_Song · December 12, 2022, 4:34pm

Thank you sincerely for the response!

Eric_Song · December 14, 2022, 3:25pm

Hi, @Robert_Crovella ,
since I cannot get the number of Tensor cores through the CUDA API, is it explicitly described in some official documentation?

Robert_Crovella · December 14, 2022, 3:34pm

There are various architecture whitepapers that indicate the number of tensor cores (TC). I won’t be able to give you a laundry list of all of them, and it’s quite possible that this method doesn’t cover every possible GPU that has TC. You would have to come up with your own calculations based on knowing how many TC there are per SM for the specific architecture of your GPU, then multiply that by the number of SMs in your GPU.

Here is an example. The GA102 whitepaper covers, for example, the RTX 3090, which is an sm_86 GPU. Table 9 in that document indicates both the number of TC per SM and the TC per GPU. If you have a 3090, then the TC per GPU indicated there is the answer. However, if you have a 3080, or some other sm_86 architecture GPU, you could multiply the TC per SM number there by the number of SMs in your sm_86 GPU, and get the answer that way.

Although it doesn’t cover TC, a similar methodology can be used to calculate CUDA cores per GPU, and a sample method is given here for C++ and here for Python. A similar method could be used for TC. You would have to come up with the multiplication factors for each sm type, perhaps by studying the relevant architecture whitepapers.
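As a rough sketch of that multiplication, the snippet below keeps a small table of TC per SM keyed by compute capability and multiplies by the SM count. The helper name `estimate_tensor_cores`, the table layout, and the per-SM factors are assumptions for illustration (the factors are read from the public architecture whitepapers and should be checked against the whitepaper for your own GPU); none of this is a CUDA API.

```cpp
#include <cassert>

// Tensor cores per SM by compute capability. These factors are assumptions
// read from NVIDIA's public architecture whitepapers; check them against the
// whitepaper for your own GPU before relying on them.
struct TcPerSm {
    int major;
    int minor;
    int tc_per_sm;
};

static const TcPerSm kTcPerSm[] = {
    {7, 0, 8},  // Volta, e.g. V100
    {7, 5, 8},  // Turing; overcounts for parts without TC such as GTX 1660
    {8, 0, 4},  // Ampere GA100
    {8, 6, 4},  // Ampere GA102 family, e.g. RTX 3090 / 3080
};

// Hypothetical helper: estimated tensor cores for a device, 0 for pre-Volta
// architectures, -1 when the architecture is not in the table.
int estimate_tensor_cores(int major, int minor, int sm_count) {
    assert(sm_count >= 0);
    if (major < 7) return 0;
    for (const TcPerSm& e : kTcPerSm) {
        if (e.major == major && e.minor == minor) {
            return e.tc_per_sm * sm_count;
        }
    }
    return -1;
}
```

In practice the three arguments would come from the `major`, `minor`, and `multiProcessorCount` fields that cudaGetDeviceProperties fills in. For an RTX 3090 (82 SMs), `estimate_tensor_cores(8, 6, 82)` returns 328.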

cnmcdee · December 14, 2022, 4:37pm

It should be added to cudaGetDeviceProperties, because that function already returns a LOT of information:

The properties I am dumping from the call:

I would suggest stepping through this code on a card that has Tensor cores: with the latest toolkit, the struct might hold a tensor core parameter. The 1660 does not have Tensor cores.

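#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Fills props[0..n-1] with the properties of every CUDA device found.
// The caller must size props for all devices; the device_count argument is
// overwritten with the value reported by cudaGetDeviceCount.
void Cuda_device_poll(cudaDeviceProp* props, int device_count)
{
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0)
    {
        printf("No cuda devices found exiting...\n");
        exit(-1);
    }
    printf("Found %d cuda device...\n", device_count);
    for (int t = 0; t < device_count; t++)
    {
        cudaGetDeviceProperties(&props[t], t);  // query device t, not always device 0
    }
}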

void Cuda_print_properties(cudaDeviceProp* prop)
{
    int card_count  = Cuda_device_count();
    for (int t = 0; t < card_count; t++)
    {
        cudaDeviceProp a = prop[t];
        printf("Device: %d [%s] \n", t, a.name);
        unsigned char uuid[16] = {0};
        for (int u = 0; u < 16; u++)
        {
           uuid[u] = a.uuid.bytes[u];
        }
        // %02x keeps the leading zero of bytes below 0x10
        printf("               UUID: %02x%02x%02x%02x-%02x%02x%02x%02x-%02x%02x%02x%02x-%02x%02x%02x%02x\n",uuid[0], uuid[1], uuid[2], uuid[3], uuid[4], uuid[5], uuid[6], uuid[7], uuid[8],
               uuid[9], uuid[10], uuid[11], uuid[12], uuid[13], uuid[14], uuid[15]);
        unsigned char luid[8] = {0};
        for (int u = 0; u < 8; u++)
        {
            luid[u] = a.luid[u];
        }
        printf("               LUID: %02x%02x%02x%02x-%02x%02x%02x%02x\n", luid[0], luid[1], luid[2], luid[3], luid[4], luid[5], luid[6], luid[7]);

        printf(" luidDeviceNodeMask: %u\n", a.luidDeviceNodeMask);
        char metric[20] = {0};
        size_t total_mem = a.totalGlobalMem;
        char_to_NB(total_mem, metric);
        printf("     totalGlobalMem: %zu\n", a.totalGlobalMem);
        printf("  sharedMemPerBlock: %zu\n", a.sharedMemPerBlock);
        printf("       regsPerBlock: %d\n", a.regsPerBlock);
        printf("           warpSize: %d\n", a.warpSize);
        printf("           memPitch: %zu\n", a.memPitch);
        printf("          clockRate: %d\n", a.clockRate);
        printf("    memoryClockRate: %d\n", a.memoryClockRate);
        printf("     memoryBusWidth: %d\n", a.memoryBusWidth);
        printf(" maxThreadsPerBlock: %d\n", a.maxThreadsPerBlock);
        printf("maxThreadsPerMultiP: %d\n", a.maxThreadsPerMultiProcessor);
        printf("ShrdMemPerMultiProc: %zu\n", a.sharedMemPerMultiprocessor);
        printf("regsPerMultiProcess: %d\n", a.regsPerMultiprocessor);
        printf("   maxThreadsDim(x): %d\n", a.maxThreadsDim[0]);
        printf("   maxThreadsDim(y): %d\n", a.maxThreadsDim[1]);
        printf("   maxThreadsDim(z): %d\n", a.maxThreadsDim[2]);
        printf("     maxGridSize(x): %d\n", a.maxGridSize[0]);
        printf("     maxGridSize(y): %d\n", a.maxGridSize[1]);
        printf("     maxGridSize(z): %d\n", a.maxGridSize[2]);
        printf("      totalConstMem: %zu\n", a.totalConstMem);
        printf("              major: %d\n", a.major);
        printf("              minor: %d\n", a.minor);
        printf("   textureAlignment: %zu\n", a.textureAlignment);
        printf("      deviceOverlap: %d\n", a.deviceOverlap);
        printf("multiProcessorCount: %d\n", a.multiProcessorCount);
        printf("kernelExecTimeOEnab: %d\n", a.kernelExecTimeoutEnabled);
        printf("         integrated: %d\n", a.integrated);
        printf("   canMapHostMemory: %d\n", a.canMapHostMemory);
        printf("        computeMode: %d\n", a.computeMode);
        printf("       maxTexture1D: %d\n", a.maxTexture1D);
        printf(" maxTexture1DMipmap: %d\n", a.maxTexture1DMipmap);
        printf(" maxTexture1DLinear: %d\n", a.maxTexture1DLinear);
        printf("   surfaceAlignment: %zu\n", a.surfaceAlignment);
        printf("  concurrentKernels: %d\n", a.concurrentKernels);
        printf("         ECCEnabled: %d\n", a.ECCEnabled);
        printf("           pciBusID: %d\n", a.pciBusID);
        printf("        pciDeviceID: %d\n", a.pciDeviceID);
        printf("        pciDomainID: %d\n", a.pciDomainID);
        printf("          tccDriver: %d\n", a.tccDriver);
        printf("   asyncEngineCount: %d\n", a.asyncEngineCount);
        printf(" streamPrioritiesSp: %d\n", a.streamPrioritiesSupported);
        printf("globalL1CacheSupprt: %d\n", a.globalL1CacheSupported);
        printf(" localL1CacheSupprt: %d\n", a.localL1CacheSupported);
        printf("\n");
    }
}

Robert_Crovella · December 14, 2022, 4:51pm

You can always file a bug requesting that.

Eric_Song · December 14, 2022, 6:07pm

I have filed this as a request.

Robert_Crovella · December 14, 2022, 9:28pm

My proposal (determine architecture from cudaGetDeviceProperties, determine TC per SM from arch whitepapers, multiply) won’t work for at least some cases. In particular there exist members of the sm_75 family that have no TC units, such as GTX 1660 (and others) as well as other members that certainly do have TC units (such as RTX 2060 and others). So you cannot simply get the architecture and multiply as I indicated. That will not work in all cases. A full treatment would require additional qualifying information. It might be possible to build a table based on the GPU name reported from cudaGetDeviceProperties.
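A name-based check along those lines might look like the sketch below. The function name `likely_has_tensor_cores` and the exclusion list are hypothetical, and the list is deliberately incomplete: it only encodes the GTX 16xx case mentioned above, so other sm_75 parts without TC would need their own entries.

```cpp
#include <cassert>
#include <cstring>

// Substrings of device names that identify sm_75 parts without tensor cores.
// Assumption: this list is illustrative and incomplete.
static const char* const kNoTcNameFragments[] = {
    "GTX 16",  // e.g. GTX 1650 / 1660 family
};

// Hypothetical helper: true when the device is expected to have tensor cores,
// based on the compute capability and the name reported in cudaDeviceProp.
bool likely_has_tensor_cores(const char* name, int major, int minor) {
    assert(name != nullptr);
    if (major < 7) return false;  // pre-Volta: no tensor cores
    if (major == 7 && minor == 5) {
        for (const char* fragment : kNoTcNameFragments) {
            if (std::strstr(name, fragment) != nullptr) return false;
        }
    }
    return true;
}
```

The `name`, `major`, and `minor` arguments correspond to the fields of the same names in `cudaDeviceProp`. A device that passes this check could then have its count estimated by multiplying TC per SM by `multiProcessorCount`.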
