Can I get the number of Tensor cores of my GPU?

Hi, guys,
I want to know if I can get the number of Tensor cores of my GPU with the CUDA API.

Your answer and guidance will be appreciated!

There isn’t any API to do that.

Thank you sincerely for the response!

Hi, @Robert_Crovella ,
Since I cannot get the number of Tensor cores through the CUDA API, is it stated explicitly in any official documentation?

There are various architecture whitepapers that indicate the number of tensor cores (TC). I won’t be able to give you a laundry list of all of them, and it’s quite possible that this method doesn’t cover every GPU that has TC. You would have to do your own calculation: find out how many TC there are per SM for the specific architecture of your GPU, then multiply that by the number of SMs in your GPU.

Here is an example. The GA102 whitepaper specifically covers, for example, the RTX 3090, which is an sm_86 GPU. Table 9 in that document gives both the number of TC per SM and the number of TC per GPU. If you have a 3090, the TC per GPU figure given there is the answer. If you have a 3080, or some other sm_86 GPU, you can multiply the TC per SM number given there by the number of SMs in your GPU and get the answer that way.
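The multiplication itself can be sketched as below. This is only an illustration, not an official API: the function name tensorCoresPerGpu is hypothetical, and in a real program smCount would come from the multiProcessorCount field filled in by cudaGetDeviceProperties. The figures in the usage note (4 TC per SM for sm_86, 82 SMs on the RTX 3090, 68 SMs on the RTX 3080) are assumptions that should be checked against Table 9 of the whitepaper.

```cpp
#include <cassert>

// Hypothetical helper: tensor cores per GPU = (TC per SM) x (number of SMs).
// tcPerSm is read from the architecture whitepaper for your GPU;
// smCount is what cudaGetDeviceProperties reports as multiProcessorCount.
int tensorCoresPerGpu(int tcPerSm, int smCount)
{
    assert(tcPerSm >= 0 && smCount >= 0);  // both inputs are counts
    return tcPerSm * smCount;
}
```

Under those assumed figures, tensorCoresPerGpu(4, 82) gives 328 for the 3090 and tensorCoresPerGpu(4, 68) gives 272 for the 3080.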

Although it doesn’t cover TC, a similar methodology can be used to calculate CUDA cores per GPU; a sample method is given here for C++ and here for Python. A similar method could be used for TC. You would have to come up with the multiplication factor for each sm type, perhaps by studying the relevant architecture whitepapers.
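A table of per-architecture multiplication factors might look like the sketch below. The struct, the table and the function name are hypothetical, and the factors (8 TC per SM for sm_70 and sm_75, 4 for sm_80 and sm_86) are assumptions taken from a reading of the architecture whitepapers, so verify each one before relying on it. The table is deliberately incomplete, and an entry only says what the architecture offers per SM, not that every GPU of that architecture actually has TC.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical lookup entry: compute capability -> tensor cores per SM.
struct ArchFactor {
    int major;
    int minor;
    int tcPerSm;
};

// Assumed factors; check each against the relevant whitepaper.
static const ArchFactor kTcPerSm[] = {
    {7, 0, 8},  // sm_70
    {7, 5, 8},  // sm_75
    {8, 0, 4},  // sm_80
    {8, 6, 4},  // sm_86
};

// Returns the assumed TC-per-SM factor, or -1 if the architecture
// is not in the table.
int tensorCoresPerSm(int major, int minor)
{
    const std::size_t n = sizeof(kTcPerSm) / sizeof(kTcPerSm[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (kTcPerSm[i].major == major && kTcPerSm[i].minor == minor) {
            return kTcPerSm[i].tcPerSm;
        }
    }
    return -1;
}
```

With a cudaDeviceProp named prop, the per-GPU estimate would then be tensorCoresPerSm(prop.major, prop.minor) * prop.multiProcessorCount, provided the lookup did not return -1.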

It should be added to cudaGetDeviceProperties, because that function already returns a LOT of information.

Below are the properties I am dumping from the call.

It might be worth stepping through this code with the latest toolkit on a card that has Tensor cores, to see whether the struct already holds a tensor core parameter. My 1660 does not have Tensor cores.

#include <stdio.h>
#include <cuda_runtime.h>

void Cuda_device_poll(cudaDeviceProp* props, int device_count)
{
    if (device_count <= 0) {
        printf("No cuda devices found, exiting...\n");
        return;
    }
    printf("Found %d cuda device(s)...\n", device_count);
    for (int t = 0; t < device_count; t++) {
        /* Fill props[t] with the properties of device t. */
        cudaGetDeviceProperties(&props[t], t);
    }
}

/* Cuda_device_count() and char_to_NB() are helpers of mine that are not shown in this post. */
void Cuda_print_properties(cudaDeviceProp* prop)
{
    int card_count = Cuda_device_count();
    for (int t = 0; t < card_count; t++) {
        cudaDeviceProp a = prop[t];
        printf("Device: %d [%s] \n", t, a.name);

        /* Copy to unsigned char so bytes >= 0x80 are not sign-extended when printed. */
        unsigned char uuid[16] = {0};
        for (int u = 0; u < 16; u++)
            uuid[u] = (unsigned char)a.uuid.bytes[u];
        printf("               UUID: %02x%02x%02x%02x-%02x%02x%02x%02x-%02x%02x%02x%02x-%02x%02x%02x%02x\n",
               uuid[0], uuid[1], uuid[2], uuid[3], uuid[4], uuid[5], uuid[6], uuid[7],
               uuid[8], uuid[9], uuid[10], uuid[11], uuid[12], uuid[13], uuid[14], uuid[15]);
        unsigned char luid[8] = {0};
        for (int u = 0; u < 8; u++)
            luid[u] = (unsigned char)a.luid[u];
        printf("               LUID: %02x%02x%02x%02x-%02x%02x%02x%02x\n",
               luid[0], luid[1], luid[2], luid[3], luid[4], luid[5], luid[6], luid[7]);

        printf(" luidDeviceNodeMask: %u\n", a.luidDeviceNodeMask);
        char metric[20] = {0};
        size_t total_mem = a.totalGlobalMem;
        char_to_NB(total_mem, metric);
        printf("     totalGlobalMem: %zu\n", a.totalGlobalMem);
        printf("  sharedMemPerBlock: %zu\n", a.sharedMemPerBlock);
        printf("       regsPerBlock: %d\n", a.regsPerBlock);
        printf("           warpSize: %d\n", a.warpSize);
        printf("           memPitch: %zu\n", a.memPitch);
        printf("          clockRate: %d\n", a.clockRate);
        printf("    memoryClockRate: %d\n", a.memoryClockRate);
        printf("     memoryBusWidth: %d\n", a.memoryBusWidth);
        printf(" maxThreadsPerBlock: %d\n", a.maxThreadsPerBlock);
        printf("maxThreadsPerMultiP: %d\n", a.maxThreadsPerMultiProcessor);
        printf("ShrdMemPerMultiProc: %zu\n", a.sharedMemPerMultiprocessor);
        printf("regsPerMultiProcess: %d\n", a.regsPerMultiprocessor);
        printf("   maxThreadsDim(x): %d\n", a.maxThreadsDim[0]);
        printf("   maxThreadsDim(y): %d\n", a.maxThreadsDim[1]);
        printf("   maxThreadsDim(z): %d\n", a.maxThreadsDim[2]);
        printf("     maxGridSize(x): %d\n", a.maxGridSize[0]);
        printf("     maxGridSize(y): %d\n", a.maxGridSize[1]);
        printf("     maxGridSize(z): %d\n", a.maxGridSize[2]);
        printf("      totalConstMem: %zu\n", a.totalConstMem);
        printf("              major: %d\n", a.major);
        printf("              minor: %d\n", a.minor);
        printf("   textureAlignment: %zu\n", a.textureAlignment);
        printf("      deviceOverlap: %d\n", a.deviceOverlap);
        printf("multiProcessorCount: %d\n", a.multiProcessorCount);
        printf("kernelExecTimeOEnab: %d\n", a.kernelExecTimeoutEnabled);
        printf("         integrated: %d\n", a.integrated);
        printf("   canMapHostMemory: %d\n", a.canMapHostMemory);
        printf("        computeMode: %d\n", a.computeMode);
        printf("       maxTexture1D: %d\n", a.maxTexture1D);
        printf(" maxTexture1DMipmap: %d\n", a.maxTexture1DMipmap);
        printf(" maxTexture1DLinear: %d\n", a.maxTexture1DLinear);
        printf("   surfaceAlignment: %zu\n", a.surfaceAlignment);
        printf("  concurrentKernels: %d\n", a.concurrentKernels);
        printf("         ECCEnabled: %d\n", a.ECCEnabled);
        printf("           pciBusID: %d\n", a.pciBusID);
        printf("        pciDeviceID: %d\n", a.pciDeviceID);
        printf("        pciDomainID: %d\n", a.pciDomainID);
        printf("          tccDriver: %d\n", a.tccDriver);
        printf("   asyncEngineCount: %d\n", a.asyncEngineCount);
        printf(" streamPrioritiesSp: %d\n", a.streamPrioritiesSupported);
        printf("globalL1CacheSupprt: %d\n", a.globalL1CacheSupported);
        printf(" localL1CacheSupprt: %d\n", a.localL1CacheSupported);
    }
}

You can always file a bug requesting that.

I have filed this as a request.

My proposal (determine architecture from cudaGetDeviceProperties, determine TC per SM from arch whitepapers, multiply) won’t work for at least some cases. In particular there exist members of the sm_75 family that have no TC units, such as GTX 1660 (and others) as well as other members that certainly do have TC units (such as RTX 2060 and others). So you cannot simply get the architecture and multiply as I indicated. That will not work in all cases. A full treatment would require additional qualifying information. It might be possible to build a table based on the GPU name reported from cudaGetDeviceProperties.
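A name-based table of that kind could be sketched as follows. Everything here is hypothetical (the struct, the table, the function name), and the table lists only the two sm_75 parts named above; the factor of 8 TC per SM for the RTX 2060 is an assumption to be checked against the Turing whitepaper. Matching on a fragment of the name is also an assumption, made because the exact string reported in the name field of cudaDeviceProp can vary.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical entry: fragment of the reported GPU name -> TC per SM.
struct NameFactor {
    const char* nameFragment;
    int tcPerSm;
};

// Only the two sm_75 examples from this thread; a real table would
// need one entry per GPU model.
static const NameFactor kTcPerSmByName[] = {
    {"GTX 1660", 0},  // sm_75 without TC units
    {"RTX 2060", 8},  // sm_75 with TC units (assumed 8 per SM)
};

// Returns the estimated number of tensor cores, or -1 if the GPU name
// is not in the table. deviceName and smCount correspond to the name
// and multiProcessorCount fields of cudaDeviceProp.
int tensorCoresByName(const char* deviceName, int smCount)
{
    assert(deviceName != nullptr && smCount > 0);
    for (const NameFactor& entry : kTcPerSmByName) {
        if (std::strstr(deviceName, entry.nameFragment) != nullptr) {
            return entry.tcPerSm * smCount;
        }
    }
    return -1;
}
```

Returning -1 for names that are not in the table keeps "unknown" distinct from "known to have zero TC", which is exactly the distinction the sm_75 family requires.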
