Concurrent execution of CUDA and Tensor cores

nish511 · August 4, 2022, 10:46pm

I am trying some DL optimization and want to pipeline tiles of matmul output from Tensor core to following vector ops in cuda core. Can I run tensor core and cuda core concurrently on two different tiles of data in register file/shared memory?

Robert_Crovella · August 5, 2022, 12:34am

A tensor core instruction is an instruction like any other SASS instruction. A tensor core unit is a unit like any other functional unit in the GPU SM. A warp scheduler can theoretically issue a tensor core instruction in one cycle, and any other instruction (or even another tensorcore instruction) in the next cycle.

There are no inherent scheduling restrictions between tensor core activity and other types of GPU SM activity.

At the SASS level, tensorcore operands, as well as most other types of instruction operands, come from the register file. It is up to the compiler to determine how it will use registers, and how many registers. It’s also up to the compiler to determine how it will order or schedule instructions in the instruction stream.

You have essentially no control over any of this at the CUDA C++ or PTX level, and no tools whatsoever to perform source code programming at the SASS level.

To get a baseline understanding of how GPUs schedule work, I would suggest this series particularly unit 3 or 1 through 4.

nish511 · August 5, 2022, 2:28am

Thanks for such a great explanation. Follow up question:

Does it mean that technically nothing prohibits a tensor instruction and a cuda instruction issued back to back as long as register file can keep up but in practice compiler does not generally schedules this way? Is there a way to verify if compiler can schedule instruction such that tensor core and cuda core can operate simultaneously?

Robert_Crovella · August 5, 2022, 4:02am

Yes, that is what it means.

I don’t know where you got that. If the compiler did not schedule tensor core instructions along with other instructions, what else would it be doing? NOP? Empty space? Maybe you are mixing up what the compiler does and what the warp scheduler does. The warp scheduler can indeed end up with “empty space”. We call that a stall. (Technically, a warp is what will stall. But if all the warps assigned to a warp scheduler are stalled, then that SMSP - SM Sub Partition is effectively “stalled” or unable to issue.)

The CUDA binary utilities allow you to inspect exactly what the compiler did.

nish511 · August 5, 2022, 2:33pm

So if tensor core instruction and cuda instruction are issued back to back (1 cycle apart) it means tensor core and cuda core are executing their instruction one cycle apart and not in parallel. So SM can execute either tensor core instruction or cuda instruction in any given cycle and not both in same cycle. Is that true understanding?

Robert_Crovella · August 5, 2022, 2:40pm

Generally no, not correct. There are at least 2 factors to consider:

All instructions are pipelined and have latency. For example, a multiply instruction issued in cycle 0 may not produce a result until e.g. cycle 4. Likewise for tensor core op (wmma). So if an ordinary multiply is issued in cycle 0, producing its result in cycle 4, and a tensor core op is issued in cycle 1, producing its result in cycle 5 (for the sake of discussion), then during cycles 2 and 3 the SM (functional units) are actively involved in processing both ops at the same time.
Many modern SMs are broken into sub-partitions. Each sub-partition has a warp scheduler. So in the exact same cycle it is possible for a warp scheduler in sub-partition 0 to issue a tensor core op, while a warp scheduler in sub-partition 1 issues an ordinary multiply. These instructions would target separate functional units, of course.

plutohk99 · August 14, 2022, 9:44am

On the second point, can you give me an example? Thank you!

Robert_Crovella · August 14, 2022, 6:20pm

refer to page 22 here That is a picture of a single A100 SM. Note that there are 4 warp schedulers.

plutohk99 · August 16, 2022, 2:50am

Ok, thank you. I think I understand what you said.

nish511 · August 19, 2022, 10:50pm

I am trying various experiments to get the desired behavior but no success yet. Can CUDA core and Tensor core in same WARP be executing in parallel?

Lets assume I have
C = matmul(A,B)
D = multiply(C, constant)

I can break this like following

cycle 0:
C1 = matmul(A1,B1)

cycle 1:
C2 = matlul(A2,B2)
D1 = multiply(C1, constant)

In cycle 1 I have cuda core and tensor core working in parallel on different data within same warp. Is this possible.

If not, can one warp write to regfile of another warp to get this behavior? Can this be done in cuda c++?

njuffa · August 20, 2022, 12:31am

What GPU are you using? By which method do you determine “success” or the lack thereof? What does the CUDA profiler tell you about the bottlenecks in your code?

Robert_Crovella · August 20, 2022, 1:27am

I don’t think this is a very sensible exercise. Let me state some of the reasons why I think this. Please prepend each statement below with “I personally …”

Would not try to write my own tensorcore code, except under extreme duress. I would use a library like CUBLAS, or CUTLASS, to do the matrix-matrix multiply, and be very happy that someone else did the complicated work for me.
Would not try to worry about detailed instruction stream ordering. That is the compiler’s job, and the compiler is very smart and much smarter than me. I have essentially no direct control over it, anyway.
Have no idea how to verify if I actually achieved what you are suggesting or not. I have some knowledge of our profilers, and don’t know of any way, other than statistical/inferential methods, to actually confirm that a particular instruction issue sequence took place. I think it is more-or-less impossible. As far as I know, the relevant profiler (Nsight compute) gives you no way to inspect the behavior of a warp scheduler at a detailed level (like, a trace of its activity). As far as I know, the relevant profiler (Nsight compute) gives you no way to inspect the cycle-by-cycle pipeline activity of a particular functional unit (e.g. a tensor core unit, or a SP unit).
Don’t know why this would be an important objective:

In cycle 1 I have cuda core and tensor core working in parallel on different data within same warp. Is this possible.

That doesn’t reflect any of the important CUDA optimization objectives that I am aware of, and really the only reason I know of that people use CUDA for compute tasks is to make their code run faster. I don’t have any intuition or instruction that teaches me that that objective has anything to do with making my code run faster. It appears to me to be akin to a request to hold your breath and juggle at the same time.

So if I wanted to see if I could hold my breath and juggle at the same time (an activity that for me, personally, has no particular usefulness), I would probably just write a CUDA kernel that issued a wmma op followed by a multiply op. If that “worked” (whatever that means) great. If it didn’t “work”, I would not spend any more time trying to juggle while holding my breath. It’s not worth the effort. For me, personally, there is no return on investment.

So I would do something like this:

$ cat t2101.cu
#include <mma.h>
#include <iostream>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c, float *d, float *e, float *f) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
   float d_val = d[threadIdx.x];
   float e_val = e[threadIdx.x];
   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
   float f_val = d_val * e_val;

   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
   f[threadIdx.x] = f_val;
}

int main(){

  half *d_a, *h_a, *d_b, *h_b;
  float *d_c, *h_c, *d_d, *d_e, *d_f;
  h_c = new float[16*16];
  h_b = new half[16*16];
  h_a = new half[16*16];
  cudaMalloc(&d_a, 16*16*sizeof(half));
  cudaMalloc(&d_b, 16*16*sizeof(half));
  cudaMalloc(&d_c, 16*16*sizeof(float));
  cudaMalloc(&d_d,    32*sizeof(float));
  cudaMalloc(&d_e,    32*sizeof(float));
  cudaMalloc(&d_f,    32*sizeof(float));
  for (int i = 0; i < 16*16; i++) {
    h_a[i] = 1.0f;
    h_b[i] = 1.0f;}
  for (int i = 0; i < 32; i++) h_c[i] = 2.0f;
  cudaMemcpy(d_d, h_c,    32*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_e, h_c,    32*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_a, h_a, 16*16*sizeof(half), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, 16*16*sizeof(half), cudaMemcpyHostToDevice);
  wmma_ker<<<1,32>>>(d_a, d_b, d_c, d_d, d_e, d_f);
  cudaMemcpy(h_c, d_c, 16*16*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 16*16; i++) std::cout << h_c[i] << ",";
  std::cout << std::endl;
  cudaMemcpy(h_c, d_f,   32*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 32; i++) std::cout << h_c[i] << ",";
  std::cout << std::endl;
}
$ nvcc -o t2101 t2101.cu -arch=sm_70
$ compute-sanitizer ./t2101
========= COMPUTE-SANITIZER
16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
========= ERROR SUMMARY: 0 errors
$ cuobjdump -sass ./t2101

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_70

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_70
                Function : _Z8wmma_kerP6__halfS0_PfS1_S1_S1_
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                          /* 0x00000a00ff017624 */
                                                                                                    /* 0x000fc400078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                                    /* 0x000000fffffff389 */
                                                                                                    /* 0x000fe200000e00ff */
        /*0020*/                   S2R R24, SR_LANEID ;                                             /* 0x0000000000187919 */
                                                                                                    /* 0x000e220000000000 */
        /*0030*/                   MOV R5, RZ ;                                                     /* 0x000000ff00057202 */
                                                                                                    /* 0x000fe40000000f00 */
        /*0040*/                   SHF.R.U32.HI R23, RZ, 0x2, R24.reuse ;                           /* 0x00000002ff177819 */
                                                                                                    /* 0x101fe40000011618 */
        /*0050*/                   SHF.R.U32.HI R22, RZ, 0x4, R24 ;                                 /* 0x00000004ff167819 */
                                                                                                    /* 0x000fe40000011618 */
        /*0060*/                   LOP3.LUT R23, R23, 0x3, RZ, 0xc0, !PT ;                          /* 0x0000000317177812 */
                                                                                                    /* 0x000fe400078ec0ff */
        /*0070*/                   LOP3.LUT R2, R22, 0x1, RZ, 0xc0, !PT ;                           /* 0x0000000116027812 */
                                                                                                    /* 0x000fe400078ec0ff */
        /*0080*/                   SHF.R.U32.HI R21, RZ, 0x1, R23 ;                                 /* 0x00000001ff157819 */
                                                                                                    /* 0x000fe20000011617 */
        /*0090*/                   IMAD.SHL.U32 R3, R23, 0x8, RZ ;                                  /* 0x0000000817037824 */
                                                                                                    /* 0x000fe200078e00ff */
        /*00a0*/                   LOP3.LUT R24, R24, 0x3, RZ, 0xc0, !PT ;                          /* 0x0000000318187812 */
                                                                                                    /* 0x000fc600078ec0ff */
        /*00b0*/                   IMAD R4, R21, 0x2, R2 ;                                          /* 0x0000000215047824 */
                                                                                                    /* 0x000fe200078e0202 */
        /*00c0*/                   LOP3.LUT R3, R3, 0x8, RZ, 0xe2, !PT ;                            /* 0x0000000803037812 */
                                                                                                    /* 0x000fc600078ee2ff */
        /*00d0*/                   IMAD.SHL.U32 R4, R4, 0x4, RZ ;                                   /* 0x0000000404047824 */
                                                                                                    /* 0x000fe400078e00ff */
        /*00e0*/                   IMAD R2, R2, 0x4, R3 ;                                           /* 0x0000000402027824 */
                                                                                                    /* 0x000fe400078e0203 */
        /*00f0*/                   IMAD.MOV.U32 R3, RZ, RZ, RZ ;                                    /* 0x000000ffff037224 */
                                                                                                    /* 0x000fe400078e00ff */
        /*0100*/                   IMAD.WIDE.U32 R4, R24, 0x10, R4 ;                                /* 0x0000001018047825 */
                                                                                                    /* 0x000fc800078e0004 */
        /*0110*/                   IMAD.WIDE.U32 R2, R24, 0x10, R2 ;                                /* 0x0000001018027825 */
                                                                                                    /* 0x000fc600078e0002 */
        /*0120*/                   LEA R26, P1, R4, c[0x0][0x168], 0x1 ;                            /* 0x00005a00041a7a11 */
                                                                                                    /* 0x000fc800078208ff */
        /*0130*/                   LEA R18, P0, R2, c[0x0][0x160], 0x1 ;                            /* 0x0000580002127a11 */
                                                                                                    /* 0x000fe400078008ff */
        /*0140*/                   LEA.HI.X R27, R4, c[0x0][0x16c], R5, 0x1, P1 ;                   /* 0x00005b00041b7a11 */
                                                                                                    /* 0x000fe400008f0c05 */
        /*0150*/                   LEA.HI.X R19, R2, c[0x0][0x164], R3, 0x1, P0 ;                   /* 0x0000590002137a11 */
                                                                                                    /* 0x000fcc00000f0c03 */
        /*0160*/                   LDG.E.64.SYS R16, [R26] ;                                        /* 0x000000001a107381 */
                                                                                                    /* 0x000ea800001eeb00 */
        /*0170*/                   LDG.E.64.SYS R14, [R18] ;                                        /* 0x00000000120e7381 */
                                                                                                    /* 0x000ea800001eeb00 */
        /*0180*/                   LDG.E.64.SYS R4, [R18+0x80] ;                                    /* 0x0000800012047381 */
                                                                                                    /* 0x000ee800001eeb00 */
        /*0190*/                   LDG.E.64.SYS R6, [R26+0x80] ;                                    /* 0x000080001a067381 */
                                                                                                    /* 0x000ee800001eeb00 */
        /*01a0*/                   S2R R2, SR_TID.X ;                                               /* 0x0000000000027919 */
                                                                                                    /* 0x000e220000002100 */
        /*01b0*/                   IMAD.MOV.U32 R0, RZ, RZ, 0x4 ;                                   /* 0x00000004ff007424 */
                                                                                                    /* 0x000fc800078e00ff */
        /*01c0*/                   IMAD.WIDE.U32 R28, R2, R0, c[0x0][0x180] ;                       /* 0x00006000021c7625 */
                                                                                                    /* 0x001fd000078e0000 */
        /*01d0*/                   LDG.E.SYS R20, [R28] ;                                           /* 0x000000001c147381 */
                                                                                                    /* 0x000f2200001ee900 */
        /*01e0*/                   HMMA.884.F32.F32.STEP0 R8, R14.reuse.COL, R16.reuse.ROW, RZ ;    /* 0x000000100e087236 */
                                                                                                    /* 0x0c4fe400000052ff */
        /*01f0*/                   HMMA.884.F32.F32.STEP1 R10, R14.reuse.COL, R16.reuse.ROW, RZ ;   /* 0x000000100e0a7236 */
                                                                                                    /* 0x0c0fe4000000d2ff */
        /*0200*/                   HMMA.884.F32.F32.STEP2 R12, R14.reuse.COL, R16.reuse.ROW, RZ ;   /* 0x000000100e0c7236 */
                                                                                                    /* 0x0c0fe400000152ff */
        /*0210*/                   HMMA.884.F32.F32.STEP3 R14, R14.COL, R16.ROW, RZ ;               /* 0x000000100e0e7236 */
                                                                                                    /* 0x000f64000001d2ff */
        /*0220*/                   HMMA.884.F32.F32.STEP0 R8, R4.reuse.COL, R6.reuse.ROW, R8 ;      /* 0x0000000604087236 */
                                                                                                    /* 0x0e8fe20000005208 */
        /*0230*/                   LDG.E.64.SYS R16, [R18+0x100] ;                                  /* 0x0001000012107381 */
                                                                                                    /* 0x0000a200001eeb00 */
        /*0240*/                   HMMA.884.F32.F32.STEP1 R10, R4.reuse.COL, R6.reuse.ROW, R10 ;    /* 0x00000006040a7236 */
                                                                                                    /* 0x0c0fe4000000d20a */
        /*0250*/                   HMMA.884.F32.F32.STEP2 R12, R4.reuse.COL, R6.reuse.ROW, R12 ;    /* 0x00000006040c7236 */
                                                                                                    /* 0x0c0fe4000001520c */
        /*0260*/                   HMMA.884.F32.F32.STEP3 R14, R4.COL, R6.ROW, R14 ;                /* 0x00000006040e7236 */
                                                                                                    /* 0x000b64000001d20e */
        /*0270*/                   LDG.E.64.SYS R4, [R18+0x180] ;                                   /* 0x0001800012047381 */
                                                                                                    /* 0x0200e800001eeb00 */
        /*0280*/                   LDG.E.64.SYS R6, [R26+0x180] ;                                   /* 0x000180001a067381 */
                                                                                                    /* 0x0002e800001eeb00 */
        /*0290*/                   LDG.E.64.SYS R18, [R26+0x100] ;                                  /* 0x000100001a127381 */
                                                                                                    /* 0x0012a400001eeb00 */
        /*02a0*/                   IMAD.WIDE.U32 R26, R2, R0, c[0x0][0x178] ;                       /* 0x00005e00021a7625 */
                                                                                                    /* 0x002fd000078e0000 */
        /*02b0*/                   LDG.E.SYS R3, [R26] ;                                            /* 0x000000001a037381 */
                                                                                                    /* 0x000f2200001ee900 */
        /*02c0*/                   IMAD.SHL.U32 R25, R22, 0x4, RZ ;                                 /* 0x0000000416197824 */
                                                                                                    /* 0x000fca00078e00ff */
        /*02d0*/                   LOP3.LUT R24, R25, 0x4, R24, 0xe2, !PT ;                         /* 0x0000000419187812 */
                                                                                                    /* 0x000fe200078ee218 */
        /*02e0*/                   IMAD.SHL.U32 R23, R23, 0x8, RZ ;                                 /* 0x0000000817177824 */
                                                                                                    /* 0x000fc600078e00ff */
        /*02f0*/                   LOP3.LUT R22, R24.reuse, 0x2, RZ, 0xc0, !PT ;                    /* 0x0000000218167812 */
                                                                                                    /* 0x040fe400078ec0ff */
        /*0300*/                   LOP3.LUT R24, R24, 0x5, RZ, 0xc0, !PT ;                          /* 0x0000000518187812 */
                                                                                                    /* 0x000fe400078ec0ff */
        /*0310*/                   LEA R22, R21, R22, 0x3 ;                                         /* 0x0000001615167211 */
                                                                                                    /* 0x000fe400078e18ff */
        /*0320*/                   LOP3.LUT R21, R23, 0x8, R24, 0xe2, !PT ;                         /* 0x0000000817157812 */
                                                                                                    /* 0x000fe200078ee218 */
        /*0330*/                   IMAD.MOV.U32 R23, RZ, RZ, RZ ;                                   /* 0x000000ffff177224 */
                                                                                                    /* 0x000fc800078e00ff */
        /*0340*/                   IMAD.WIDE.U32 R22, R21, 0x10, R22 ;                              /* 0x0000001015167825 */
                                                                                                    /* 0x000fe200078e0016 */
        /*0350*/                   HMMA.884.F32.F32.STEP0 R8, R16.reuse.COL, R18.reuse.ROW, R8 ;    /* 0x0000001210087236 */
                                                                                                    /* 0x0c4fe40000005208 */
        /*0360*/                   HMMA.884.F32.F32.STEP1 R10, R16.reuse.COL, R18.reuse.ROW, R10 ;  /* 0x00000012100a7236 */
                                                                                                    /* 0x0c0fe4000000d20a */
        /*0370*/                   HMMA.884.F32.F32.STEP2 R12, R16.reuse.COL, R18.reuse.ROW, R12 ;  /* 0x00000012100c7236 */
                                                                                                    /* 0x0c0fe4000001520c */
        /*0380*/                   HMMA.884.F32.F32.STEP3 R14, R16.COL, R18.ROW, R14 ;              /* 0x00000012100e7236 */
                                                                                                    /* 0x000f64000001d20e */
        /*0390*/                   HMMA.884.F32.F32.STEP0 R8, R4.reuse.COL, R6.reuse.ROW, R8 ;      /* 0x0000000604087236 */
                                                                                                    /* 0x0e8f620000005208 */
        /*03a0*/                   LEA R16, P0, R22.reuse, c[0x0][0x170], 0x2 ;                     /* 0x00005c0016107a11 */
                                                                                                    /* 0x040fe200078010ff */
        /*03b0*/                   HMMA.884.F32.F32.STEP1 R10, R4.reuse.COL, R6.reuse.ROW, R10 ;    /* 0x00000006040a7236 */
                                                                                                    /* 0x0c0f64000000d20a */
        /*03c0*/                   HMMA.884.F32.F32.STEP2 R12, R4.reuse.COL, R6.reuse.ROW, R12 ;    /* 0x00000006040c7236 */
                                                                                                    /* 0x0c0f64000001520c */
        /*03d0*/                   HMMA.884.F32.F32.STEP3 R14, R4.COL, R6.ROW, R14 ;                /* 0x00000006040e7236 */
                                                                                                    /* 0x000b62000001d20e */
        /*03e0*/                   LEA.HI.X R17, R22, c[0x0][0x174], R23, 0x2, P0 ;                 /* 0x00005d0016117a11 */
                                                                                                    /* 0x000fe200000f1417 */
        /*03f0*/                   FMUL R5, R20, R3 ;                                               /* 0x0000000314057220 */
                                                                                                    /* 0x030fc40000400000 */
        /*0400*/                   IMAD.WIDE.U32 R2, R2, R0, c[0x0][0x188] ;                        /* 0x0000620002027625 */
                                                                                                    /* 0x000fca00078e0000 */
        /*0410*/                   STG.E.64.SYS [R16], R8 ;                                         /* 0x0000000810007386 */
                                                                                                    /* 0x000fe8000010eb00 */
        /*0420*/                   STG.E.64.SYS [R16+0x80], R10 ;                                   /* 0x0000800a10007386 */
                                                                                                    /* 0x000fe8000010eb00 */
        /*0430*/                   STG.E.64.SYS [R16+0x10], R12 ;                                   /* 0x0000100c10007386 */
                                                                                                    /* 0x000fe8000010eb00 */
        /*0440*/                   STG.E.64.SYS [R16+0x90], R14 ;                                   /* 0x0000900e10007386 */
                                                                                                    /* 0x000fe8000010eb00 */
        /*0450*/                   STG.E.SYS [R2], R5 ;                                             /* 0x0000000502007386 */
                                                                                                    /* 0x000fe2000010e900 */
        /*0460*/                   EXIT ;                                                           /* 0x000000000000794d */
                                                                                                    /* 0x000fea0003800000 */
        /*0470*/                   BRA 0x470;                                                       /* 0xfffffff000007947 */
                                                                                                    /* 0x000fc0000383ffff */
                ..........



Fatbin ptx code:
================
arch = sm_70
code version = [7,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

As you can see, I got “close” but not quite back-to-back in the instruction stream:

        /*03d0*/                   HMMA.884.F32.F32.STEP3 R14, R4.COL, R6.ROW, R14 ;                /* 0x00000006040e7236 */
                                                                                                    /* 0x000b62000001d20e */
        /*03e0*/                   LEA.HI.X R17, R22, c[0x0][0x174], R23, 0x2, P0 ;                 /* 0x00005d0016117a11 */
                                                                                                    /* 0x000fe200000f1417 */
        /*03f0*/                   FMUL R5, R20, R3 ;                                               /* 0x0000000314057220 */

There is that load-effective-address instruction that the compiler decided to schedule after that last tensor core op but before the multiply step.

Note that the LEA instruction in question is computing the value that goes in register R17, which is part of the R16,R17 register pair which comprises the storage base address for d_c. You and I could compute what that value should be before we even launched the kernel, if we were so inclined. That means the compiler could have scheduled it anywhere in the kernel code, right up to the point where the results fragment is being written. It chose to put it there. It might have a very good reason to do so. I have no reason whatsoever to assume that I know better than the compiler how to get good performance out of the machine, and therefore feel like that instruction should be placed elsewhere, so that I can achieve this quixotic quest of having the HMMA and FMUL instruction scheduled (ordered in the instruction stream, no idea about issue sequence) back-to-back.

If you feel this is a worthwhile exercise, feel free to wrestle with the compiler to see if you can get the HMMA instruction and the FMUL instruction to be back-to-back. Good luck!

Later:

Hey, I got lucky! When I switched to compiling for -arch=sm_75 I got different looking code:

$ nvcc -o t2101 t2101.cu -arch=sm_75
$ cuobjdump -sass ./t2101

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_75

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_75
                Function : _Z8wmma_kerP6__halfS0_PfS1_S1_S1_
        .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                    /* 0x00000a00ff017624 */
                                                                                              /* 0x000fc400078e00ff */
        /*0010*/                   S2R R5, SR_LANEID ;                                        /* 0x0000000000057919 */
                                                                                              /* 0x000e220000000000 */
        /*0020*/                   IMAD.MOV.U32 R3, RZ, RZ, RZ ;                              /* 0x000000ffff037224 */
                                                                                              /* 0x000fe200078e00ff */
        /*0030*/                   LOP3.LUT R2, R5, 0x3, RZ, 0xc0, !PT ;                      /* 0x0000000305027812 */
                                                                                              /* 0x001fe400078ec0ff */
        /*0040*/                   SHF.R.U32.HI R5, RZ, 0x2, R5 ;                             /* 0x00000002ff057819 */
                                                                                              /* 0x000fca0000011605 */
        /*0050*/                   IMAD.WIDE.U32 R2, R5, 0x8, R2 ;                            /* 0x0000000805027825 */
                                                                                              /* 0x000fce00078e0002 */
        /*0060*/                   LEA R8, P0, R2.reuse, c[0x0][0x160], 0x2 ;                 /* 0x0000580002087a11 */
                                                                                              /* 0x040fe400078010ff */
        /*0070*/                   LEA R10, P1, R2.reuse, c[0x0][0x168], 0x2 ;                /* 0x00005a00020a7a11 */
                                                                                              /* 0x040fe400078210ff */
        /*0080*/                   LEA.HI.X R9, R2.reuse, c[0x0][0x164], R3.reuse, 0x2, P0 ;  /* 0x0000590002097a11 */
                                                                                              /* 0x140fe400000f1403 */
        /*0090*/                   LEA.HI.X R11, R2, c[0x0][0x16c], R3, 0x2, P1 ;             /* 0x00005b00020b7a11 */
                                                                                              /* 0x000fcc00008f1403 */
        /*00a0*/                   LDG.E.SYS R0, [R8] ;                                       /* 0x0000000008007381 */
                                                                                              /* 0x000ea800001ee900 */
        /*00b0*/                   LDG.E.SYS R18, [R10] ;                                     /* 0x000000000a127381 */
                                                                                              /* 0x000ee800001ee900 */
        /*00c0*/                   LDG.E.SYS R12, [R8+0x10] ;                                 /* 0x00001000080c7381 */
                                                                                              /* 0x000f2800001ee900 */
        /*00d0*/                   LDG.E.SYS R20, [R10+0x10] ;                                /* 0x000010000a147381 */
                                                                                              /* 0x000f2800001ee900 */
        /*00e0*/                   S2R R22, SR_TID.X ;                                        /* 0x0000000000167919 */
                                                                                              /* 0x000e280000002100 */
        /*00f0*/                   LDG.E.SYS R14, [R8+0x100] ;                                /* 0x00010000080e7381 */
                                                                                              /* 0x000f2800001ee900 */
        /*0100*/                   LDG.E.SYS R19, [R10+0x100] ;                               /* 0x000100000a137381 */
                                                                                              /* 0x000f2800001ee900 */
        /*0110*/                   LDG.E.SYS R15, [R8+0x110] ;                                /* 0x00011000080f7381 */
                                                                                              /* 0x000f2800001ee900 */
        /*0120*/                   LDG.E.SYS R21, [R10+0x110] ;                               /* 0x000110000a157381 */
                                                                                              /* 0x000f2200001ee900 */
        /*0130*/                   IMAD.MOV.U32 R23, RZ, RZ, 0x4 ;                            /* 0x00000004ff177424 */
                                                                                              /* 0x000fc800078e00ff */
        /*0140*/                   IMAD.WIDE.U32 R6, R22, R23, c[0x0][0x180] ;                /* 0x0000600016067625 */
                                                                                              /* 0x001fc800078e0017 */
        /*0150*/                   IMAD.WIDE.U32 R4, R22, R23, c[0x0][0x178] ;                /* 0x00005e0016047625 */
                                                                                              /* 0x000fc800078e0017 */
        /*0160*/                   LDG.E.SYS R6, [R6] ;                                       /* 0x0000000006067381 */
                                                                                              /* 0x000f2800001ee900 */
        /*0170*/                   LDG.E.SYS R13, [R4] ;                                      /* 0x00000000040d7381 */
                                                                                              /* 0x00012400001ee900 */
        /*0180*/                   LEA R4, P0, R2, c[0x0][0x170], 0x3 ;                       /* 0x00005c0002047a11 */
                                                                                              /* 0x001fc800078018ff */
        /*0190*/                   LEA.HI.X R5, R2, c[0x0][0x174], R3, 0x3, P0 ;              /* 0x00005d0002057a11 */
                                                                                              /* 0x000fe200000f1c03 */
        /*01a0*/                   IMAD.WIDE.U32 R2, R22, R23, c[0x0][0x188] ;                /* 0x0000620016027625 */
                                                                                              /* 0x000fe200078e0017 */
        /*01b0*/                   MOVM.16.MT88 R16, R0 ;                                     /* 0x000000000010723a */
                                                                                              /* 0x004fe80000000000 */
        /*01c0*/                   MOVM.16.MT88 R18, R18 ;                                    /* 0x000000001212723a */
                                                                                              /* 0x008fe80000000000 */
        /*01d0*/                   MOVM.16.MT88 R17, R12 ;                                    /* 0x000000000c11723a */
                                                                                              /* 0x010e280000000000 */
        /*01e0*/                   MOVM.16.MT88 R20, R20 ;                                    /* 0x000000001414723a */
                                                                                              /* 0x000e680000000000 */
        /*01f0*/                   MOVM.16.MT88 R14, R14 ;                                    /* 0x000000000e0e723a */
                                                                                              /* 0x000fe80000000000 */
        /*0200*/                   MOVM.16.MT88 R19, R19 ;                                    /* 0x000000001313723a */
                                                                                              /* 0x000fe80000000000 */
        /*0210*/                   MOVM.16.MT88 R15, R15 ;                                    /* 0x000000000f0f723a */
                                                                                              /* 0x000ea80000000000 */
        /*0220*/                   MOVM.16.MT88 R21, R21 ;                                    /* 0x000000001515723a */
                                                                                              /* 0x000ee20000000000 */
        /*0230*/                   HMMA.1688.F32 R8, R16, R18, RZ ;                           /* 0x000000121008723c */
                                                                                              /* 0x001f7000000010ff */
        /*0240*/                   HMMA.1688.F32 R24, R16, R20, RZ ;                          /* 0x000000141018723c */
                                                                                              /* 0x002f7000000010ff */
        /*0250*/                   HMMA.1688.F32 R8, R14, R19, R8 ;                           /* 0x000000130e08723c */
                                                                                              /* 0x024f620000001008 */
        /*0260*/                   FMUL R13, R6, R13 ;                                        /* 0x0000000d060d7220 */
                                                                                              /* 0x000fce0000400000 */
        /*0270*/                   HMMA.1688.F32 R24, R14, R21, R24 ;                         /* 0x000000150e18723c */
                                                                                              /* 0x008f6c0000001018 */
        /*0280*/                   STG.E.64.SYS [R4], R8 ;                                    /* 0x0000000804007386 */
                                                                                              /* 0x020fe8000010eb00 */
        /*0290*/                   STG.E.64.SYS [R4+0x200], R10 ;                             /* 0x0002000a04007386 */
                                                                                              /* 0x000fe8000010eb00 */
        /*02a0*/                   STG.E.64.SYS [R4+0x20], R24 ;                              /* 0x0000201804007386 */
                                                                                              /* 0x000fe8000010eb00 */
        /*02b0*/                   STG.E.64.SYS [R4+0x220], R26 ;                             /* 0x0002201a04007386 */
                                                                                              /* 0x000fe8000010eb00 */
        /*02c0*/                   STG.E.SYS [R2], R13 ;                                      /* 0x0000000d02007386 */
                                                                                              /* 0x000fe2000010e900 */
        /*02d0*/                   EXIT ;                                                     /* 0x000000000000794d */
                                                                                              /* 0x000fea0003800000 */
        /*02e0*/                   BRA 0x2e0;                                                 /* 0xfffffff000007947 */
                                                                                              /* 0x000fc0000383ffff */
        /*02f0*/                   NOP;                                                       /* 0x0000000000007918 */
                                                                                              /* 0x000fc00000000000 */
                ..........

See here:

    /*0240*/                   HMMA.1688.F32 R24, R16, R20, RZ ;                          /* 0x000000141018723c */
                                                                                          /* 0x002f7000000010ff */
    /*0250*/                   HMMA.1688.F32 R8, R14, R19, R8 ;                           /* 0x000000130e08723c */
                                                                                          /* 0x024f620000001008 */
    /*0260*/                   FMUL R13, R6, R13 ;                                        /* 0x0000000d060d7220 */
                                                                                          /* 0x000fce0000400000 */
    /*0270*/                   HMMA.1688.F32 R24, R14, R21, R24 ;                         /* 0x000000150e18723c */
                                                                                          /* 0x008f6c0000001018 */

hexexpert5 · December 16, 2022, 12:08am

I too am trying to figure this out. CUDA and Tensor cores both have a large amount of computational capability. If the NVIDIA tools say my GPU is 100% busy then how can that be if I have no Tensor core processing? Why do the NVidia tools only show info for the CUDA cores and not the Tensor cores. Somewhere I saw a reference to you needing to write C code using !!!profiling libraries!!! just to do basic monitoring of your GPU. How many years has NVidia had Tensor cores and why don’t they have a command line tool yet for this?

Robert_Crovella · December 16, 2022, 12:35am

The profilers can show you the relative loading of various pipes, such as math and tensor core. Please ask profiler questions on the appropriate profiler forum. The profilers have command-line operations possible.

If by “NVIDIA tools” you mean nvidia-smi, the behavior is described here. It doesn’t refer to either CUDA or Tensor cores.

You can always request enhancements to the CUDA ecosystem by filing a bug.

hexexpert5 · December 16, 2022, 6:14pm

‘->’ It doesn’t refer to either CUDA or Tensor cores.
From the top line output from nvidia-smi:
NVIDIA-SMI 525.60.11
Driver Version: 525.60.11
CUDA Version: 12.0
Yes, the CUDA mentioned is software to access the CUDA cores. Isn’t there ALSO some NVidia lib’s with a version number which accesses the Tensor cores? Is there some difference in CUDA vs Tensor that seems to create a bias in nearly everything I see referring to cuda this and cuda that as in cuda mem copy? I understand the difference between what operations each can do but I must be missing something.

Yes, I will be looking at the profilers you have, not because I actually want to do “profiling” now, but because it seems that the Tensor stuff is hidden there and not exposed in the same easy out front way as the CUDA stuff is. I am a very low level programmer that has dived into hardware profiling counters, cache latencies, just to scratch the surface. With respect to CUDA/Tensor something is not as it seems and no one seems to be able to explain it. In understanding why they are treated differently only then will I achieve the mastery of this new subject(GPU) like I have for Intel or Sparc CPU’s.

Thanks for you time and I will look into the profiling forum.

hexexpert5 · December 16, 2022, 6:20pm

‘->’ It doesn’t refer to either CUDA or Tensor cores. PART2
So are you saying that when nvidia-smi says the GPU is 100% busy that all computational units, both CUDA and Tensor cores, are active? 100% busy means just that.

njuffa · December 16, 2022, 6:29pm

That is not what it means. Does your OS indicate a CPU as 100% busy even if it is not doing any floating-point computation?

Robert_Crovella · December 16, 2022, 6:54pm

I should have said it does not refer to either CUDA cores or Tensor cores. As indicated in the article I linked, it is largely a time sampled measurement which basically indicates for what percentage of the last sampling period a kernel was running on the GPU. As indicated in the article I linked, it tells you nothing about what that kernel is doing.

A CUDA core in NVIDIA parlance is a SP unit which handles a few different instructions including FFMA and FMUL and FADD. Your kernel doesn’t need to be doing any of these and can still register 100% on the utilization metric provided by nvidia-smi.

Just to add to njuffa’s point, I can run top and witness that a process is using 100% of CPU and yet that process is not doing any AVX, or other vectorization instructions, even though the CPU has those types of execution resources.

hexexpert5 · December 16, 2022, 7:21pm

A “single” CPU or to be more accurate a “single” hardware thread is deemed to be 100% user busy if a software thread is continuously executing on it. However, if I have 10 hardware threads or cores then my system will only be 10% busy(with respect to %usr). Let’s not get into %sys and interrupt processing times.

My GPU has 16,384 CUDA cores and ?512? Tensor cores. If I do a huge matrix multiple(fp32) I suspect it’ll keep ALL the CUDA cores busy. However, the Tensor cores with a similar total computational ability are idle. If I do a fp16 multiply which the Tensor cores can handle and have all of them busy when will nvidia-smi or nvtop show 100% busy, 50% busy or 0% busy because the tools are NOT tracking the Tensor core usage while the CUDA cores are idle?

Robert_Crovella · December 16, 2022, 7:28pm

I don’t know anything about nvtop. For nvidia-smi I have already explained that the tool tracks neither tensor core usage nor CUDA core usage. The article I linked explains this in some detail and also gives a complete test case for you to make your own experiments, if you wish, to confirm these statements.

Topic		Replies	Views
I need help understanding how concurrency of CUDA Cores and Tensor Cores works between Turing and Ampere/Ada? CUDA Programming and Performance cuda , tensorflow , rtx , ampere	10	1889	September 27, 2024
On the register allocation optimization of cuda compiler CUDA Programming and Performance	12	3278	January 20, 2019
Instruction level parallelism and maximum instruction per clock CUDA Programming and Performance	6	1111	September 3, 2019
Is there a document about in which hardware unit(ie. ALU FMU...) an instruction is executed? CUDA Programming and Performance	35	2962	October 5, 2022
Cuda compiler loop unroll bug? CUDA Programming and Performance	14	2437	October 25, 2017
performance difference for cuda between experiments and the documentation for float/double data type... CUDA Programming and Performance	8	1907	October 28, 2016
PTX instruction `mma` not lowered to tensor core related SASS instruction TensorRT	2	1308	March 22, 2022
cuBLAS INT8 tensor core mode vs. FP16 mode GPU-Accelerated Libraries cublas	13	5426	December 5, 2022
Blackwell Integer CUDA Programming and Performance	134	2208	June 11, 2025
Is it dangerous to mix warp shuffles with bitwise or logical operators in same instruction? CUDA Programming and Performance	18	52	March 2, 2025

Concurrent execution of CUDA and Tensor cores

Related topics