How to determine the number of running cores, i.e., the utilization of GPU computing resources

The “Volatile Util” value reported by nvidia-smi tells me the fraction of the sample period during which a kernel was active.

But it doesn’t tell me how many cores are running and how many are idle, i.e., the utilization of the GPU’s computing resources.

Is there any method to get that?

You won’t find any tool that, at an instant in time, tells you anything about utilization at this level.

The profiler can give some statistical information about utilization, but again doesn’t tell you how many cores are “running” or “idle” at any given moment.
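For example, Nsight Compute can report per-kernel, time-averaged utilization figures. A sketch of such an invocation (./my_app is a placeholder; the metric name assumes a recent Nsight Compute version and may differ in yours):

ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./my_app

This reports, as a percentage of peak, how busy the SMs were over the kernel’s execution: again an average, not an instantaneous count of running vs. idle cores.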

A GPU core is not much like a CPU core. A GPU core or “SP” is essentially a floating-point adder/multiplier. That’s it. When a floating-point instruction is dispatched to it, it computes a floating-point addition or multiplication. It doesn’t do anything else. Furthermore, a GPU core/SP, like all functional units in the GPU SM, is pipelined. Whether or not it is busy or doing anything across all of its pipeline stages at any given instant depends on the recent activity history in the SM.

Please note that the metric in nvidia-smi is not “Volatile Util”. The nvidia-smi user interface is confusing: you have to match up the items in a description field, line by line and/or quadrant by quadrant, with the identically arranged data in the data field below it.

In this particular field, there are three items. The top line is Volatile Uncorr. ECC, which shows the number of uncorrected errors that have occurred on the GPU since the driver was initialized. If your GPU doesn’t support ECC, or ECC is disabled, the top line of the data field below will show N/A; otherwise it shows the error count (0 in most cases). This counter is called volatile because it is reset on driver initialization.

In the same description field, there are two items in the bottom line: GPU-Util and Compute-M, which you need to match up with the corresponding data in the bottom line of the data field below it.
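If you’d rather read these values programmatically than parse nvidia-smi’s text layout, the NVML library (the same library nvidia-smi is built on) exposes the GPU-Util figure directly. A minimal sketch (error checking omitted; link with -lnvidia-ml):

#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);          // first GPU in the system

    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(dev, &util);
    // util.gpu is the same percent-of-time-busy number that nvidia-smi
    // reports as GPU-Util; it is not a count of busy vs. idle cores
    printf("GPU-Util: %u%%  Memory-Util: %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
    return 0;
}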

Does it mean that operations of one CUDA thread may be executed on different SPs?

For example:
A kernel:

__global__ void k(void)
{
    float A, D;
    float B = 1.0f, C = 2.0f, E = 3.0f;   // placeholder operands so this compiles
    A = B + C;
    D = A + E;
}

Then we launch it with one block of 2 threads: k<<<1,2,0>>>();

Suppose that there are 2 SPs, SP1 and SP2.

As I understand from the quoted statement, SP1 may execute A = B + C; for all threads and SP2 may execute D = A + E; for all threads, rather than each thread being fully executed on one SP.

All instructions are issued warp-wide. Always. That means when you ask for a 32-bit floating-point addition, the warp scheduler will select 32 SPs (or 16 SPs across two clock cycles) to issue the instruction.

A partial warp doesn’t use fewer than 32 functional units to issue an instruction.

In your example, the corresponding SASS code might look something like this:

FADD  R2, R0, R1    // A = B + C
FADD  R4, R2, R3    // D = A + E

The first FADD instruction will be dispatched to 32 SP units. The second instruction depends on the first, so it cannot be issued until the first instruction’s results are complete. All instructions are pipelined; the pipeline depth might be something like 4 cycles, so the warp will stall for 4 cycles. After 4 cycles, the second FADD will be dispatched to 32 SP units.

This general process is true independent of the number of threads per block. The threads per block will be broken into a set of 32-thread warps, and each warp will be handled this way.
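As a concrete illustration of that decomposition (the kernel name here is mine, for illustration only):

#include <cstdio>

__global__ void warp_info(void)
{
    int warp_id = threadIdx.x / 32;   // which warp within the block
    int lane_id = threadIdx.x % 32;   // lane (position) within that warp
    printf("thread %d -> warp %d, lane %d\n", threadIdx.x, warp_id, lane_id);
}

A launch such as warp_info<<<1, 48>>>(); produces two warps: warp 0 with all 32 lanes active, and warp 1 with only lanes 0 through 15 active. Both warps are scheduled warp-wide as described above.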

Having only 2 active threads in a particular warp doesn’t change any of this. 32 SP units are still needed for scheduling.
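You can observe the partial-warp case from inside a kernel with __activemask(); a small sketch (the kernel name is mine):

#include <cstdio>

__global__ void show_mask(void)
{
    unsigned mask = __activemask();   // bitmask of active lanes in this warp
    if (threadIdx.x == 0)
        printf("active mask: 0x%08x\n", mask);
}

Launched as show_mask<<<1,2>>>(); this prints 0x00000003: only lanes 0 and 1 are active, yet the issue mechanics described above are unchanged.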

Given this discussion, it is not reasonable to assume that a single SP unit would handle a particular operation (a particular instruction) for more than 1 thread in a warp in a given clock cycle.

In the case of Turing, for example, where there are only 16 SP units in a particular SM warp scheduler partition, the instruction issue step will take 2 clock cycles instead of 1. Likewise the results for that instruction will be produced in two subsequent clock cycles (each 4 cycles later than the issue, for this imagined pipeline depth). In that case, a single SP unit would presumably be handling 2 threads in the warp, albeit across two different (and presumably back-to-back) clock cycles.

If we observed the issuance of another instruction type, where there were only 4 supporting functional units in a particular SM partition, then we would surmise that the issuance of an instruction would require 8 clock cycles.

If we took this to the extreme, and found an instruction for which a particular warp scheduler partition only had one functional unit to support that instruction type, then we would presumably observe that instruction issuing over 32 clock cycles, and the same functional unit would effectively handle all the warp lanes, albeit each in different clock cycles. However this wouldn’t be the case for SP units. I’m not aware of any GPU that has fewer than 16 SP units in a warp scheduler partition. You may wish to review any of the architecture whitepapers.
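To summarize the arithmetic in the last few paragraphs as a rule of thumb (for this imagined pipeline):

issue cycles per instruction = 32 (warp width) / (functional units of that type per scheduler partition)

32 units -> 1 cycle
16 units -> 2 cycles  (e.g. FP32 on Turing)
 4 units -> 8 cycles
 1 unit  -> 32 cycles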

None of these mechanics are any different if you have only 2 active threads in a particular warp.

You can get additional treatment of these ideas in the 3rd and 4th sessions in the online training available here.
