SFU Performance in A100

Can anyone tell me how to measure the peak performance of the Special Function Unit in RTX A100 (GA102 Architecture)?

The “Table 3. Throughput of Native Arithmetic Instructions” in the CUDA C++ Programming Guide says the “Number of Results per Clock Cycle per Multiprocessor” for the SFU is 16 on A100. Since A100 has 108 SMs and each SM has 16 SFUs, when it runs at 1410 MHz I think the peak performance should be 1410 * 108 * 16 = 1892.160 GOPs. But I don’t know how to write code that makes the SFU run at full speed. I have already googled, but most people seem to care only about how to accelerate GEMM programs, and it seems no one pays attention to the performance of the SFU. Any answer is welcome.

A100 is not GA102 architecture. (And there is no RTX A100 product.)

I’m a bit puzzled by your question because when I look at table 3 in the programming guide I don’t see any references to SFU or special function unit.

There are references to it elsewhere in the programming guide, but I think it would be good to clarify:

which instructions or operations are you referring to? (for example, pick one from the vertical axis of table 3)
which GPU are you referring to?

Sorry, I mean A100 Tensor Core GPU. Yes it is GA100.
In the CUDA C++ Programming Guide (Design Guide), the relevant operations are listed in section 5.4.1.

So you would write code that issues a lot of those operations.

Have you tried that?

Here is a similar example that does not use the SFU, but it gives the general idea of how a particular operation could be benchmarked for max throughput.

Here’s a possible example; I haven’t checked it carefully:

$ cat t2.cu
#include <iostream>
#include <stdio.h>

#define ITER 1024
#define SSZ 512
#define BSZ 512


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// returns elapsed microseconds since `start` (pass 0 to get a timestamp)
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// Each thread executes a dependent chain of ITER rsqrtf() calls.
// The printf (which never fires for in = 1.0f) keeps the compiler
// from optimizing the loop away.
__global__ void do_rsqrtf(float in) {
    for (int i = 0; i < ITER; i++)
      in = rsqrtf(in);
    if (!in) printf("out = %f\n", in);
}


int main() {
    do_rsqrtf<<<BSZ, SSZ>>>(1.0f);  // warm up
    cudaDeviceSynchronize();
    unsigned long long dt = dtime_usec(0);
    do_rsqrtf<<<BSZ*ITER, SSZ>>>(1.0f);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    // total rsqrtf ops = ITER (per thread) * SSZ (threads/block) * BSZ*ITER (blocks)
    unsigned long long ops = ITER;
    ops *= ITER;
    ops *= SSZ;
    ops *= BSZ;
    float et = dt/(float)USECPSEC;
    unsigned long long Mops = ops/1000000;
    std::cout<<et<<"s for "<< Mops << " Mrsqrtf"<<std::endl;
    float tp = (Mops)/(et*1000000);
    std::cout << "throughput: " << tp << " Tops/s" << std::endl;
}
$ nvcc -arch=sm_80 -o t2 t2.cu
$ ./t2
0.191209s for 274877 Mrsqrtf
throughput: 1.43757 Tops/s
$ nvidia-smi
Fri Dec 10 09:15:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:C3:00.0 Off |                  Off |
| N/A   30C    P0    33W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$

The difference between 1.89 Tops and 1.43 Tops might be due to any number of factors. It’s generally difficult to hit a peak theoretical number, I haven’t paid careful attention to clocking, and it’s possible I made a mistake of some sort. If I do cuobjdump -sass on the above executable, I see long sequences that look like this repeating:

        /*06f0*/                   MUFU.RSQ R2, R0 ;                                           /* 0x0000000000027308 */
                                                                                               /* 0x000e240000001400 */
        /*0700*/              @!P0 FMUL R2, R2, 4096 ;                                         /* 0x4580000002028820 */
                                                                                               /* 0x001fca0000400000 */
        /*0710*/                   FSETP.GEU.AND P0, PT, |R2|, 1.175494350822287508e-38, PT ;  /* 0x008000000200780b */
                                                                                               /* 0x000fda0003f0e200 */
        /*0720*/              @!P0 FMUL R2, R2, 16777216 ;                                     /* 0x4b80000002028820 */
                                                                                               /* 0x000fc80000400000 */
        /*0730*/                   MUFU.RSQ R3, R2 ;                                           /* 0x0000000200037308 */
                                                                                               /* 0x000e240000001400 */
        /*0740*/              @!P0 FMUL R3, R3, 4096 ;                                         /* 0x4580000003038820 */
                                                                                               /* 0x001fca0000400000 */
        /*0750*/                   FSETP.GEU.AND P0, PT, |R3|, 1.175494350822287508e-38, PT ;  /* 0x008000000300780b */
                                                                                               /* 0x000fda0003f0e200 */
        /*0760*/              @!P0 FMUL R3, R3, 16777216 ;                                     /* 0x4b80000003038820 */
                                                                                               /* 0x000fc80000400000 */

I would expect the throughput of MUFU.RSQ to be the limiting factor there, but it may be that the overall code and other instructions are contributing to a small reduction in achievable throughput. There might be a more clever way to express the kernel code to improve it.
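One possible way to try that (an untested sketch, assuming the same ITER macro as in the code above) is to give each thread several independent rsqrtf chains, so there is more independent MUFU work in flight:

// Untested sketch: same measurement idea, but each thread carries four
// independent rsqrtf dependency chains; the host-side op count would need
// to be multiplied by 4 accordingly.
__global__ void do_rsqrtf4(float in) {
    float a = in, b = in + 1.0f, c = in + 2.0f, d = in + 3.0f;
    for (int i = 0; i < ITER; i++) {
        a = rsqrtf(a);
        b = rsqrtf(b);
        c = rsqrtf(c);
        d = rsqrtf(d);
    }
    // dummy consumption of the results so the compiler cannot drop the loop
    if (a + b + c + d < 0.0f) printf("out = %f\n", a);
}

Whether that actually helps depends on whether the MUFU pipe itself or the surrounding FSETP/FMUL wrapper instructions are the bottleneck.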

I tried compiling with --use_fast_math and the cuobjdump output then shows a long sequence of MUFU.RSQ:

        /*0fc0*/                   MUFU.RSQ R8, R7 ;                         /* 0x0000000700087308 */
                                                                             /* 0x001e300000001400 */
        /*0fd0*/                   MUFU.RSQ R8, R8 ;                         /* 0x0000000800087308 */
                                                                             /* 0x001e300000001400 */
        /*0fe0*/                   MUFU.RSQ R9, R8 ;                         /* 0x0000000800097308 */
                                                                             /* 0x001e300000001400 */
        /*0ff0*/                   MUFU.RSQ R9, R9 ;                         /* 0x0000000900097308 */
                                                                             /* 0x001e300000001400 */
        /*1000*/                   MUFU.RSQ R3, R9 ;                         /* 0x0000000900037308 */

In this case, the measured throughput increases to about 1.55 Tops.

As Robert Crovella points out, it is important to examine the generated SASS (machine code) when benchmarking MUFU throughput (SFU is the older name for the MUFU, the multi-function unit). It would be a mistake to assume that CUDA device function intrinsics generally map to a single instruction. There are at a minimum two reasons for this:

(1) Some MUFU operations require a prior range-reduction operation (RRO), although the need for these has been eliminated in some (all?) cases in newer architectures. For example, Ampere no longer requires RRO.EX2 prior to MUFU.EX2, if I am not mistaken.

(2) Most (all?) MUFU operations do not support subnormal (denormal) operands. Support for subnormal operands is achieved by short emulation sequences that include the MUFU operation plus instructions for pre- and post-processing as needed. When compiling with -ftz=true, the need for the emulation wrapper is eliminated; -use_fast_math includes -ftz=true.
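A quick way to check this for any particular intrinsic is to compile a one-line kernel and dump its SASS (a minimal sketch; the file and kernel names here are made up for illustration):

// check_rsq.cu -- hypothetical minimal file for inspecting what rsqrtf()
// compiles to. Compare, for example:
//   nvcc -arch=sm_80 -cubin check_rsq.cu -o check_rsq.cubin
//   nvcc -arch=sm_80 -ftz=true -cubin check_rsq.cu -o check_rsq_ftz.cubin
//   cuobjdump -sass check_rsq.cubin
//   cuobjdump -sass check_rsq_ftz.cubin
// Without -ftz=true the MUFU.RSQ should appear wrapped in the FSETP/FMUL
// subnormal handling shown earlier; with -ftz=true it should stand alone.
__global__ void check_rsq(const float *in, float *out) {
    out[0] = rsqrtf(in[0]);
}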

FWIW, there are people who care about MUFU throughput, because it is relevant to their use case(s). But the ones I have met assessed performance at the application level, rather than benchmarking MUFU throughput in isolation.

Indeed, when compiling with -ftz=true in place of --use_fast_math, I also see just the MUFU.RSQ instructions in the unrolled SASS.

Thanks a lot for your example code. But another question: how do I set the values of ITER/BSZ/SSZ correctly? When I increase ITER, I just get larger and larger results:

// ITER = 1024 * 2:
[0]
0.575386s for 1099511 Mrsqrtf
throughput: 1.91091 Tops/s
[1]
0.575688s for 1099511 Mrsqrtf
throughput: 1.90991 Tops/s
[2]
0.52887s for 1099511 Mrsqrtf
throughput: 2.07898 Tops/s
[3]
0.578748s for 1099511 Mrsqrtf
throughput: 1.89981 Tops/s

// ITER = 1024 * 3:
[0]
1.14141s for 2473901 Mrsqrtf
throughput: 2.16742 Tops/s
[1]
1.1424s for 2473901 Mrsqrtf
throughput: 2.16554 Tops/s
[2]
1.13995s for 2473901 Mrsqrtf
throughput: 2.17019 Tops/s
[3]
1.14111s for 2473901 Mrsqrtf
throughput: 2.16797 Tops/s

// ITER = 1024 * 4:
[0]
1.93196s for 4398046 Mrsqrtf
throughput: 2.27646 Tops/s
[1]
1.93181s for 4398046 Mrsqrtf
throughput: 2.27665 Tops/s
[2]
1.9321s for 4398046 Mrsqrtf
throughput: 2.2763 Tops/s
[3]
1.93368s for 4398046 Mrsqrtf
throughput: 2.27444 Tops/s

// ITER = 1024 * 5:
[0]
2.94744s for 6871947 Mrsqrtf
throughput: 2.3315 Tops/s
[1]
2.94951s for 6871947 Mrsqrtf
throughput: 2.32986 Tops/s
[2]
2.94985s for 6871947 Mrsqrtf
throughput: 2.3296 Tops/s
[3]
2.94839s for 6871947 Mrsqrtf
throughput: 2.33074 Tops/s

// ITER = 1024 * 6:
[0]
4.19215s for 9895604 Mrsqrtf
throughput: 2.36051 Tops/s
[1]
4.192s for 9895604 Mrsqrtf
throughput: 2.36059 Tops/s
[2]
4.19121s for 9895604 Mrsqrtf
throughput: 2.36104 Tops/s
[3]
4.18903s for 9895604 Mrsqrtf
throughput: 2.36227 Tops/s

// ITER = 1024 * 7:
[0]
5.66067s for 13469017 Mrsqrtf
throughput: 2.37941 Tops/s
[1]
5.65854s for 13469017 Mrsqrtf
throughput: 2.3803 Tops/s
[2]
5.65976s for 13469017 Mrsqrtf
throughput: 2.37979 Tops/s
[3]
5.65892s for 13469017 Mrsqrtf
throughput: 2.38014 Tops/s


// ITER = 1024 * 8:
[0]
7.33901s for 17592186 Mrsqrtf
throughput: 2.39708 Tops/s
[1]
7.32566s for 17592186 Mrsqrtf
throughput: 2.40145 Tops/s
[2]
7.32868s for 17592186 Mrsqrtf
throughput: 2.40046 Tops/s
[3]
7.32378s for 17592186 Mrsqrtf
throughput: 2.40207 Tops/s

My GPU is:

Sat Dec 11 10:31:27 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   34C    P0    38W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   35C    P0    38W / 250W |      0MiB / 40536MiB |     31%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If I read the specs for the A100 PCIe correctly, the theoretical throughput for MUFU ops should be 2.43 Tops/sec, which your series of experiments seems to approach asymptotically. This suggests some fixed-size overhead (on the order of 0.14 seconds) whose impact diminishes with longer kernel run time.

Not sure where that overhead is coming from; I see the same performance of about 1.06 Tops/sec reported on my Quadro RTX 4000 (under Windows 10) with ITER, ITER*2, and ITER*3. The theoretical throughput at the achieved boost clock frequency of 1890 MHz would be 1.09 Tops/sec if my math is right.

A good starting point for the block size is 128 threads, since smaller-granularity blocks are generally preferable for maximum occupancy. But for a trivial kernel like this it should not really matter. Choosing the grid size as an integer multiple of the SM count (at least 100x) is a reasonable way to apply a balanced, continuous load to all execution units. The A100 has 108 SMs.
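For example (an untested sketch that would replace the hard-coded launch in the posted code; do_rsqrtf is the kernel already defined there):

// Untested sketch: query the SM count at runtime and size the grid as a
// multiple of it; the 100x factor and the 128-thread block size are just
// the rule-of-thumb values mentioned above. The op count for the
// throughput calculation then becomes ITER * block_size * grid_size.
int dev = 0, num_sms = 0;
cudaGetDevice(&dev);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
const int block_size = 128;
const int grid_size  = 100 * num_sms;   // 10800 blocks on a 108-SM A100
do_rsqrtf<<<grid_size, block_size>>>(1.0f);
cudaDeviceSynchronize();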

I am not sure what the significance of the BSZ*ITER grid sizing in the posted code is. It seems to be an attempt to increase the available parallelism as the length of the dependency chain in the kernel grows? Not sure why that would be necessary.

Agreed. This math is simply incorrect: 1410 * 108 * 16 = 1892.160 GOPs.

I guess I should have checked that arithmetic.

The correct number is 2.43 Tops. And with some fairly minor changes to the code I posted, it seems to approach that number, as already indicated and suggested. So it may just be a matter of a long enough warm-up to boost the core clock to a high value.

$ cat t2.cu
#include <iostream>
#include <stdio.h>

#define ITER 1024
#define SSZ 512
#define BSZ 512


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void do_rsqrtf(float in) {
    for (int i = 0; i < ITER; i++)
      in = rsqrtf(in);
    if (!in) printf("out = %f\n", in);
}


int main() {
    do_rsqrtf<<<BSZ, SSZ>>>(1.0f);  // warm up
    do_rsqrtf<<<BSZ*ITER, SSZ>>>(1.0f);
    cudaDeviceSynchronize();
    unsigned long long dt = dtime_usec(0);
    do_rsqrtf<<<BSZ*ITER, SSZ>>>(1.0f);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= ITER;
    ops *= SSZ;
    ops *= BSZ;
    float et = dt/(float)USECPSEC;
    unsigned long long Mops = ops/1000000;
    std::cout<<et<<"s for "<< Mops << " Mrsqrtf"<<std::endl;
    float tp = (Mops)/(et*1000000);
    std::cout << "throughput: " << tp << " Tops/s" << std::endl;
}
$ nvcc -arch=sm_80 -o t2 t2.cu --ftz=true
$ ./t2 && ./t2
0.153161s for 274877 Mrsqrtf
throughput: 1.79469 Tops/s
0.112936s for 274877 Mrsqrtf
throughput: 2.43392 Tops/s
$

Anyhow, you should now have enough information to draw your own conclusions and do your own benchmarking.


The 1892.16 GOPs was calculated using 1095 MHz, which is the base clock of the A100. Its boost clock is 1410 MHz, so the theoretical max throughput is 1410 * 108 * 16 = 2436.48 GOPs/s, i.e. 2.43648 Tops/s. Sorry, my memory failed me.
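For completeness, the arithmetic spelled out as a tiny standalone program (just re-doing the calculation above):

// Back-of-envelope check: 16 SFU results per clock per SM, 108 SMs,
// base clock 1095 MHz vs. boost clock 1410 MHz.
#include <cstdio>

int main() {
    const double results_per_clock_per_sm = 16.0;
    const double num_sms = 108.0;
    printf("base : %.2f GOPs/s\n", 1095e6 * num_sms * results_per_clock_per_sm / 1e9);  // 1892.16
    printf("boost: %.2f GOPs/s\n", 1410e6 * num_sms * results_per_clock_per_sm / 1e9);  // 2436.48
    return 0;
}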