Blackwell Integer

deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “NVIDIA GeForce RTX 5080”
CUDA Driver Version / Runtime Version 12.8 / 12.3
CUDA Capability Major/Minor version number: 12.0
Total amount of global memory: 16303 MBytes (17094475776 bytes)
MapSMtoCores for SM 12.0 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 12.0 is undefined. Default to use 128 Cores/SM
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 2730 MHz (2.73 GHz)
Memory Clock rate: 15001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 67108864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.3, NumDevs = 1
Result = PASS

Thanks! It helps to settle questions around the compute capability of the RTX 50xx series GPUs.

What kernel should I run, then, to see the 2x int32 difference between the 5080 and the 4080? None of my real programs with complex int32 kernels show this difference, not just this simple test.

It’s not a trivial matter to write such a test.

You could try something like this. On my L4 GPU it gets me into the 6-7 TOps/s range, and studying the SASS (for cc8.9) shows an inner loop that is mostly integer adds. I haven't really tried to make the calculation of TOps/s correct, so don't use this as an actual calibrated measurement, but if you run it on a GPU that has twice the integer throughput, the reported number should be twice as large. You'll want to set num_SM to the number of multiprocessors of the GPU you are running on (obtainable from deviceQuery) and compile for cc8.9 if running on an Ada GPU (such as the 4080) or cc12.0 for your 5080. Note that your machine is not configured correctly to run this test on a cc12.0 GPU, at least not to my liking (it should work, but it will be hard to analyze the SASS if needed):

That means you have proper driver support (12.8) for your 5080 but not proper CUDA toolkit support (12.3) for it. If you compile with PTX (e.g. cc8.9 PTX), then the driver should JIT the code for your 5080 and things should "work", but you wouldn't be able to easily inspect the SASS. It would be better to actually install CUDA toolkit 12.8 on that machine and compile explicitly for cc12.0.
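For illustration, the two compile paths just described would look roughly like this (a sketch; each command assumes the toolkit being used supports the named architecture):

nvcc -gencode arch=compute_89,code=compute_89 t335.cu -o t335
(12.3 toolkit: embeds cc8.9 PTX only, so the 12.8 driver JIT-compiles it for the 5080)

nvcc -arch=sm_120 t335.cu -o t335 -lineinfo -Xptxas=-v
(12.8 toolkit: builds cc12.0 SASS directly, which can then be inspected)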

And you'll have to modify the timing code for your Windows machine.
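For example (just a sketch, not tested on Windows), the gettimeofday()-based dtime_usec() below could be swapped for a std::chrono version that builds with both gcc and MSVC:

#include <chrono>

// portable replacement for the gettimeofday()-based timer
unsigned long long dtime_usec(unsigned long long start = 0){
  using namespace std::chrono;
  unsigned long long now = duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
  return now - start;
}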

Here is my test case, use at your own risk, does not give calibrated results:

# cat t335.cu
#include <iostream>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL


unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int AL = 4;
template <typename T>
__global__ void addKernel(T * __restrict__  c, const T * __restrict__  a, const int l)
{
  T b[AL];
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = 0; i < AL; i++) b[i] = a[index + i * gridDim.x*blockDim.x];
  T val = 0;
  #pragma unroll AL
  for (int i = 0; i < l; i++)
  {
    val +=  b[i];
  }
  if (val == 0) c[index] = val;
}

using mt = int;
const int num_SM = 58;
const int blk_per_SM = 3;

const int nBLK = blk_per_SM*num_SM;
const int nTPB = 512;
const int mnr = 8192;
int main(int argc, char *argv[]){

  int nr = mnr;
  if (argc > 1) nr = atoi(argv[1]);
  mt *c, *a;
  cudaMalloc(&a, AL*nBLK*nTPB*sizeof(a[0]));
  cudaMemset(a,1, AL*nBLK*nTPB*sizeof(a[0]));
  cudaMalloc(&c, nBLK*nTPB*sizeof(c[0]));
  addKernel<<<nBLK, nTPB>>>(c, a, nr); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  addKernel<<<nBLK, nTPB>>>(c, a, nr);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  unsigned long long iops  = nBLK*nTPB*(unsigned long long)nr;
  unsigned long long iopps = iops/dt;
  std::cout << "usec: " << dt << std::endl;
  std::cout << "iops: " << iops << std::endl;
  std::cout << "iops/s: " << iopps/1e6 << "TOps/s" << std::endl;
}
# nvcc -arch=sm_89 t335.cu -o t335 -lineinfo -Xptxas=-v
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9addKernelIiEvPT_PKS0_i' for 'sm_89'
ptxas info    : Function properties for _Z9addKernelIiEvPT_PKS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 372 bytes cmem[0]
# ./t335 100000
usec: 1238
iops: 8908800000
iops/s: 7.19612TOps/s
#
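In case anyone wants to reproduce the listing below, one way to get it from the built binary is with the CUDA binary utilities:

# cuobjdump -sass ./t335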

CUDA 12.2 SASS:

        code for sm_89
                Function : _Z9addKernelIiEvPT_PKS0_i
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;           /* 0x00000a00ff017624 */
                                                                                     /* 0x000fc400078e00ff */
        /*0010*/                   S2R R0, SR_CTAID.X ;                              /* 0x0000000000007919 */
                                                                                     /* 0x000e220000002500 */
        /*0020*/                   IMAD.MOV.U32 R11, RZ, RZ, c[0x0][0x170] ;         /* 0x00005c00ff0b7624 */
                                                                                     /* 0x000fe200078e00ff */
        /*0030*/                   ULDC.64 UR6, c[0x0][0x118] ;                      /* 0x0000460000067ab9 */
                                                                                     /* 0x000fe20000000a00 */
        /*0040*/                   IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x0] ;            /* 0x00000000ff057624 */
                                                                                     /* 0x000fe200078e00ff */
        /*0050*/                   S2R R3, SR_TID.X ;                                /* 0x0000000000037919 */
                                                                                     /* 0x000e220000002100 */
        /*0060*/                   IMAD.MOV.U32 R13, RZ, RZ, 0x4 ;                   /* 0x00000004ff0d7424 */
                                                                                     /* 0x000fe200078e00ff */
        /*0070*/                   ISETP.GE.AND P0, PT, R11, 0x1, PT ;               /* 0x000000010b00780c */
                                                                                     /* 0x000fe20003f06270 */
        /*0080*/                   IMAD R0, R0, c[0x0][0x0], R3 ;                    /* 0x0000000000007a24 */
                                                                                     /* 0x001fc800078e0203 */
        /*0090*/                   IMAD R4, R5, c[0x0][0xc], R0 ;                    /* 0x0000030005047a24 */
                                                                                     /* 0x000fc800078e0200 */
        /*00a0*/                   IMAD R6, R5, c[0x0][0xc], R4 ;                    /* 0x0000030005067a24 */
                                                                                     /* 0x000fc800078e0204 */
        /*00b0*/                   IMAD R8, R5, c[0x0][0xc], R6 ;                    /* 0x0000030005087a24 */
                                                                                     /* 0x000fe200078e0206 */
        /*00c0*/              @!P0 BRA 0x350 ;                                       /* 0x0000028000008947 */
                                                                                     /* 0x000fea0003800000 */
        /*00d0*/                   IMAD.WIDE.U32 R4, R4, R13, c[0x0][0x168] ;        /* 0x00005a0004047625 */
                                                                                     /* 0x000fc800078e000d */
        /*00e0*/                   IMAD.WIDE.U32 R6, R6, R13.reuse, c[0x0][0x168] ;  /* 0x00005a0006067625 */
                                                                                     /* 0x080fe400078e000d */
        /*00f0*/                   LDG.E.CONSTANT R5, [R4.64] ;                      /* 0x0000000604057981 */
                                                                                     /* 0x000164000c1e9900 */
        /*0100*/                   IMAD.WIDE.U32 R8, R8, R13.reuse, c[0x0][0x168] ;  /* 0x00005a0008087625 */
                                                                                     /* 0x080fe400078e000d */
        /*0110*/                   LDG.E.CONSTANT R6, [R6.64] ;                      /* 0x0000000606067981 */
                                                                                     /* 0x000164000c1e9900 */
        /*0120*/                   IMAD.WIDE.U32 R2, R0, R13, c[0x0][0x168] ;        /* 0x00005a0000027625 */
                                                                                     /* 0x000fe400078e000d */
        /*0130*/                   LDG.E.CONSTANT R9, [R8.64] ;                      /* 0x0000000608097981 */
                                                                                     /* 0x000168000c1e9900 */
        /*0140*/                   LDG.E.CONSTANT R2, [R2.64] ;                      /* 0x0000000602027981 */
                                                                                     /* 0x000162000c1e9900 */
        /*0150*/                   IADD3 R10, R11, -0x1, RZ ;                        /* 0xffffffff0b0a7810 */
                                                                                     /* 0x000fc80007ffe0ff */
        /*0160*/                   ISETP.GE.U32.AND P0, PT, R10, 0x3, PT ;           /* 0x000000030a00780c */
                                                                                     /* 0x000fe20003f06070 */
        /*0170*/                   UMOV UR4, URZ ;                                   /* 0x0000003f00047c82 */
                                                                                     /* 0x000fe20008000000 */
        /*0180*/                   LOP3.LUT R11, R11, 0x3, RZ, 0xc0, !PT ;           /* 0x000000030b0b7812 */
                                                                                     /* 0x000fe200078ec0ff */
        /*0190*/                   IMAD.MOV.U32 R10, RZ, RZ, RZ ;                    /* 0x000000ffff0a7224 */
                                                                                     /* 0x000fc600078e00ff */
        /*01a0*/                   ISETP.NE.AND P1, PT, R11, RZ, PT ;                /* 0x000000ff0b00720c */
                                                                                     /* 0x000fce0003f25270 */
        /*01b0*/              @!P0 BRA 0x230 ;                                       /* 0x0000007000008947 */
                                                                                     /* 0x000fec0003800000 */
        /*01c0*/                   IADD3 R3, -R11, c[0x0][0x170], RZ ;               /* 0x00005c000b037a10 */
                                                                                     /* 0x001fc80007ffe1ff */
        /*01d0*/                   IADD3 R3, R3, -0x4, RZ ;                          /* 0xfffffffc03037810 */
                                                                                     /* 0x000fe20007ffe0ff */
        /*01e0*/                   UIADD3 UR4, UR4, 0x4, URZ ;                       /* 0x0000000404047890 */
                                                                                     /* 0x000fe2000fffe03f */
        /*01f0*/                   IADD3 R10, R5, R2, R10 ;                          /* 0x00000002050a7210 */
                                                                                     /* 0x020fe40007ffe00a */
        /*0200*/                   ISETP.NE.AND P0, PT, R3, RZ, PT ;                 /* 0x000000ff0300720c */
                                                                                     /* 0x000fe40003f05270 */
        /*0210*/                   IADD3 R10, R9, R6, R10 ;                          /* 0x00000006090a7210 */
                                                                                     /* 0x000fd60007ffe00a */
        /*0220*/               @P0 BRA 0x1d0 ;                                       /* 0xffffffa000000947 */
                                                                                     /* 0x000fea000383ffff */
        /*0230*/              @!P1 BRA 0x330 ;                                       /* 0x000000f000009947 */
                                                                                     /* 0x001fea0003800000 */
        /*0240*/                   USHF.L.U32 UR4, UR4, 0x2, URZ ;                   /* 0x0000000204047899 */
                                                                                     /* 0x000fcc000800063f */
        /*0250*/                   MOV R4, UR4 ;                                     /* 0x0000000400047c02 */
                                                                                     /* 0x000fc80008000f00 */
        /*0260*/                   ISETP.EQ.AND P1, PT, R4.reuse, RZ, PT ;           /* 0x000000ff0400720c */
                                                                                     /* 0x040fe40003f22270 */
        /*0270*/                   ISETP.EQ.AND P2, PT, R4.reuse, 0x4, PT ;          /* 0x000000040400780c */
                                                                                     /* 0x040fe40003f42270 */
        /*0280*/                   ISETP.EQ.AND P3, PT, R4.reuse, 0x8, PT ;          /* 0x000000080400780c */
                                                                                     /* 0x040fe40003f62270 */
        /*0290*/                   IADD3 R11, R11, -0x1, RZ ;                        /* 0xffffffff0b0b7810 */
                                                                                     /* 0x000fe40007ffe0ff */
        /*02a0*/                   ISETP.EQ.AND P0, PT, R4.reuse, 0xc, PT ;          /* 0x0000000c0400780c */
                                                                                     /* 0x040fe40003f02270 */
        /*02b0*/                   IADD3 R4, R4, 0x4, RZ ;                           /* 0x0000000404047810 */
                                                                                     /* 0x000fc60007ffe0ff */
        /*02c0*/               @P1 IMAD.MOV.U32 R3, RZ, RZ, R2 ;                     /* 0x000000ffff031224 */
                                                                                     /* 0x020fe200078e0002 */
        /*02d0*/                   ISETP.NE.AND P1, PT, R11, RZ, PT ;                /* 0x000000ff0b00720c */
                                                                                     /* 0x000fe20003f25270 */
        /*02e0*/               @P2 IMAD.MOV.U32 R3, RZ, RZ, R5 ;                     /* 0x000000ffff032224 */
                                                                                     /* 0x000fe400078e0005 */
        /*02f0*/               @P3 IMAD.MOV.U32 R3, RZ, RZ, R6 ;                     /* 0x000000ffff033224 */
                                                                                     /* 0x000fc800078e0006 */
        /*0300*/               @P0 MOV R3, R9 ;                                      /* 0x0000000900030202 */
                                                                                     /* 0x000fca0000000f00 */
        /*0310*/                   IMAD.IADD R10, R10, 0x1, R3 ;                     /* 0x000000010a0a7824 */
                                                                                     /* 0x000fe200078e0203 */
        /*0320*/               @P1 BRA 0x260 ;                                       /* 0xffffff3000001947 */
                                                                                     /* 0x000fea000383ffff */
        /*0330*/                   ISETP.NE.AND P0, PT, R10, RZ, PT ;                /* 0x000000ff0a00720c */
                                                                                     /* 0x000fda0003f05270 */
        /*0340*/               @P0 EXIT ;                                            /* 0x000000000000094d */
                                                                                     /* 0x000fea0003800000 */
        /*0350*/                   IMAD.WIDE R2, R0, R13, c[0x0][0x160] ;            /* 0x0000580000027625 */
                                                                                     /* 0x020fca00078e020d */
        /*0360*/                   STG.E [R2.64], RZ ;                               /* 0x000000ff02007986 */
                                                                                     /* 0x000fe2000c101906 */
        /*0370*/                   EXIT ;                                            /* 0x000000000000794d */
                                                                                     /* 0x000fea0003800000 */
        /*0380*/                   BRA 0x380;                                        /* 0xfffffff000007947 */
                                                                                     /* 0x000fc0000383ffff */

(The inner loop runs from 01b0 to 0220, and the first 2 IADD3 instructions in it are for loop control, not the arithmetic of interest. I haven't done a proper job of counting actual integer ops here.)
I threw this together “hastily”. It may have defects. Someone else may have something better.

Later: Don’t use this code, it has a bug, discussed later.

Below is another quickly constructed integer-op throughput benchmark. The kernel should fit in the ICache, the memory size requirements are modest and suitable for pretty much all modern GPUs, and the bandwidth requirements are modest, making it compute bound. Only the simplest integer ops are used (i.e. no shifts, no multiplies). On my Quadro RTX 4000 the results look plausible. I can see from GPU-Z output that the GPU clocks up to its maximum boost clock of 1920 MHz while this short code is running: 1920 MHz * 36 SMs * 32 = 2.211 Tiops/sec. The output of the test program jibes with that:

running on device 0 (Quadro RTX 4000)
intbench: using 256 threads per block, 390625 blocks
intbench: mintime = 21.464 msec  throughput = 2.24 Tiops/sec
res[0] = 00000000
#include <cstdio>
#include <cstdlib>
#include <cstdint>

#define INTBENCH_THREADS  (256)
#define INTBENCH_ITER     (10)
#define INTBENCH_ARR_LEN  (100000000)
#define INTBENCH_REPEAT   (12)
#define DEVICE_ORDINAL    (0)

const int REPEAT = INTBENCH_REPEAT;

__device__ uint32_t lop3_14 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0x14;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}

__device__ uint32_t lop3_d2 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0xd2;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}

__device__ uint32_t lop3_28 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0x28;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}

__device__ uint32_t lop3_e4 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0xe4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
} 

__device__ uint32_t lop3_f4 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0xf4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
} 

#define MSB_MASK (0x80808080U)  // mask for msb of each byte

__device__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
    asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
    return a;
}

__device__ uint32_t masked_select (uint32_t a, uint32_t b, uint32_t m)
{
    return lop3_e4 (a, b, m);
}

__device__ uint32_t my_vaddss4 (uint32_t a, uint32_t b)
{
    uint32_t sum, res, ofl, sga, msk;
    res = (a & ~MSB_MASK) + (b & ~MSB_MASK);
    sum = a ^ b;
    ofl = lop3_14 (res, a, sum);         // ofl = (res ^ a) & ~sum
    sga = masked_sign_to_byte_mask (a);  // sign(a)-mask
    msk = masked_sign_to_byte_mask (ofl);// overflow-mask
    res = lop3_d2 (res, ~MSB_MASK, sum); // res = res ^ (MSB_MASK & sum)
    sga = lop3_28 (sga, ~MSB_MASK, msk); // sga = (sga ^ ~MSB_MASK) & msk
    res = lop3_f4 (sga, res, msk);       // res = sga | (res & ~msk)
    return res;
}

__device__ uint32_t my_vsubss4 (uint32_t a, uint32_t b)
{
    uint32_t r, s, t;
    r = a | MSB_MASK;                     // LOP3
    s = b &~ MSB_MASK;                    // LOP3
    r = r - s;                            // IADD
    t = (b ^ a) & MSB_MASK;               // LOP3 
    s = t & (r ^ a);                      // LOP3 // msb indicates overflow
    r = r ^ t ^ MSB_MASK;                 // LOP3 // regular result
    t = masked_sign_to_byte_mask (s);     // PRMT
    s = ~MSB_MASK + ((a & MSB_MASK) >> 7);// LOP3, LEA // potential special res.
    r = masked_select (s, r, t);          // LOP3 //select spec. or reg. result
    return r;
}

__global__ void kernel (const uint32_t * __restrict__ src, 
                        uint32_t * __restrict__ dst,
                        uint32_t incr1,
                        uint32_t incr2,
                        int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        uint32_t s = src[i];
        uint32_t t = s;
#pragma unroll REPEAT
        for (int k = 0; k < REPEAT; k++) {
            s = my_vaddss4 (s, incr1);
            t = my_vsubss4 (t, incr2);
        }
#pragma unroll REPEAT
        for (int k = 0; k < REPEAT; k++) {
            s = my_vsubss4 (s, incr2);
            t = my_vaddss4 (t, incr1);
        }
        dst[i] = s - t;
    }
}    

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop, elapsed, mintime;
    uint32_t *d_arg = 0, *d_res = 0;
    const int len = INTBENCH_ARR_LEN;
    struct cudaDeviceProp props;

    CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, DEVICE_ORDINAL));
    printf ("running on device %d (%s)\n", DEVICE_ORDINAL, props.name);

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_arg, sizeof(d_arg[0]) * len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_res, sizeof(d_res[0]) * len));

    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_arg, 0x00, sizeof(d_arg[0]) * len));
    CUDA_SAFE_CALL (cudaMemset(d_res, 0xff, sizeof(d_res[0]) * len));

   /* Compute execution configuration */
    dim3 dimBlock(INTBENCH_THREADS);
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);

    printf ("intbench: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);
    fflush (stdout);

    mintime = fabs(log(0.0));
    for (int k = 0; k < INTBENCH_ITER; k++) {
        cudaDeviceSynchronize();
        start = second();
        kernel<<<dimGrid,dimBlock>>>(d_arg, d_res, 0x01020304, 0x01010101, len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("intbench: mintime = %.3f msec  throughput = %.2f Tiops/sec\n",
            1000 * mintime, ((4 * REPEAT * 10) * 1e-12 * len) / mintime);
    fflush(stdout);

    /* sample the result */
    uint32_t res_0;
    CUDA_SAFE_CALL (cudaMemcpy (&res_0, d_res, sizeof (res_0), cudaMemcpyDeviceToHost));
    printf ("res[0] = %08x\n", res_0);
    
    CUDA_SAFE_CALL (cudaFree(d_arg));
    CUDA_SAFE_CALL (cudaFree(d_res));

    return EXIT_SUCCESS;
}
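To build and run it, something along these lines should do (a sketch; the file name is arbitrary and -arch should match your GPU):

nvcc -arch=sm_89 -o intbench intbench.cu
./intbench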

If I understand the RTX 5080 specs correctly, full-throughput INT32 ops would equate to 2617 MHz (boost) * 84 SMs * 128 = 28.1 Tiops/sec, consistent with the 56.3 FP32 TFLOPS (boost) that I see mentioned in online reviews.
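As a sanity check, here is a minimal sketch of that peak estimate computed from the device attributes at runtime. The 128 INT32 lanes per SM is an assumption based on the Blackwell claim (it would be 64 for Ampere/Ada/Hopper), and cudaDevAttrClockRate reports the maximum clock, which the GPU may not sustain:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int clk_khz = 0, num_sm = 0;
    cudaDeviceGetAttribute(&clk_khz, cudaDevAttrClockRate, 0);           // max clock in kHz
    cudaDeviceGetAttribute(&num_sm,  cudaDevAttrMultiProcessorCount, 0); // SM count
    const double lanes = 128.0;  // assumed INT32 ops per SM per clock (128 claimed for Blackwell)
    double peak_tiops = clk_khz * 1e3 * num_sm * lanes * 1e-12;
    printf("assumed peak INT32: %.1f Tiops/sec (%d SMs at %d kHz, %.0f lanes/SM)\n",
           peak_tiops, num_sm, clk_khz, lanes);
    return 0;
}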

I see from the device properties output posted earlier that the RTX 5080 has a maximum boost clock of 2730 MHz; it’s not clear whether it can reach that at present.

Thanks Norbert.

For what it's worth, which is not much in the context of this thread apart from it being the last GPU generation to have INT32 cores == FP32 cores, here's the result on a 6GB GTX 1060 (10 SMs):

running on device 0 (NVIDIA GeForce GTX 1060 6GB)
intbench: using 256 threads per block, 390625 blocks
intbench: mintime = 23.737 msec  throughput = 2.02 Tiops/sec
res[0] = 00000000

Which is not far from your Quadro RTX 4000.

I forgot to state that my quick & dirty benchmark is based on device functions that should map to 10 instructions (all at maximum INT32 throughput) each for architectures >= sm_70. One would have to recalibrate this (by inspection of the generated SASS) for older architectures.

That said, the specifications of the GTX 1060 suggest integer throughput of around 2.2 Tiops/sec, so we could call this “close enough”.

running on device 0 (NVIDIA GeForce RTX 4080)
intbench: using 256 threads per block, 390625 blocks
intbench: mintime = 3.926 msec throughput = 12.23 Tiops/sec
res[0] = 00000000

running on device 0 (NVIDIA GeForce RTX 5080)
intbench: using 256 threads per block, 390625 blocks
intbench: mintime = 3.099 msec throughput = 15.49 Tiops/sec
res[0] = 00000000

Still no 2x increase for the 5080. It's probably also memory bound, just like my test.

Thanks for collecting this data.

My benchmark is designed not to be memory bound. Per 8 bytes of GMEM traffic it executes 478 instructions according to the disassembled SASS. This means 15.49 Tiops/sec equates to about 260 GB/sec, whereas the theoretical bandwidth of the RTX 5080 is 960 GB/sec, of which about 860 GB/sec should be available in real-life scenarios.

Your numbers are roughly in line with an assumed boost clock of 2730 MHz: 2730 MHz * 84 SMs * 64 = 14.67 Tiops/sec. Maybe the latest compilers are able to trim a few instructions from the kernel, slightly exaggerating the reported performance.

4080:
usec: 886
iops: 729808896
iops/s: 8.23712TOps/s

5080:
usec: 705
iops: 729808896
iops/s: 10.3519TOps/s

Compiled with sm_89 on CUDA 12.8. The same kind of result I see everywhere: some increase, but very far from 2x.
So, no 2x yet, even when not memory bound…

PS: The deviceQuery output I sent earlier was from an old 12.3 build of deviceQuery; now I'm compiling everything with 12.8, of course. BTW, the deviceQuery sample source code on GitHub should actually be updated to include the Blackwell device mapping ("MapSMtoCores for SM 12.0 is undefined").
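For reference, a sketch of the kind of entry that would silence that warning in the samples' helper_cuda.h (_ConvertSMVer2Cores); the exact table contents vary between samples versions, and 128 cores/SM is simply the value the fallback message already assumes:

// in Common/helper_cuda.h, _ConvertSMVer2Cores(): extend the nGpuArchCoresPerSM[] table
sSMtoCores nGpuArchCoresPerSM[] = {
    // ... existing entries ...
    {0x89, 128},   // Ada
    {0x90, 128},   // Hopper
    {0xc0, 128},   // assumed: SM 12.0 (Blackwell GeForce), 128 CUDA cores per SM
    {-1, -1}};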

From what I gathered from the few online reviews that cover compute tasks, the performance ratio RTX 4090 → RTX 5090 came out to between 1.27x and 1.42x. That is obviously with an unknown instruction mix and with various degrees of memory boundedness.

All good things must come to an end. Moore’s Law is dead, and the available microarchitectural cards have pretty much all been played at this point, so I don’t think it is realistic to expect more than this for a general purpose mixed-use part.

Processors optimized for particular application areas (to the detriment of others), e.g. AI-focused, classical-HPC-focused, or graphics-focused parts, could deliver further performance increases.

Yeah, but Blackwell promised 2x int32 over Ada in all the marketing materials, which is currently not the case. At the moment I see a speed increase only from the higher clock and more threads, with no architectural difference at all. It probably needs time for updated drivers and/or CUDA runtime. I don't know, but it would be nice to see the 2x.

Frankly, I generally do not pay attention to marketing claims. How did Shakespeare put it? "… full of sound and fury, signifying nothing". Facts on the ground are the only thing that counts.

Maybe (speculation!) there is some special operations path (like tensor cores) that can be used to demonstrate such a 2x claim. I have not paid attention because based on CapEx and OpEx these products are not for me.

Besides the larger available die area, the small floating-point types, and thread block clusters, the doubled integer performance was the main improvement of Blackwell over Hopper/Ada. And the doubled performance is stated in the datasheets, which is a bit more concrete than marketing text.

Page 50, Peak INT32 TOPS (non-Tensor): RTX 5090: 104.8 TOPS; RTX 5080: 56.3 TOPS.
The same as the FP32 performance.

Instructions per second would be half as many, since IMAD does a multiplication plus an accumulation. So we would expect 28.15 TOPs/s in boost mode instead of the 10.3 TOPs/s measured.

Could you run the benchmarks with FP32 as comparison? Then we can exclude the influence of memory and other factors.

So normally just one specific pipeline is used for INT32, and only with certain instructions is the other pipeline, or both, used?

Or alternatively, since arithmetic computations have fixed latencies, does ptxas have to be updated to make better use of both pipelines? Then again, the benchmark should hide those latencies given the number of threads?

I can run any benchmark to see the difference between 4080 and 5080. Just send source code.

That would be the obvious way to get a factor of two :-) I guess I shouldn’t tackle these things in the middle of the night.

But in order to get to 56T integer operations with IMADs, one would still need a throughput of 28T IMAD instructions per second, and so far only half that has been demonstrated for INT32 instructions. Is there possibly an updated flavor of the DP4A instruction (or something of that nature) that gets us there?

It will probably be best to simply wait for NVIDIA to update the throughput tables in the documentation.

This remark was just to put the datasheet numbers (56T) and the benchmark numbers (10T) in relation to each other. Not about your post, which was spot on.

Just replace int with float, same data width.
For Robert’s kernel it is the line:

using mt = int;

The rest is templated.
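For example, the only edit to Robert's t335.cu would be (a sketch):

using mt = float;   // was: using mt = int; the templated addKernel itself is unchanged

With mt = float the inner loop should then show FADD instead of IADD3 instructions in the SASS, which is worth confirming with cuobjdump -sass.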

Norbert’s kernel uses lop3, which of course cannot be easily ported to fp32.

5080 int32:

sm_89:
usec: 792
iops: 729808896
iops/s: 9.21475 TOps/s

sm_100:
usec: 705
iops: 729808896
iops/s: 10.3519 TOps/s

sm_101:
usec: 725
iops: 729808896
iops/s: 10.0663 TOps/s

sm_120:
usec: 685
iops: 729808896
iops/s: 10.6541 TOps/s

5080 float:

sm_89:
usec: 829
iops: 729808896
iops/s: 8.80348 TOps/s

sm_100:
usec: 829
iops: 729808896
iops/s: 8.80348 TOps/s

sm_101:
usec: 837
iops: 729808896
iops/s: 8.71934 TOps/s

sm_120:
usec: 818
iops: 729808896
iops/s: 8.92186 TOps/s

So even float is too slow! Could you run it under Nsight Compute (both the integer and float versions), please?
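In case it helps, something along these lines (a sketch; it assumes the ncu command-line tool is installed, t335_report is just a placeholder output name, and the metric names can be verified with ncu --query-metrics):

ncu --set full -o t335_report ./t335 100000
(full kernel profile, saved to a report that can be opened in the Nsight Compute GUI)

ncu --metrics smsp__sass_thread_inst_executed_op_integer_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum ./t335 100000
(counts executed integer vs. FP32-add SASS thread instructions directly)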