Blackwell Integer

Maybe the int32 doesnt have the bandwith and the cache isnt fast enough with 64bit words. Have yall tried 32 bit words? Just a suggestion.

Im not saying its a design flaw but maybe the cache just is saturated in the core design between fp32 and int32 somehow.

Its possible its just setting idle waiting stuffering ( stuttering and buffering)? Anyway to measure warp thread saturation?

If it aint using any global memmory gpu memmory and only cache could 64 bits cause a slowdown?

You mean accessing the memory with 64 bits as in Robertā€™s SASS code? With the Cuda architectures I know or tested, it has the same speed (or is even slightly faster) as 32 bit accesses.

my posted code has a bug in it. probably best not to base anything on it. Not sure when I will be able to look at it again. i in the last loop can/will exceed the extent of b. I donā€™t understand the references to 64-bits.

Here is a quick & dirty FP32 throughput test. It can only demonstrate most of the theoretical throughput. While memory boundedness has been avoided by design, there is some non-FFMA overhead, and the (outer) loopā€™s body may exceed the ICache. The output for my Quadro RTX 4000 is:

running on device 0 (Quadro RTX 4000)
using 256 threads per block, 524160 blocks, 1.073480 GB used
flop= 1.649670e+12  elapsed=0.19231 sec  throughput=8.57807 FP32 TFLOPS

Note that running this benchmark may cause perf capping due to voltage stability; on my RTX 4000 GPU-Z reports ā€œPerfCap Vrelā€. It is also advisable to run this with a cool GPU that has been idling for a while. In my test I started out with the GU at 42-43 degrees Celsius.

According to GPU-Z monitoring, my GPU boosted to 1905 MHz while running this test, so the theoretical FP32 FLOPS are 1905 MHz * 36 SMs * 64 *2 = 8.778 TFLOPS.

[code below updated 2/3/2025 9pm PST]

#include <cstdlib>
#include <cstdio>

#define DEVICE_ORDINAL  (0)
#define THREADS_PER_BLK (256)
#define LEN             (65520 * 1024 * 2)
#define POLY_DEPTH      (1536)
#define REPS            (2)
#define ITER            (10)

const int DEPTH = POLY_DEPTH;

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__global__ void kernel (const float * __restrict__ src, 
                        float * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        float p = src[i] + 1.000001f;
        float q = src[i] + 1.000002f;
        for (int k = 0; k < REPS; k++) {
#pragma unroll DEPTH
            for (int j = 0; j < DEPTH; j++) {
                p = fmaf (p, p, 1.000001f);
                q = fmaf (q, q, 1.000002f);
            }
        }
        dst[i] = p + q;
    }
}    

int main (void)
{
    double start, stop, elapsed, mintime=1e308, nbr_of_fma;
    float *d_a, *d_b;
    struct cudaDeviceProp props;

    CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, DEVICE_ORDINAL));
    printf ("running on device %d (%s)\n", DEVICE_ORDINAL, props.name);

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * LEN));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * LEN));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * LEN)); // zero

    /* Compute execution configuration */
    dim3 dimBlock(THREADS_PER_BLK);
    int threadBlocks = (LEN + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    
    printf ("using %d threads per block, %d blocks, %f GB used\n", 
            dimBlock.x, dimGrid.x, 2*1e-9*LEN*sizeof(d_a[0]));

    for (int k = 0; k < ITER; k++) {
        cudaDeviceSynchronize();
        start = second();
        kernel<<<dimGrid,dimBlock>>>(d_a, d_b, LEN);
        CHECK_LAUNCH_ERROR();
        stop = second();
    }
    elapsed= stop - start;
    if (elapsed < mintime) { mintime = elapsed; }
    nbr_of_fma = (2.0 * DEPTH * REPS + 3.0) * LEN;
    printf ("flop=%13.6e  elapsed=%.5f sec  throughput=%.5f FP32 TFLOPS\n", 
            nbr_of_fma * 2, mintime, nbr_of_fma * 2 *1e-12/mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

The reputed size for Turing is expressed in Table 3.1, P.22 here.

@rs277 Thanks for the pointer. Turns out I grabbed the wrong piece of code to construct my timing framework. As they say: Haste makes waste. I have updated the posted code in-place to fix this, and the measured performance is now much closer to the theoretical limit.

Here is a quick benchmark for INT32 throughput testing based on IMAD (where 1 IMAD counts as 2 INT32 ops). On my Quadro RTX 4000, the output is:

testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (Quadro RTX 4000)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12  elapsed=0.58310 sec  throughput=8.48737 Tiops (via IMAD)

According to the throughput table in the CUDA 12.8 Programming Guide, on sm_75 both FFMA and IMAD have the same throughput of 64 instructions / cycle / SM, so this makes sense.

Why this does not jibe with my earlier integer throughput test, I do not know at the moment. According to the throughput table, all simple integer instructions should also have a throughput of 64 instructions / cycle / SM on sm_75, but my earlier test demonstrated only half that rate on real-life code (byte-wise addition with signed saturation). I donā€™t think I dropped a factor of two somewhere, but who knows ā€¦

#include <cstdlib>
#include <cstdio>
#include <cstdint>

#define DEVICE_ORDINAL  (0)
#define THREADS_PER_BLK (256)
#define LEN             (65520 * 1024 * 2)
#define STAGES          (192)
#define REPS            (16)
#define ITER            (10)

const int DEPTH = STAGES;

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__device__ uint32_t imad_mix (uint32_t a, uint32_t b, uint32_t c)
{
    c = a * b + c;
    a = b * c + a;
    b = c * a + b;
    return b;
}

__global__ void kernel (const uint32_t * __restrict__ src, 
                        uint32_t * __restrict__ dst, 
                        uint32_t a, uint32_t b, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        uint32_t aa = __sinf(a) * a;
        uint32_t bb = __cosf(b) * b;
        uint32_t p = src[i] * aa + bb;
        uint32_t q = src[i] * bb + aa;
        for (int k = 0; k < REPS; k++) {
#pragma unroll DEPTH
            for (int j = 0; j < DEPTH; j++) {
                p = imad_mix (p, bb, aa);
                q = imad_mix (q, bb, aa);
            }
        }
        dst[i] = p * q;
    }
}    

int main (void)
{
    double start, stop, elapsed, mintime=1e308, nbr_of_imad;
    uint32_t *d_a, *d_b;
    struct cudaDeviceProp props;

    printf ("testing INT32 op throughput with IMAD (one IMAD = two int ops)\n");

    CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, DEVICE_ORDINAL));
    printf ("running on device %d (%s)\n", DEVICE_ORDINAL, props.name);

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * LEN));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * LEN));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * LEN)); // zero

    /* Compute execution configuration */
    dim3 dimBlock(THREADS_PER_BLK);
    int threadBlocks = (LEN + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    
    printf ("using %d threads per block, %d blocks, %f GB used\n", 
            dimBlock.x, dimGrid.x, 2*1e-9*LEN*sizeof(d_a[0]));

    for (int k = 0; k < ITER; k++) {
        cudaDeviceSynchronize();
        start = second();
        kernel<<<dimGrid,dimBlock>>>(d_a, d_b, 0x5da07326, 0x5102d832, LEN);
        CHECK_LAUNCH_ERROR();
        stop = second();
    }
    elapsed= stop - start;
    if (elapsed < mintime) { mintime = elapsed; }
    nbr_of_imad = (2.0 * DEPTH * REPS + 3.0) * LEN * 3;
    printf ("iops=%13.6e  elapsed=%.5f sec  throughput=%.5f Tiops (via IMAD)\n",
            nbr_of_imad * 2, mintime, nbr_of_imad * 2 *1e-12/mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

I think bug not important. I think more important that no 2x int32 on 5080 anyway, even if not memory bound.

ps: I got PCIe 5.0 issue with my 5080 also as many other customersā€¦

Sorry, I did not look well, when answering.
Patrick was talking about the cache and 64 bit accesses, and I was looking, where in this thread they were used.
I saw some memory accesses with .64 in them in your SASS.
But it was only for loading once from constant memory and otherwise the 64-bit registers appeared in the index/address brackets.

For actual 64-bit access SASS would show vector loads and stores.

@patrickdurbin24: What were you actually/originally referring to in regards to 64 bits?

FWIW, on my L4 GPU, the output is:


testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA L4)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12  elapsed=0.32922 sec  throughput=15.03257 Tiops (via IMAD)

Still no x2 for 5080 :(

4080 (sm_89):

testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA GeForce RTX 4080)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12 elapsed=0.16675 sec throughput=29.67852 Tiops (via IMAD)

5080:

sm_89:
testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA GeForce RTX 5080)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12 elapsed=0.15847 sec throughput=31.22922 Tiops (via IMAD)

sm_100:
testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA GeForce RTX 5080)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12 elapsed=0.15761 sec throughput=31.40069 Tiops (via IMAD)

sm_101:
testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA GeForce RTX 5080)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12 elapsed=0.15831 sec throughput=31.26056 Tiops (via IMAD)

sm_120:
testing INT32 op throughput with IMAD (one IMAD = two int ops)
running on device 0 (NVIDIA GeForce RTX 5080)
using 256 threads per block, 524160 blocks, 1.073480 GB used
iops= 4.949010e+12 elapsed=0.15761 sec throughput=31.40041 Tiops (via IMAD)

It is obviously possible that my quickly constructed benchmark tests are flawed in some way. But I reviewed them last night and could not find any issues. I checked the SASS and the expected number of instructions are being generated. There should be enough instruction level and thread level parallelism exposed, and the dependency chains should be long enough to avoid becoming memory bandwidth bound.

I am starting to wonder whether some of the lower than expected throughput in some of these tests may have to do with the secret op-steering information emitted by the compiler ptxas not being ā€œoptimalā€, or the register allocation strategy leading to register bank conflicts. Nsight Compute may be able to provide some clues.

Does this matter in the big picture? Probably not. Generally speaking, these modern high-end GPUs are likely to hit the memory bandwidth wall in many real-life use cases. My humble middle-of-the-line Quadro RTX 4000 from six years ago has a (theoretical) memory bandwidth of 416 GB/sec, yet the RTX 5080 with six times the computational horsepower brings only 960 GB/sec to the table.

As for the PCIe 5 issues some see with the RTX 5080, I read about those in one of the online reviews but could not tell how widespread those are. Normally hardware vendors conduct a ā€œplug festā€ for (relatively) new interconnects to ensure that any device X operates smoothly with any platform Y, and resolve any issues prior to shipping product (this frequently involves some amount of finger pointing as to which side is not properly adhering to the spec :-)

Your test is good. I think there is something on nvidia side. Needs updated driver, cuda runtime or compiler to open up Blackwell arch capabilities. My apps definitely will benefit from this x2 if it ever possibleā€¦ This is why Iā€™m a little disappointed now.

Can someone try this bench?
This generates a basically uninterrupted IMAD stream in SASS.
I get about 22 TOPs on my 4090.
Note, I count one IMAD as one op, whereas NVIDIAā€™s whitepaper counts one IMAD as two ops (1x mul, 1x add). So this matches their whitepaper nicely.
My 4090 is liquid cooled and slightly overclocked, so it makes sense Iā€™m a little higher.

I donā€™t yet have a 50-series card to test yet :(

#include <cstdint>
#include <chrono>
#include <iostream>

#define BLOCK_SIZE 1024
#define N_REGISTERS 32
#define ITER 10000000

__device__ __forceinline__ void imad(uint32_t &acc, uint32_t a, uint32_t b, uint32_t c) {
    asm volatile ("mad.lo.u32 %0, %1, %2, %3;\n"
        : "=r"(acc)
        : "r"(a), "r"(b), "r"(c));
}

__global__ void k_imad_bench(uint32_t *res) {
    uint32_t r[N_REGISTERS] = {0};

    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < N_REGISTERS; i++) {
        r[i] = i + tid;
    }

    for (int i = 0; i < ITER; i++) {
#pragma unroll
        for (int j = 0; j < N_REGISTERS; j++) {
            imad(r[j], r[j], r[j], r[j]);
        }
    }

    for (int i = 0; i < N_REGISTERS; i++) {
        res[tid] ^= r[i];
    }
}

int main() {
    cudaDeviceProp dev_prop;
    cudaGetDeviceProperties(&dev_prop, 0);
    uint32_t *res;
    uint64_t total_threads = dev_prop.multiProcessorCount * BLOCK_SIZE;
    cudaMallocManaged(&res, sizeof(uint32_t) * total_threads);
    auto start = std::chrono::high_resolution_clock::now();
    k_imad_bench<<<dev_prop.multiProcessorCount, BLOCK_SIZE>>>(res);
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> elapsed = end - start;
    double ms = elapsed.count();
    uint64_t total_instructions = total_threads * (uint64_t) ITER * N_REGISTERS;
    double seconds = ms / 1000.0;
    double throughput = total_instructions / seconds;
    printf("Throughput: %e IMAD/sec\n", throughput);
    return 0;
}

4080 sm_89
Throughput: 1.447048e+13 IMAD/sec

5080 sm_89
Throughput: 1.522615e+13 IMAD/sec

5080 sm_120
Throughput: 1.542646e+13 IMAD/sec

1 Like

:( disappointing. Hopefully itā€™s a compiler/driver/firmware issue. Hopefully we will get a fix in CUDA version 13

For the sake of interest just ran all above tests on Ubuntu 24.04.1 with latest 570.86.16 beta driver. Same result. No 2x for 5080.

1 Like

Would be nice to hear from Nvidia on this issue. Itā€™s in the whitepaper, but the community canā€™t replicate it.

I guess while weā€™re waiting, could you also do a benchmark for simply copying some data to and from the 5080? Assuming you have a PCIe 5.0 capable motherboard/CPU, can we actually reach PCIe 5.0 speeds for a simple copy?

5080 PCIe 5.0 x16
bandwidthTest
[CUDA Bandwidth Test] - Startingā€¦
Running onā€¦

Device 0: NVIDIA GeForce RTX 5080
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 53.1

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 39.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 2823.4

4090 PCIe 4.0 x16
bandwidthTest
[CUDA Bandwidth Test] - Startingā€¦
Running onā€¦

Device 0: NVIDIA GeForce RTX 4090
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 23.9

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 26.0

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 3330.3