Forcing store to write to HBM

nish511 · August 28, 2023, 11:26pm

I am trying to benchmark the memory bandwidth of datacenter GPUs (V100/A100/H100) using the following kernel:

microsoft/superbenchmark/blob/27a10811afb4f2f9c5404d02b1056391f14f4b1a/superbench/benchmarks/micro_benchmarks/gpu_copy_performance/gpu_copy.cu#L523


      
          // This kernel references the implementation in
          // 1) NCCL:
          // https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
          // 2) RCCL:
          // https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
          inline __device__ void StoreULong2(ulong2 *p, ulong2 &v) {
          #if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
              p->x = v.x;
              p->y = v.y;
          #else
              asm volatile("st.volatile.global.v2.u64 [%0], {%1,%2};" ::"l"(p), "l"(v.x), "l"(v.y) : "memory");
          #endif
          }
          
          // Fetch data from source memory into register first, and then write them to target memory
          // Stride set to thread block size to best utilize cache
          __global__ void SMCopyKernel(ulong2 *tgt, const ulong2 *src) {
              uint64_t index = blockIdx.x * blockDim.x * NUM_LOOP_UNROLL + threadIdx.x;
              ulong2 val[NUM_LOOP_UNROLL];
          #pragma unroll
              for (uint64_t i = 0; i < NUM_LOOP_UNROLL; i++)

This code loads data from HBM and stores it back to a different location in HBM.

What I am observing is that st.volatile.global.v2.u64 writes the data to the L2 cache, and the data is never written back to HBM. I confirmed this with an ncu profile.

How can I make sure the stores actually reach HBM, so that I benchmark both reads and writes?

njuffa · August 29, 2023, 12:13am

To measure main memory bandwidth accurately one needs to write or copy blocks of data significantly larger than the last-level cache size. Even with a cache using a write-back policy, this will cause the vast majority of the data to be pushed out to main memory because of capacity misses.
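The capacity argument above can be turned into a rough lower bound. The following is a minimal sketch, not a measurement tool: it assumes a pure write-back cache that can retain at most its full capacity as dirty data, so at least the remainder of the buffer must have been evicted to main memory. The function name `min_fraction_to_memory` and the sizes used with it are illustrative assumptions, not values taken from any particular GPU.

```c
#include <stdint.h>

/* Lower bound on the fraction of a written buffer that must have reached
 * main memory, assuming (hypothetically) a write-back cache that can hold
 * at most cache_bytes of dirty data when the write finishes. */
double min_fraction_to_memory(uint64_t buffer_bytes, uint64_t cache_bytes)
{
    /* A buffer that fits in the cache may never leave it. */
    if (buffer_bytes == 0 || buffer_bytes <= cache_bytes) {
        return 0.0;
    }
    return 1.0 - (double)cache_bytes / (double)buffer_bytes;
}
```

With an assumed 40 MiB cache, a 400 MiB buffer gives a bound of 0.9, while a 40 MiB buffer gives 0, which is consistent with small copies staying entirely in L2.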

nish511 · August 29, 2023, 12:52am

If I also want to characterize the latency of memory reads and writes, I would need a small data size.

Is there no way to force write-through in L2?

njuffa · August 29, 2023, 1:05am

You will need to devise different tests for different purposes. Block copies and block stores can tell us something about the throughput of caches and main memory.

For measuring latency, I have used pointer chasing in the past, on CPUs. Using an LFSR, such a test can visit 2ⁿ-1 locations in a memory block of size 2ⁿ elements in “random” order. GPUs are designed as throughput machines, so I have had no need to measure latency. Other approaches for measuring latency likely exist, check the literature and open source software.
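As a small illustration of the LFSR property mentioned above, the sketch below counts how many states a right-shifting Galois LFSR visits before returning to its starting state. The update rule and the mask 0xb8 (the n=8 entry) are the ones used in the program in this post; the function names `lfsr_next` and `lfsr_period` are hypothetical helpers introduced only for this example.

```c
/* One step of a right-shifting Galois LFSR: shift right, and XOR in the
 * feedback mask whenever the bit shifted out was 1. */
unsigned lfsr_next(unsigned state, unsigned mask)
{
    return (state & 1u) ? ((state >> 1) ^ mask) : (state >> 1);
}

/* Number of steps until the sequence starting at state 1 returns to 1.
 * For a maximal-length n-bit mask this is 2^n - 1: every nonzero n-bit
 * value is visited exactly once, and 0 is never reached. */
unsigned lfsr_period(unsigned mask)
{
    unsigned state = 1u;
    unsigned count = 0u;
    do {
        state = lfsr_next(state, mask);
        count++;
    } while (state != 1u);
    return count;
}
```

Calling `lfsr_period(0xb8)` returns 255, that is 2⁸-1. Because consecutive states look scattered across the block, a chain of pointers laid out this way tends to defeat simple hardware prefetching, which is the point of using it for a latency test.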

As for forcing write-through in L2: I don’t know; check the documentation.

Below is an example program I used to measure latencies on a Xeon E3-1270 (Ivy Bridge) CPU. It prints:

n=8 count=255 size=2040 bytes
elapsed = 2.93250196e-007  per pointer: 4.25500284e+000 cycles
n=9 count=511 size=4088 bytes
elapsed = 5.86500391e-007  per pointer: 4.24667602e+000 cycles
n=10 count=1023 size=8184 bytes
elapsed = 8.79634172e-007  per pointer: 3.18147257e+000 cycles
n=11 count=2047 size=16376 bytes
elapsed = 2.34576873e-006  per pointer: 4.24003142e+000 cycles
n=12 count=4095 size=32760 bytes
elapsed = 4.39840369e-006  per pointer: 3.97413764e+000 cycles    <<<<
n=13 count=8191 size=65528 bytes
elapsed = 1.87661499e-005  per pointer: 8.47695697e+000 cycles
n=14 count=16383 size=131064 bytes
elapsed = 4.54494730e-005  per pointer: 1.02644845e+001 cycles
n=15 count=32767 size=262136 bytes
elapsed = 1.00282137e-004  per pointer: 1.13237070e+001 cycles    <<<<
n=16 count=65535 size=524280 bytes
elapsed = 4.49803192e-004  per pointer: 2.53951600e+001 cycles
n=17 count=131071 size=1048568 bytes
elapsed = 1.08961470e-003  per pointer: 3.07587063e+001 cycles
n=18 count=262143 size=2097144 bytes
elapsed = 2.32290826e-003  per pointer: 3.27865347e+001 cycles   <<<<

From this we see that the L1 cache is 32 KB in size with an access latency of 4 cycles; the L2 cache is 256 KB in size with an access latency of 11-12 cycles; the L3 cache has an access latency of about 33 cycles. Note that the timing methodology used in the program has a time resolution of only slightly under one microsecond. Also, I ran this program on a partially busy machine and did not pin it to one CPU core, so the results are probably noisier than necessary.
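To make the "cycles per pointer" figures easy to check by hand, here is a minimal sketch of the conversion from elapsed seconds to cycles per access. It assumes the core really runs at the nominal clock frequency for the whole measurement, which may not hold with frequency scaling; the function name `cycles_per_access` is a hypothetical helper, not part of the program below.

```c
/* Convert a total elapsed time into average cycles per dependent load.
 * elapsed_s : wall-clock time for the whole pointer chase, in seconds
 * count     : number of pointers in the chain
 * freq_hz   : assumed (nominal) core clock frequency in Hz */
double cycles_per_access(double elapsed_s, int count, double freq_hz)
{
    return (elapsed_s * freq_hz) / (double)count;
}
```

For the n=12 row, `cycles_per_access(4.39840369e-6, 4095, 3.7e9)` evaluates to roughly 3.974, matching the printed value. The same arithmetic shows why the smallest sizes are the noisiest: their total elapsed time is comparable to the timer resolution.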

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

#define PROC_FREQ  (3700000000LL) // assumed core clock: 3.7 GHz
#define MAX_POW    (18) // use up to 2**18 pointers

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif


volatile uintptr_t ptr_array [1 << MAX_POW];

int main (void) 
{
    const int lfsr_mask [MAX_POW+1] = { 0, 0, 0x3, 0x6, 0xc, 0x14, 0x30, 
                                        0x60, 0xb8, 0x110, 0x240, 0x500,
                                        0xe08, 0x1c80, 0x3802, 0x6000,
                                        0xd008, 0x12000, 0x20400};
    double start, stop, elapsed;
    int count, mask, state, new_state;

    for (int n = 8; n < (MAX_POW+1); n++) {
        /* use LFSR to initialize array */
        mask = lfsr_mask [n];
        count = 0;
        state = 1;
        do {
            new_state = (state & 1) ? ((state >> 1) ^ mask) : (state >> 1);
            ptr_array [state] = (uintptr_t)(&ptr_array [new_state]);
            state = new_state;
            count++;
        } while (state != 1);
        printf ("n=%d count=%d size=%d bytes\n", 
                n, count, count * (int)(sizeof(*ptr_array)));
        
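        /* chase the pointers; the chase is repeated three times and only the
           timing of the last pass is used, so earlier passes act as warm-up */
        for (int j = 0; j < 3; j++) {
            volatile uintptr_t *addr = &ptr_array[1];
            start = second();
            /* each load depends on the previous one; the volatile pointee
               keeps the compiler from removing the loads */
            for (int i = 1; i < count; i++) {
                addr = (uintptr_t *)(*addr);
            }
            stop = second();
        }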
        elapsed = stop - start;
        printf ("elapsed = %15.8e  per pointer: %15.8e cycles\n", 
                elapsed, (elapsed / count) / (1.0 / PROC_FREQ));
    }
    return EXIT_SUCCESS;
}
