Theoretical/real shared/DRAM peak memory throughput

Hi

The following questions assume the Kepler generation.

The peak bandwidth of shared memory is computed as
f_core * #banks * bank_width * #SMs.
For the K80 the result would be:
0.875 GHz * 32 * 8 bytes * 13 = 2912 GB/s.
Is this peak value also theoretically reachable for shared memory loads only, i.e. a kernel that only loads from shared memory?

How does nvprof compute the L1/Shared Memory utilization (nvprof -m l1_shared_utilization)?
Imagine a kernel with an equal number of shared memory load and store transactions that achieves 1000 GB/s each for shared memory load and store throughput on a K80 with ECC enabled.
The theoretical peak for each direction would be 2912/2 = 1456 GB/s, yet nvprof reports “High (8)” for the 1000 GB/s.
Does this “High (8)” take ECC penalties and the ratio of loads to stores into account?

For DRAM there is a gap of about 20% between theoretical and actually achievable peak bandwidth.
Is there a similar gap for shared memory bandwidth?

Best regards, and thank you for your consideration!

There are no “ECC penalties” for shared memory.

Shared memory supports one access per cycle on Kepler. You should be able to witness “full” bandwidth even for a sustained sequence of read cycles.

The utilization metrics are all based on a scale of 0 to 10, where 10 is the highest utilization and 0 is the lowest. It should be approximately equal to:

1 = 10% of peak utilization
2 = 20% of peak utilization
3 = 30% of peak utilization
etc.

Based on your data (you’ve offered no actual verifiable test case) I would expect the profiler to report 7 instead of 8. If your question is ultimately why the profiler reports 8 instead of 7, I can’t answer that. It’s possible, for example, that the K80 is in some clock-boost state other than what you expect (e.g. base clocks vs. expected boost), which is clouding your understanding of the number. The 875 MHz number is a fully boosted clock. It may also be that L1 activity is contributing. The utilization numbers are approximate and are primarily intended for relative comparisons. If you want precise numbers, it’s recommended that you use the throughput metrics and apply whatever arithmetic is necessary.

AFAIK there is no similar reduction for shared memory. DRAM has a number of overheads that shared memory does not.

DRAM also suffers from read/write turn-around overhead. My understanding is that the GPU performs some limited re-ordering within its load/store facilities to reduce this overhead, at the expense of added latency (which is fine in the context of the GPU’s throughput-oriented architecture).

It has been a long time since I studied DRAM implementation details, so I cannot say how big read/write turnaround overhead is relative to the address/data multiplexing overhead mentioned by txbob, but I seem to recall the cost is typically about 15% of theoretical bandwidth.

I also decided that my DRAM knowledge is somewhat dated and removed most of the details from my previous post. SDRAM can achieve nearly peak theoretical bandwidth for perfectly optimal access patterns (which include a requirement for no bus turnarounds), but I’m not familiar enough with the mapping of memory segments, as seen by the memory controller, to actual banks within SDRAM partitions to be able to say what would be required to achieve this from a CUDA memory address standpoint. It may be trivial, or nearly impossible, to achieve in practice; I’m not sure.

I also think SDRAM still suffers from some refresh overhead, but I think this is potentially quite small (~0.4%).

Nevertheless, observed CUDA memory bandwidths usually incur some penalty against the theoretical maximum, on the order of 5-25%. I know of no similar reduction for shared memory.

The last memory technology I was intimately familiar with was the original SDRAM, before they added DDR etc. Refresh overhead was a non-issue already at that stage.

The 15% average read/write overhead number was based on my knowledge of SDRAM (ancient by now!), but my experience since then seems to indicate that the problem has gotten somewhat worse as SDRAM was refined into DDR{2,3,4} and various forms of GDDR. The basic issue seems to be that access latency has improved relatively little, while bandwidth has grown by leaps and bounds, meaning any overhead with a latency component becomes more pronounced in percentage terms.

It might be interesting to run an experiment looking at the throughput of a pure read stream on the host’s DDR4, with proper pre-warming of TLBs etc, then repeat that experiment on the GPU with a “perfect” access pattern.

Real-world scenarios obviously involve both read and write streams, which is what the STREAM benchmark tries to distill into four kernels.

[Later:] This is just an anecdotal reference, but the graphs on this website seem to indicate that a four-channel DDR4-2400 configuration provides identical bandwidth of about 58 GB/sec for both the “read” and the “copy” tests, versus a theoretical bandwidth of 19.2 x 4 = 76.8 GB/sec, for an efficiency of 75.5%: http://www.techspot.com/news/62129-ddr3-vs-ddr4-raw-bandwidth-numbers.html

Not sure why “read” and “copy” throughput numbers are essentially the same, but it would seem to point to a common overhead different from read/write turnaround.

If anybody has recent design experience with DDR4 and GDDR5, I would love to see an explanation of relevant sources of overhead with those.

Hi
thank you very much for your replies.
The L1/Shared Utilization refers to the clock set at runtime, which makes sense anyway. So it always shows e.g. “High (8)”, whether running at 562 MHz, 823 MHz, or 875 MHz. The same holds for alu_fu_utilization and ldst_fu_utilization. dram_utilization probably does not take ECC into account, see below.

Of course, the L1/Shared Utilization also includes L1 transactions, so for a pure metric one has to compute it based on the shared memory throughput values, as you suggested.
And as njuffa said, there are no ECC penalties on shared memory (might that also be the case for the L1 and L2 caches?!).

I also learned from you that there are DRAM-specific overheads such as read/write turn-arounds, limited re-ordering, and address/data multiplexing in device memory. I guess this remains true for HBM2 in the Pascal architecture, since it is just 3D-stacked DRAM?!

For a K80 the DRAM peak bandwidth would be 5010 MHz (effective) * 384/8 bytes = 240.48 GB/s.
I experimented with the kernel from the benchmark in “How to Access Global Memory Efficiently in CUDA C/C++ Kernels” (without offsets and strides):

a[i] = a[i]+1; // 140 GB/s (ECC enabled), measured with cudaEvent
a[i] = 1;      // 170 GB/s (ECC enabled), measured with cudaEvent

The ECC penalty should be around 20%, so the peak bandwidth would then be 192 GB/s. However, these values are based on CUDA event records, which might not be very reliable, so nvprof to the rescue.

a[i] = a[i]+1; // 165 GB/s (ECC enabled), nvprof
a[i] = 1;      // 190 GB/s (ECC enabled), nvprof

Based on the 192 GB/s ECC-adjusted peak bandwidth, the efficiencies range from 86% to 99% here, depending on the read+write or writes-only scenario.

In detail:

nvprof -m dram_write_throughput,dram_read_throughput,dram_utilization,global_replay_overhead,global_cache_replay_overhead ./coalescing

                 Metric Description         Min         Max         Avg

     Device Memory Write Throughput  82.103GB/s  83.095GB/s  82.493GB/s
      Device Memory Read Throughput  80.365GB/s  81.524GB/s  80.884GB/s
          Device Memory Utilization    High (8)    High (8)    High (8)
      Global Memory Replay Overhead    0.000000    0.000000    0.000000
Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

nvprof -m dram_write_throughput,dram_read_throughput,dram_utilization,global_replay_overhead,global_cache_replay_overhead ./coalescing_writes_only

       Device Memory Write Throughput  186.69GB/s  189.80GB/s  188.67GB/s
        Device Memory Read Throughput  19.751MB/s  40.277MB/s  28.610MB/s
            Device Memory Utilization    High (9)    High (9)    High (9)
        Global Memory Replay Overhead    0.000000    0.000000    0.000000
  Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

(K80, 2505 MHz effective DDR memory clock, 823 MHz GPU clock, ECC enabled, 34 runs)

Best Regards.

PS, on my shared memory example, for the sake of completeness: although I cannot provide code at the moment, here are some profiler results (running at 0.823 GHz). I know this is not sufficient for any conclusions.
When I find some time, I’ll extract the code, where I used shared memory (SoA pattern) to circumvent local memory (AoS pattern), and open a separate thread for discussion.

.                   Shared Load Transactions  1080688640
                   Shared Store Transactions  1074085888
 Shared Memory Load Transactions Per Request    1.006147
Shared Memory Store Transactions Per Request    1.000000
               Shared Memory Load Throughput  930.40GB/s
              Shared Memory Store Throughput  924.71GB/s
                L1/Shared Memory Utilization    High (8)
                   L2 Hit Rate (L1 Reads)      50.00%
                 L2 Throughput (L1 Reads)  3.4229MB/s
  L2 Read Transactions (L1 read requests)        1872
L2 Write Transactions (L1 write requests)         936
                L2 Throughput (L1 Writes)  1.7115MB/s

I also used gpumembench from GitHub to get the peak values for shared memory bandwidth; it shows a perfect match with the theoretical peak, at least on our K80.

Why speculate about the bandwidth with ECC turned off when you could just try it by disabling ECC with nvidia-smi and rebooting the machine? Based on my own observations, the ECC overhead is likely smaller than 20% on a K80, as the overhead has been reduced at least twice since NVIDIA’s initial ECC implementation.

When you determine DRAM throughput, keep in mind that wider accesses typically improve throughput, since they make better use of the internal buffering of requests in the load/store units (this effect is less pronounced in newer architectures). So you want 16-byte accesses for measuring maximum achievable throughput. Also, long transfers tend to be more efficient; try tens of megabytes. There are also other artifacts visible when you do a shmoo plot of transfer length versus throughput, although in recent architectures the resulting curve looks fairly flat. So try multiple different memory block sizes.

Hi
True, and I would like to test with ECC on/off too, but I do not have the privileges to control such settings for the K80s in our cluster :/

Thanks for the note on improving DRAM throughput. I already did quite a similar benchmark a few months ago, so let’s have a look at the results.

// output, time measurement with cudaEvents
100 runs, 1048576 int objects packed into int4, 256 blocks per SM, 256 threads per block:
kernel runtime = 0.186335 ms [avg]
dram bandwidth = 180.076 GB/s [avg]
//
// nvprof -m dram_write_throughput,dram_read_throughput,dram_utilization,global_replay_overhead,global_cache_replay_overhead ./arrays 1048576 256

.                 Metric Description         Min         Max         Avg

      Device Memory Write Throughput  96.156GB/s  97.554GB/s  97.016GB/s
       Device Memory Read Throughput  95.125GB/s  96.179GB/s  95.771GB/s
           Device Memory Utilization    High (9)    High (9)    High (9)
       Global Memory Replay Overhead    0.300940    0.300940    0.300940
 Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

Corresponding code:

__global__
void dkernel(int4* values, int n) {
  // Grid-stride loop: read an int4 (16-byte load) and write back its
  // components in reversed order (16-byte store).
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x * gridDim.x)
  {
    int v1 = values[i].x;
    int v2 = values[i].y;
    int v3 = values[i].z;
    int v4 = values[i].w;
    values[i].x = v4;
    values[i].y = v3;
    values[i].z = v2;
    values[i].w = v1;
  }
}

And writes-only (v1,…,v4=1) gives in nvprof:

.    Device Memory Write Throughput  199.49GB/s  201.25GB/s  200.14GB/s
      Device Memory Read Throughput  6.3393MB/s  62.059MB/s  9.5367MB/s
          Device Memory Utilization    High (9)    High (9)    High (9)
      Global Memory Replay Overhead    0.157895    0.157895    0.157895
Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

So about 5% better performance with int4 stores, but the implementations differ too much, so I still have to implement an equivalent one with int stores …

Best Regards.

If you want a “second opinion”, here is the code (zcopy) I have used for several years for measuring memory throughput.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>  // memset in processArgs
#include <math.h>    // fabs/log used to initialize mintime

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   10000000
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct zcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("zcopy: operating on vectors of %d double2s (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("zcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

Thanks a lot for the code. These are the results for 256 threads per block and 2^{24} double2 elements for the read+write kernel and 2^{25} for the writes-only kernel (these seem to be the fastest parameters).

.                Metric Description         Min         Max         Avg

     Device Memory Write Throughput  97.290GB/s  97.672GB/s  97.462GB/s
      Device Memory Read Throughput  93.609GB/s  94.028GB/s  93.808GB/s
          Device Memory Utilization    High (9)    High (9)    High (9)
      Global Memory Replay Overhead    0.142875    0.142875    0.142875
Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

zcopy: operating on vectors of 16777216 double2s (= 2.684e+08 bytes)
zcopy: using 256 threads per block, 65520 blocks
zcopy: mintime = 3.033 msec  throughput = 177.01 GB/sec

Write-only (write a double2-zero):

.    Device Memory Write Throughput  203.33GB/s  203.46GB/s  203.39GB/s
      Device Memory Read Throughput  527.43KB/s  1.0718MB/s  0.00000B/s
          Device Memory Utilization    Max (10)    Max (10)    Max (10)
      Global Memory Replay Overhead    0.214316    0.214316    0.214316
Global Memory Cache Replay Overhead    0.000000    0.000000    0.000000

// program output (throughput computation for writes-only)
zcopy: operating on vectors of 33554432 double2s (= 5.369e+08 bytes)
zcopy: using 256 threads per block, 65520 blocks
zcopy: mintime = 2.913 msec  throughput = 184.30 GB/sec

192 GB/s / 203 GB/s = 0.9458, so the read+write overheads in global memory give a penalty of about 5% on our K80.
203 GB/s is still only about 83% of the theoretical peak of 240 GB/s on the K80 with ECC enabled.

Best Regards.

83% of theoretical bandwidth with ECC enabled sounds about right.

Note that various GPUs can only achieve the maximum possible memory throughput if the GPU core runs sufficiently fast, as a portion of the memory controller seems to operate in that clock domain. To maximize your memory throughput, you would want to force the highest supported application clocks via nvidia-smi (-ac switch).

For real applications, setting the highest supported application clocks can sometimes cause the GPU to exceed the default enforced power limit, leading to clock throttling, so you would want to use nvidia-smi (-pl switch) to force the highest supported enforced power limit.

Yup, thanks for the hints, which led me to another small benchmark. The GPU clock rate determines, at least, how fast the load/store units issue memory instructions. Due to the long latency of global memory, there will be a point of saturation. A read+write benchmark with your zcopy code gives (memory always running at 2505 MHz effective DDR memory speed):

// bandwidth reported by zcopy -n$((1<<25))
875 MHz : 172.57 GB/sec
862 MHz : 172.96 GB/sec
849 MHz : 173.22 GB/sec
836 MHz : 172.60 GB/sec
823 MHz : 172.55 GB/sec
810 MHz : 172.66 GB/sec
797 MHz : 172.96 GB/sec
784 MHz : 172.55 GB/sec
771 MHz : 172.55 GB/sec
758 MHz : 172.57 GB/sec
745 MHz : 173.55 GB/sec
732 MHz : 172.22 GB/sec
719 MHz : 173.94 GB/sec
705 MHz : 172.43 GB/sec
692 MHz : 173.61 GB/sec
679 MHz : 172.12 GB/sec
666 MHz : 171.66 GB/sec
653 MHz : 171.09 GB/sec
640 MHz : 168.27 GB/sec
627 MHz : 164.63 GB/sec
614 MHz : 162.20 GB/sec
601 MHz : 158.07 GB/sec
588 MHz : 154.72 GB/sec
575 MHz : 151.85 GB/sec
562 MHz : 148.49 GB/sec

Below ~666 MHz GPU clock, the rate of issued instructions is too low to saturate the device memory bus.