Memory-intensive applications: how to identify the resource bottleneck?

Many studies categorize applications as compute-intensive or memory-intensive. My understanding is that memory-intensive applications are those whose throughput is bounded by the device memory bandwidth. However, after profiling my application, I got confused: its requested memory throughput is the same as the actual throughput. Does that mean the memory system is powerful enough to supply the data? Yet the main reason for instruction issue stalls is data requests (>55%). How do I explain this?

Can you show more details, in particular the profiler output?

I used the idea of persistent threads to transform the kernel, so that a small number of thread blocks can process all the tasks. Run 1 (the persistent-threads version) and Run 2 (the original version) took almost the same time. Since the profiler output is very long, I have pasted only the important metrics. Please let me know if you want to see others.

What’s strange is that Run 1 and Run 2 have very different requested load throughput and actual load throughput, but their execution time is similar.
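Roughly, the transformed kernel has this shape (a simplified sketch, not the exact code; the real kernel does more work per element):

```cuda
// Persistent-threads / grid-stride sketch (Run 1): a fixed, small grid
// strides over all n elements instead of launching one thread per element.
__global__ void vecadd_persistent (const float * __restrict__ a,
                                   const float * __restrict__ b,
                                   float * __restrict__ c, int n)
{
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}
```

Run 2 is the conventional one-thread-per-element launch of the same addition.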

Run 1:

Executed IPC 1.235529
Requested Global Load Throughput 88.579GB/s
Requested Global Store Throughput 44.290GB/s
Device Memory Read Throughput 88.586GB/s
Device Memory Write Throughput 44.300GB/s
Issue Stall Reasons (Instructions Fetch) 6.51%
Issue Stall Reasons (Execution Dependency) 22.44%
Issue Stall Reasons (Data Request) 37.28%
Issue Stall Reasons (Texture) 0.00%
Issue Stall Reasons (Synchronization) 24.73%
Device Memory Utilization Mid (5)
Load/Store Function Unit Utilization Mid (4)

Run 2:
Executed IPC 0.886840
Requested Global Load Throughput 110.34GB/s
Requested Global Store Throughput 55.168GB/s
Device Memory Read Throughput 110.34GB/s
Device Memory Write Throughput 55.179GB/s
Issue Stall Reasons (Instructions Fetch) 3.40%
Issue Stall Reasons (Execution Dependency) 16.05%
Issue Stall Reasons (Data Request) 51.94%
Issue Stall Reasons (Synchronization) 22.02%
Device Memory Utilization Mid (6)
Load/Store Function Unit Utilization Low (3)

What GPU are you using? You would want to compare the actual memory throughput to the theoretical throughput specified for the GPU. With well coalesced memory accesses, and without ECC, you should be able to achieve up to 85% of the theoretical maximum.

I used both a K40 and a Titan Z and observed very similar results. The benchmark is just vector addition. I’m confused because the memory throughput does not seem to be the bottleneck, yet the two runs took almost the same time. The bottleneck must be somewhere else. Maybe it’s the load unit. Not sure about that.

The theoretical memory throughput of the Tesla K40 is listed as 288 GB/sec. With ECC off, expect a maximum memory throughput of around 240 GB/sec, with ECC on about 210 GB/sec. Your profiler data seems to suggest a memory throughput of <= 165 GB/sec. Maybe your vectors are too short, or you use 32-bit accesses (instead of 64-bit or 128-bit accesses).
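To illustrate the 128-bit access point, the grid-stride addition can be written with double2 (a sketch only; it assumes the element count is even and the pointers are 16-byte aligned, which cudaMalloc allocations are):

```cuda
// Sketch: 128-bit (16-byte) accesses via double2 instead of 64-bit loads.
__global__ void dadd2 (const double2 * __restrict__ a,
                       const double2 * __restrict__ b,
                       double2 * __restrict__ c, int n2)   // n2 = n / 2
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n2; i += stride) {
        double2 x = a[i];                         // one 16-byte load
        double2 y = b[i];
        c[i] = make_double2 (x.x + y.x, x.y + y.y); // one 16-byte store
    }
}
```

Launch it with the original pointers cast to double2* and half the element count; each thread then moves twice as many bytes per memory instruction.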

Note that various NVIDIA GPUs require the use of application clocks set higher than the default clock to achieve the maximum memory throughput. nvidia-smi will show the available application clocks for your device, and allow you to set the application clock with the -ac switch.
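For example (the 3004,875 pair is what a K40 typically reports as its highest supported clocks; your board may differ, and setting clocks requires root):

```shell
# List the supported <memory,graphics> application clock pairs
nvidia-smi -q -d SUPPORTED_CLOCKS
# Set application clocks (memory MHz, graphics MHz)
nvidia-smi -ac 3004,875
# Reset to the default application clocks
nvidia-smi -rac
```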

Below is my DCOPY code that I use to measure GPU memory throughput via a straight copy operation, which you could use as a comparison. You could also modify it to do vector addition.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // memset
#include <math.h>     // fabs, log

#define DCOPY_THREADS  128
#define DCOPY_DEFLEN   20000000
#define DCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void dcopy (const double * __restrict__ src, 
                       double * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}

struct dcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct dcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double *d_a, *d_b;
    int errors;
    struct dcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : DCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(DCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    printf ("dcopy: operating on vectors of %d doubles (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("dcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));  // initialize to +INF
    for (int k = 0; k < DCOPY_ITER; k++) {
        start = second();
        dcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();  // synchronizes, so the timing is meaningful
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("dcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

Requested memory throughput matching actual throughput probably just means that your loads are well-coalesced.

The memory system still has a latency associated with it, as well as a bandwidth. Data requests that are waiting out that latency can still cause issue stalls even when bandwidth is not saturated.