Memory-intensive applications? How to know the resource bottleneck?

bowuwm · January 26, 2016, 10:35pm

Many studies categorize applications as compute-intensive and memory-intensive. My understanding of memory intensive applications are those whose throughput is bounded by the device memory bandwidth. However, after profiling my application, I got confused. The reason is that its requested memory throughput is the same as the actual throughput. Does it mean the memory system is powerful enough to supply enough data? But the main reason for instruction issue stalls is data requests (>55%). How do I explain this?

BulatZiganshin · January 28, 2016, 3:36pm

can you show more details, in particular the profiler output?

bowuwm · January 28, 2016, 6:20pm

I used the idea of persistent threads to transform the kernel, such that a small number of thread blocks can process all the tasks. Run 1 (persistent threads version) and Run 2 (the original version) took almost the same time. Since the profile ouput is very long, I just paste some important metrics. Please let me know if you want to see others.

What’s strange is that Run 1 and Run 2 have very different requested load throughput and actual load throughput, but their execution time is similar.

Run 1:

Executed IPC 1.235529
Requested Global Load Throughput 88.579GB/s
Requested Global Store Throughput 44.290GB/s
Device Memory Read Throughput 88.586GB/s
Device Memory Write Throughput 44.300GB/s
Issue Stall Reasons (Instructions Fetch) 6.51%
Issue Stall Reasons (Execution Dependenc 22.44%
Issue Stall Reasons (Data Request) 37.28%
Issue Stall Reasons (Texture) 0.00%
Issue Stall Reasons (Synchronization) 24.73%
Device Memory Utilization Mid (5)
Load/Store Function Unit Utilization Mid (4)

Run 2:
Executed IPC 0.886840
Requested Global Load Throughput 110.34GB/s
Requested Global Store Throughput 55.168GB/s
Device Memory Read Throughput 110.34GB/s
Device Memory Write Throughput 55.179GB/s
Issue Stall Reasons (Instructions Fetch) 3.40%
Issue Stall Reasons (Execution Dependenc 16.05%
Issue Stall Reasons (Data Request) 51.94%
Issue Stall Reasons (Synchronization) 22.02%
Device Memory Utilization Mid (6)
Load/Store Function Unit Utilization Low (3)

njuffa · January 28, 2016, 6:27pm

What GPU are you using? You would want to compare the actual memory throughput to the theoretical throughput specified for the GPU. With well coalesced memory accesses, and without ECC, you should be able to achieve up to 85% of the theoretical maximum.

bowuwm · January 28, 2016, 6:45pm

I used both K40 and Titan Z and observed very similar results. The benchmark is just vector addition. I’m confused because the memory throughput seems not the bottleneck, but the two runs took almost the same time. The bottleneck must be somewhere else. Maybe it’s the load unit. Not sure about that.

njuffa · January 28, 2016, 7:10pm

The theoretical memory throughput of the Tesla K40 is listed as 288 GB/sec. With ECC off, expect maximum memory throughput of around 240 GB/sec, with ECC on about 210 GB/sec. Your profiler data seems to suggest a memory throughput of <= 165 GB/sec. Maybe your vectors are too short, or use 32-bit accesses (instead of 64-bit or 128 bit accesses).

Note that various NVIDIA GPUs require the use of application clocks set higher than the default clock to achieve the maximum memory throughput. nvidia-smi will show the available application clocks for your device, and allow you to set the application clock with the -ac switch.

Below is my DCOPY code that I use to measure GPU memory throughput via a straight copy operation, which you could use as a comparison. You could also modify it to do vector addition.

#include <stdlib.h>
#include <stdio.h>

#define DCOPY_THREADS  128
#define DCOPY_DEFLEN   20000000
#define DCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void dcopy (const double * __restrict__ src, 
                       double * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct dcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct dcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double *d_a, *d_b;
    int errors;
    struct dcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : DCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(DCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("dcopy: operating on vectors of %d doubles (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("dcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < DCOPY_ITER; k++) {
        start = second();
        dcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("dcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

Robert_Crovella · January 30, 2016, 4:47pm

Requested memory throughput matching actual throughput probably just means that your loads are well-coalesced.

The memory system still has a latency associated with it, as well as a bandwidth. Data requests that are suffering latency can still cause issue stalls.

Topic		Replies	Views
Global memory access bottleneck CUDA Programming and Performance	8	3551	September 4, 2015
Effective global memory bandwidth? CUDA Programming and Performance	17	17654	September 18, 2007
Profiling my code I need some help to understand the output of the visual profiler CUDA Programming and Performance	5	1908	February 3, 2012
Method to get global memory throughput of GPU questions about Sylvain's code CUDA Programming and Performance	1	1714	March 15, 2010
Visual debugger to see if mem access is coalesced CUDA Programming and Performance	7	1080	November 1, 2011
How to know where the bottleneck is? CUDA Programming and Performance	3	4307	February 29, 2008
Memory intensive CUDA benchmarks CUDA Programming and Performance	0	517	June 12, 2018
Squeasing max d2d memory bandwidth (GTX 480) CUDA Programming and Performance	15	7093	November 2, 2010
Memory throughput problem between host and device with pinned memory CUDA Programming and Performance	14	1932	April 28, 2015
Theoretical ON-CHIP Bandwidth how to determine? CUDA Programming and Performance	15	11706	October 16, 2009

Memory-intensive applications? How to know the resource bottleneck?

Related topics