Why is a vector4 load not faster than a single load?

Hi, I compared vector4 loads against single-element loads, but the vector4 version does not seem to be any faster when I set blocksize=512. Could anyone help explain this?

HW: V100

#define MAX_BLOCKS 65535   // assumed value; the actual definition is omitted from the snippet

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // each iteration moves one int4, i.e. four ints (16 bytes)
    for (int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
        reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
    }
}

__global__ void device_copy_vector1_kernel(int* d_in, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // each iteration moves one int (4 bytes)
    for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
        d_out[i] = d_in[i];
    }
}

void device_copy_vector4(int* d_in, int* d_out, int N) {
    int threads = 512;
    int blocks = min((N/4 + threads - 1) / threads, MAX_BLOCKS);

    device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);
}

int main()
{
    const int size = 1024000;
    int *in  = new int[size];
    int *out = new int[size];

    for (int i = 0; i < size; ++i) {
        in[i] = i;
    }
    int *d_i, *d_o;
    cudaMalloc((void**)&d_i, sizeof(int)*size);
    cudaMalloc((void**)&d_o, sizeof(int)*size);
    cudaMemcpy(d_i, in, sizeof(int)*size, cudaMemcpyHostToDevice);

    device_copy_vector4(d_i, d_o, size);
    cudaDeviceSynchronize();    // wait for the kernel to finish before cleanup

    cudaFree(d_i);
    cudaFree(d_o);
    delete [] in;
    delete [] out;
    return 0;
}
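
For completeness, one way to time the two kernels against each other (a sketch only, using CUDA events; it assumes a device_copy_vector1 launcher analogous to device_copy_vector4, which is not shown above):

    cudaEvent_t start, stop;
    float ms_v1 = 0.0f, ms_v4 = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time the scalar copy (assumes a device_copy_vector1 launcher exists)
    cudaEventRecord(start);
    device_copy_vector1(d_i, d_o, size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_v1, start, stop);

    // time the vectorized copy
    cudaEventRecord(start);
    device_copy_vector4(d_i, d_o, size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_v4, start, stop);

    printf("vector1: %.3f ms   vector4: %.3f ms\n", ms_v1, ms_v4);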

Let me try an analogy: The hose from which you are drinking always delivers the same amount of water, regardless of whether you take frequent small sips or less frequent large gulps.

Memory chips can only deliver data at a certain maximum rate (primarily a function of the bit-width of the memory interface and its operating frequency). If that rate is already exhausted using narrow accesses, using wider accesses won’t cause any more data to flow.
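
For concreteness, here is the back-of-the-envelope arithmetic for a V100, assuming the commonly quoted configuration of a 4096-bit HBM2 interface running at 877 MHz with double data rate:

    // theoretical peak = interface width (bytes) * memory clock * 2 (DDR)
    // (4096 bits / 8) * 877e6 Hz * 2 ~= 898e9 bytes/sec, i.e. the ~900 GB/sec spec
    double peak_bw = (4096.0 / 8.0) * 877.0e6 * 2.0;   // ~8.98e11 bytes/sec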

There may be other bottlenecks in a processor that favor the use of large accesses. Such a bottleneck existed in early CUDA-enabled GPUs, where throughput could also be constrained by the limited depth of the load/store queue. Since each access took up one entry in the queue regardless of width, using wide accesses allowed more total work to be queued up. That was more than ten years ago, and this particular bottleneck no longer exists.

Hi njuffa, thanks very much for replying.

I used nvprof to measure the throughput and got 331.14 GB/s. This does not reach the V100's memory bandwidth of 900 GB/s, so I think I did not make full use of it. What I really want to know is: how can I approach the theoretical throughput if vector4 loads cannot get me there?

I didn’t look at your code. Reviewing other people’s code brings me no joy. Below is ready-to-run code you can use to measure the bandwidth of your GPU. You can also take a look at the bandwidthTest sample app that is distributed with CUDA.

The code below uses the 8-byte double type, but you can easily change it to use other types. The GPU hardware supports 4-byte, 8-byte, and 16-byte memory accesses, and my expectation is that, regardless of access width, the measured throughputs will be within 2% of each other.

In general, expect to see measured throughput max out at around 80% of theoretical throughput; for a V100's 900 GB/sec, that is roughly 720 GB/sec. That applies to GPUs but also to the system memory attached to your CPU, and is essentially due to various DRAM limitations.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // memset
#include <math.h>     // fabs, log

#define DCOPY_THREADS  128
#define DCOPY_DEFLEN   20000000
#define DCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void dcopy (const double * __restrict__ src, 
                       double * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct dcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct dcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double *d_a, *d_b;
    int errors;
    struct dcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : DCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(DCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("dcopy: operating on vectors of %d doubles (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("dcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < DCOPY_ITER; k++) {
        start = second();
        dcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("dcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}
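
If you want to check the 16-byte access case mentioned above, a variant of the copy kernel using the 16-byte aligned double2 type could look like this (a sketch, not part of the program above; the count passed in is the number of double2 elements, i.e. half the number of doubles):

__global__ void dcopy16 (const double2 * __restrict__ src,
                         double2 * __restrict__ dst, int len2)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len2; i += stride) {
        dst[i] = src[i];   // one 16-byte load plus one 16-byte store per iteration
    }
}

It could be launched as dcopy16<<<dimGrid,dimBlock>>>((const double2 *)d_a, (double2 *)d_b, opts.len / 2), provided opts.len is even.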

I know why I got a different result: I used “nvprof --metrics gld_throughput” to get the bandwidth, while you calculate bandwidth from the elapsed time. So can I trust nvprof’s result?

gld_throughput is global load throughput. It only takes into account loads, not stores, but your copy routine uses bandwidth (equal amounts, roughly) for loads AND stores.

So if you only used gld_throughput, I believe your method is broken.

gld_throughput also doesn’t tell you what is happening in the cache, exactly. I realize bulk copying shouldn’t have much cache benefit, but in the general case it might not be a good idea to compute achieved memory bandwidth using that metric.

Instead, there are metrics that measure traffic directly at the DRAM interface, such as the dram_* metrics.
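
To make that concrete, the effective bandwidth of a copy should count the bytes moved in both directions, which is what the 2.0e-9 scale factor in the dcopy code above already does (a sketch reusing opts.len and mintime from that code):

/* a copy reads N doubles and writes N doubles, so count both directions */
double bytes_moved   = 2.0 * sizeof(double) * (double)opts.len;
double effective_gbs = bytes_moved / mintime / 1.0e9;

If you prefer to profile, the dram_read_throughput and dram_write_throughput metrics (as opposed to gld_throughput alone) capture both directions at the DRAM interface.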

Thanks Robert.
It seems there is no need to optimize load and store width on today’s GPUs, because the hardware already handles it well…

You should also try increasing your problem size from a minuscule 10^6 elements.

Thanks Jimmy, I tried a 10^9 element size, and ldv4 is better than ld1.