why load vector4 not faster than single load?

Hi, I compared load vector4 and load float, but it seems load vector4 is not faster than load float when i set blocksize=512, could anyone help to explain this?

HW: v100

global void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];

global void device_copy_vector1_kernel(int* d_in, int* d_out, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for(int i = idx; i < N; i += blockDim.x * gridDim.x) {
d_out[i] = d_in[i];


void device_copy_vector4(int* d_in, int* d_out, int N) {
int threads = 512;
int blocks = min((N/4 + threads-1) / threads, MAX_BLOCKS);

device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);

int main()
const int size = 1024000;
int *in = new int;
int *out = new int;

    for (int i = 0; i < size; ++i){
        in[i] = i;
    int *d_i, *d_o;
    cudaMalloc((void**)&d_i, sizeof(int)*size);
    cudaMalloc((void**)&d_o, sizeof(int)*size);
    cudaMemcpy(d_i, in, sizeof(int)*size, cudaMemcpyHostToDevice);

    device_copy_vector4((int*)d_i, (int*)d_o, size);
    delete []in;
    delete []out;
    return 0;


Let me try an analogy: The hose from which you are drinking always delivers the same amount of water, regardless of whether you take frequent small sips or less frequent large gulps.

Memory chips can only deliver data at a certain maximum rate (primarily a function of the bit-width of the memory interface and its operating frequency). If that rate is already exhausted using narrow accesses, using wider accesses won’t cause any more data to flow.

There may be other bottlenecks in a processor that favor the use of large accesses. Such a bottleneck existed in early CUDA-enabled GPUs, where throughput could also be limited due to the limited depth of the load/store queue. Since each access would take up an entry in the queue, regardless of width, using wide accesses allowed for queuing up more total work. That was more than ten years ago and this particular bottleneck no longer exists.

Hi njuffa, thanks very much for replying.

I used nvprof to get throughput which is 331.14GB/s, this number actually not achieve v100’s memory bandwidth 900GB/s, so I think i did not make full use of it. Maybe I want to know, how to achieve the theory throughput if I cannot do that with load vector4?

I didn’t look at your code. Reviewing other people’s code brings me no joy. Below is ready-to-run code you can use to measure the bandwidth of your GPU. You can also take a look at the bandwidthTest sample app that is distributed with CUDA.

The code below uses the 8-byte double type, but you can easily change it to use other types. The GPU hardware supports 4-byte, 8-byte, and 16-byte memory accesses, and my expectation is that you will find that regardless of access width the memory throughput will be within 2% of each other.

In general expect to see measured throughput maxing out at around 80% of theoretical throughput. That applies to GPUs but also to the system memory attached to your CPU, and is essentially due to various DRAM limitations.

#include <stdlib.h>
#include <stdio.h>

#define DCOPY_THREADS  128
#define DCOPY_DEFLEN   20000000
#define DCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#include <windows.h>
double second (void)
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
#error unsupported platform

__global__ void dcopy (const double * __restrict__ src, 
                       double * __restrict__ dst, int len)
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];

struct dcopyOpts {
    int len;

static int processArgs (int argc, char *argv[], struct dcopyOpts *opts)
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
    return error;

int main (int argc, char *argv[])
    double start, stop, elapsed, mintime;
    double *d_a, *d_b;
    int errors;
    struct dcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    opts.len = (opts.len) ? opts.len : DCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(DCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    printf ("dcopy: operating on vectors of %d doubles (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("dcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < DCOPY_ITER; k++) {
        start = second();
        dcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    printf ("dcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;

I know why I got different result…Because I use “nvprof --metrics gld_throughput” to get bandwidth, but you calculate bandwidth with time cost, so can I trust nvprof’s result…

gld_throughput is global load throughput. It only takes into account loads, not stores, but your copy routine uses bandwidth (equal amounts, roughly) for loads AND stores.

So if you only used gld_throughput, I believe your method is broken.

gld_throughput also doesn’t tell you what is happening in the cache, exactly. I realize bulk copying shouldn’t have much cache benefit, but in the general case it might not be a good idea to compute achieved memory bandwidth using that metric.

Instead there are metrics that go directly at the dram interface, such as the dram_* metrics.

Thanks Robert.
It seems no need to do optimization for load and store on today’s GPU, because they can do better…

you should also try increasing your problem size from a miniscule 10^6 elements.

thanks Jimmy, I tried 10^9 size, ldv4 is better than ld1