Hi,
I thought per-thread 32-bit loads (like load1) were the best way to read global memory and reach peak bandwidth, but I have seen code written like load4 elsewhere, and it is consistently 2–3% faster than load1. Why is that?
Tested on A100.
#include <cstdio>
#include <ctime>
// Total number of floats streamed from global memory: 2^30 elements (4 GiB).
constexpr int n = 1 << 30;
// Number of thread blocks launched (NOTE: despite the name, this is the grid
// size; each kernel is launched with 32 threads per block).
constexpr int BLOCK = 1024;
// Consecutive elements read per thread per inner iteration in load4.
constexpr int B = 4;
// Streams this block's contiguous slice of `src` with one 32-bit load per
// thread per iteration and writes one partial sum per thread to `dst`.
// Expected launch: <<<BLOCK, 32>>>; requires (n / BLOCK) % 32 == 0 and
// dst to hold at least BLOCK * 32 floats.
__global__ void load1(const float *__restrict__ src, float *__restrict__ dst)
{
    const int X = threadIdx.x;               // lane id within the 32-thread block
    const int st = blockIdx.x * (n / BLOCK); // start of this block's slice
    float sum = 0;
    // BUGFIX: the original read "#pragma runroll" (typo). nvcc silently
    // ignores unknown pragmas, so the loop was never unrolled as intended —
    // which also biases the load1-vs-load4 comparison this benchmark makes.
#pragma unroll
    for (int i = 0; i < n / BLOCK; i += 32)
    {
        // Adjacent threads read adjacent floats -> fully coalesced warp
        // transactions (contiguous 128-byte segments).
        sum += src[st + i + X];
    }
    dst[blockIdx.x * 32 + X] = sum;
}
// Streams this block's contiguous slice of `src`, each thread reading B
// consecutive floats per outer iteration, and writes one partial sum per
// thread to `dst`. Expected launch: <<<BLOCK, 32>>>; requires
// (n / BLOCK) % (B * 32) == 0 and dst to hold at least BLOCK * 32 floats.
// The B back-to-back 32-bit loads per thread may let the compiler emit a
// single wide (128-bit) load instruction — presumably why this variant
// benchmarks slightly faster than load1.
__global__ void load4(const float *__restrict__ src, float *__restrict__ dst)
{
    const int lane = threadIdx.x;
    const int base = blockIdx.x * (n / BLOCK);
    float acc = 0;
#pragma unroll
    for (int chunk = 0; chunk < n / BLOCK; chunk += B * 32)
    {
        // Each thread owns a run of B consecutive elements inside the chunk.
        const int off = base + chunk + lane * B;
#pragma unroll
        for (int k = 0; k < B; ++k)
            acc += src[off + k];
    }
    dst[blockIdx.x * 32 + lane] = acc;
}
// Benchmarks load1 vs load4 over 10 trials: zero-fills a 4 GiB input, times
// each kernel with wall-clock around a blocking sync, and prints
// "t1 t4 t1/t4" per trial.
int main()
{
    float *a, *b;
    if (cudaMalloc(&a, n * sizeof(float)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc(a) failed\n");
        return 1;
    }
    // BUGFIX: was cudaMalloc(&b, BLOCK * 32) — bytes, not floats, i.e. 4x too
    // small. Both kernels store BLOCK * 32 floats (dst[blockIdx.x * 32 + X]),
    // so three quarters of the writes landed out of bounds.
    if (cudaMalloc(&b, BLOCK * 32 * sizeof(float)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc(b) failed\n");
        cudaFree(a);
        return 1;
    }
    for (int i = 0; i < 10; i++)
    {
        clock_t st, en;
        cudaMemset(a, 0, n * sizeof(float));
        cudaDeviceSynchronize(); // make sure the memset is off the clock
        st = clock();
        load1<<<BLOCK, 32>>>(a, b);
        cudaDeviceSynchronize(); // launches are async; block before stopping the timer
        en = clock();
        clock_t t1 = en - st;
        cudaMemset(a, 0, n * sizeof(float));
        cudaDeviceSynchronize();
        st = clock();
        load4<<<BLOCK, 32>>>(a, b);
        cudaDeviceSynchronize();
        en = clock();
        clock_t t4 = en - st;
        // Surface launch/execution failures instead of printing bogus timings.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
        {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            break;
        }
        // clock_t's underlying type is implementation-defined; cast for %ld.
        printf("%ld %ld %lf\n", (long)t1, (long)t4, t1 * 1.0 / t4);
    }
    cudaFree(a);
    cudaFree(b);
}