Unbalanced Memory Read & Write

yz_chen1999 · June 29, 2023, 3:19am

I noticed an unbalanced memory read and write amount when profiling the underneath cuda kernel using ncu.

__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast<int*>(mem);

    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x)*32; offset < n; offset += blockDim.x * gridDim.x * 32) {
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;

    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);

    kernel<<<8192, 256>>>(mem, N);

    cudaFree(mem);

    return 0;
}

In ncu, it tells me that memory read is 305 MB while memory write is 1.07GB. I understand that there is global memory coalescing, but shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read?
And even if there is no global memory coalescing for memory read, shouldn’t the memory read amount be equal to around 128MB?

Thanks.

njuffa · June 29, 2023, 4:03am

If you look at the diagram, the numbers you cite refer to transfers between L1 and L2 cache, not reads and writes from device memory, which amount to 172 MB read and 136 MB written.

The amount of traffic between levels of the cache hierarchy will depend on any numbers of factors, such as the cache line length, replacement policy, write policy, cache coherency protocol, inclusive vs exclusive caches, possible victim cache strategy, sectored vs non-sectored organization etc. You would have to research these details for the caches of your particular GPU model (which hasn’t been mentioned unless I overlooked it), at which point the numbers reported probably would make sense.

yz_chen1999 · June 29, 2023, 8:12am

Thanks. I was running this code on A100 compiled by cu11.7. When I am digging in, I noticed that V100 GPU L1 cache has different granularities between loading and updating. But I do not see any similar analysis on A100. So I suppose this might be the reason why transfers between L1 and L2 cache differ?

Robert_Crovella · June 29, 2023, 3:09pm

The L1 on NVIDIA GPU architectures has typically been referred to as “write-through” (e.g. slide 43). This can result in a significant “imbalance” in L1<->L2 traffic for a “balanced” read/write code: writes have the potential to trigger traffic to the L2 on each write, reads have the potential to hit in L1, therefore not generating corresponding traffic to the L2.

Topic		Replies	Views
Memory transaction size CUDA Programming and Performance	1	1710	February 12, 2017
L1-L2-Global how to clearly describe their interaction for a given kernel CUDA Programming and Performance	3	2046	April 15, 2012
L2 cache read misses vs L2 cache write misses CUDA Programming and Performance	5	2446	February 5, 2014
why gld_efficiency only 50% when using L1 cache in Pascal CUDA Programming and Performance	0	525	September 21, 2018
Difference between L2 read/write transactions and L2_L1 read/write transactions ? CUDA Programming and Performance	3	1286	August 28, 2019
May I ask what is the granularity of memory access from GPU for cudamallochost? If I wish to measure its granularity, how should I proceed? CUDA Programming and Performance cuda	13	642	January 3, 2024
Unexpected Data Read Behavior on Tesla V100: Cache Line and Memory Access Patterns Nsight Compute	2	509	August 31, 2023
Cache line size of L1 and L2 CUDA Programming and Performance	3	19866	November 14, 2011
Cuda memory optimization for Jetson Tx1 CUDA Programming and Performance	1	430	August 3, 2018
The granularity of L1 and L2 caches CUDA Programming and Performance cuda	2	690	April 18, 2024

Unbalanced Memory Read & Write

Related topics