Unbalanced Memory Read & Write

I noticed an unbalanced amount of memory read versus memory written when profiling the CUDA kernel below with ncu (Nsight Compute).

#include <cuda_runtime.h>

// Each thread walks the buffer in a grid-stride loop over 32-int segments and
// copies the upper 16 ints of each segment onto its lower 16 ints.
__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast<int*>(mem);

    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x) * 32; offset < n; offset += blockDim.x * gridDim.x * 32) {
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;

    // 1024 * 256 * 256 = 67,108,864 ints = 256 MiB of device memory
    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);

    kernel<<<8192, 256>>>(mem, N);

    cudaFree(mem);

    return 0;
}
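
For reference, I build and profile roughly like this (the file and binary names are just placeholders):

nvcc -o membench membench.cu
ncu ./membench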

ncu reports 305 MB of memory read but 1.07 GB of memory written. I understand that there is global memory coalescing, but shouldn't the read and write amounts both be approximately 1 GB, instead of only 305 MB read?
And even if the reads are not coalesced at all, shouldn't the amount read be around 128 MB?
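
My own back-of-the-envelope arithmetic, assuming each element is touched exactly once: N = 1024 * 256 * 256 = 67,108,864 ints, i.e. a 256 MiB buffer; each 32-int segment has its upper 16 ints read and its lower 16 ints written, so naively I would expect roughly 128 MiB read and 128 MiB written.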

Thanks.

If you look at the diagram, the numbers you cite refer to transfers between the L1 and the L2 cache, not to reads and writes from device memory, which amount to 172 MB read and 136 MB written.
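
If you want to look at those counters directly, metrics along these lines should separate L2 traffic from device-memory traffic (metric names as of recent Nsight Compute versions; ncu --query-metrics lists what is available on your GPU, and ./your_app is a placeholder; the lts metrics count sectors, which are 32 bytes each on recent architectures):

ncu --metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum,dram__bytes_read.sum,dram__bytes_write.sum ./your_app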

The amount of traffic between levels of the cache hierarchy will depend on any number of factors, such as the cache line length, replacement policy, write policy, cache coherency protocol, inclusive vs. exclusive caches, possible victim cache strategy, sectored vs. non-sectored organization, etc. You would have to research these details for the caches of your particular GPU model (which hasn't been mentioned unless I overlooked it), at which point the reported numbers would probably make sense.

Thanks. I was running this code on an A100, compiled with CUDA 11.7. While digging in, I noticed that the V100's L1 cache has different granularities for loads and updates, but I do not see any similar analysis for the A100. So I suppose this might be why the transfers between L1 and L2 differ?

The L1 on NVIDIA GPU architectures has typically been referred to as "write-through" (e.g. slide 43). This can result in a significant "imbalance" in L1<->L2 traffic for a "balanced" read/write code: writes have the potential to trigger traffic to the L2 on every write, whereas reads have the potential to hit in the L1 and therefore generate no corresponding traffic to the L2.
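
As a rough illustration, here is a toy kernel of my own (not your code, and the behavior described in the comments assumes the write-through L1 mentioned above): every thread reads the same small window, so after the first miss those loads can be served by the L1, while each thread's store has to go through toward the L2. Profiling something like this with ncu should show far more L1<->L2 write traffic than read traffic, even though the kernel issues many more load instructions than store instructions.

__global__ void write_through_demo(const int* in, int* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    // Every thread reads the same 32 ints (one 128-byte line): after the first
    // miss these loads can hit in the L1, so little read traffic reaches the L2.
    int v = 0;
    for (int i = 0; i < 32; i++)
        v += in[i];

    // Each thread writes its own element: with a write-through L1, each store
    // generates traffic toward the L2, so writes dominate the L1<->L2 counters.
    out[idx] = v;
}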