Is there a difference between atomicAdd and atomicExch in hardware implementation?
Hi,
I ran into a problem using NCU to profile the metric “l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom”.
The NCU output differs between atomicAdd and atomicExch.

Device: NVIDIA GeForce RTX 3070 (GA104)
Code:

#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK_SIZE 256
#define RESTRICTION_SIZE 32

__global__ void AtomicOnGlobalMem(int *data, int nElem)
{
    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (unsigned int i = tid; i < nElem; i += blockDim.x * gridDim.x)
    {
        atomicAdd(data + i, 6);        //pct=0     ratio=0 
        //atomicExch(data + i, 6);     //pct=12.50 ratio=4
    }
}

int main(void)
{
    const int n = 2 << 24;
    int *data = new int[n];

    int i;
    for (i = 0; i < n; i++)
    {
        data[i] = i % 1024 + 1;
    }

    int *dev_data;
    cudaMalloc((void **)&dev_data, sizeof(int) * size_t(n));
    cudaMemset(dev_data, 0, sizeof(int) * size_t(n));
    cudaMemcpy(dev_data, data, n * sizeof(int), cudaMemcpyHostToDevice);

    delete []data;

    for (int i = 0; i < 1; i++)
    {
        dim3 blocksize(BLOCK_SIZE);
        dim3 gridsize((12 * 2048) / BLOCK_SIZE);
        AtomicOnGlobalMem<<<gridsize, blocksize>>>(dev_data, n);
        cudaPeekAtLastError();
    }
    cudaDeviceSynchronize();

    cudaFree(dev_data);
    cudaDeviceReset();
    
    return 0;
}

command:

sudo $(which ncu) --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom ./bin/case_atom

atomicAdd() output:

---------------------------------------------------------------------- --------------- ------------------------------
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.pct                 %                              0
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.ratio  sector/request                              0
---------------------------------------------------------------------- --------------- ------------------------------

atomicExch() output:

---------------------------------------------------------------------- --------------- ------------------------------
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.pct                 %                          12.50
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.ratio  sector/request                              4
---------------------------------------------------------------------- --------------- ------------------------------

Furthermore, I tested the other atomic operations: atomicCAS behaves the same as atomicExch, and the rest behave like atomicAdd.

What’s the reason for that?

hello, please help…

Nsight Compute's performance metrics are based not on the high-level CUDA C code but on the actually executed SASS instructions. You can inspect these yourself, e.g. with the nvdisasm tool shipped with the CUDA toolkit, or even more easily by opening your Nsight Compute report in ncu-ui and checking the Source page. When the application is compiled with -lineinfo, Nsight Compute can correlate the high-level CUDA C with the corresponding SASS.
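
For example, assuming the source file is named case_atom.cu (a hypothetical name) and targeting GA104 (compute capability 8.6), compiling with line info and searching the SASS for atomic/reduction instructions might look like:

```shell
# Compile with source-to-SASS correlation info for Nsight Compute
nvcc -lineinfo -arch=sm_86 -o case_atom case_atom.cu

# Dump the SASS and look for atomic (ATOM*) vs. reduction (RED) instructions
cuobjdump -sass case_atom | grep -E "ATOM|RED"
```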

When testing your app on my GPU and CUDA toolkit, I see that the compiler implements atomicExch as ATOMG.E.EXCH, while atomicAdd is implemented as RED.E.ADD. Reductions and atomics are counted by different HW metrics, so you would see it counted under l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_red instead. For simplicity, you can collect e.g.
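
A likely reason the compiler can emit RED here is that the kernel ignores atomicAdd's return value, so the operation can be lowered to a "fire-and-forget" reduction; atomicExch and atomicCAS are used for their return value and stay true atomics. As an untested sketch (kernel name and the old-value buffer are hypothetical), consuming the return value should force an ATOMG.E.ADD instead of RED.E.ADD:

```cuda
// Variant of the original kernel that stores the old value returned by
// atomicAdd. Because a reduction does not return data to the thread, the
// compiler should emit a true atomic (ATOMG.E.ADD) rather than RED.E.ADD.
__global__ void AtomicOnGlobalMemWithReturn(int *data, int *old, int nElem)
{
    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (unsigned int i = tid; i < nElem; i += blockDim.x * gridDim.x)
    {
        old[i] = atomicAdd(data + i, 6);  // return value used -> atomic, not reduction
    }
}
```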

--metrics regex:l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_.*pct

to get the overview across all possible lsu_mem_global_op variants at the same time.

l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.pct                 %                              0
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.pct                   %                              0
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_red.pct                  %                          12.50
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.pct                   %                              0

Oh, that’s insightful!
Thanks!
