Is there a difference between atomicAdd and atomicExch in hardware implementation?
Hi,
I got a problem using NCU to profile metric “l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom”.
NCU out put is difference between atomicAdd and atomicExch.
Device: NVIDIA GeForce RTX 3070 (GA104)
Code:
#include <cuda_wrapper.h>
#include <stdio.h>
#define BLOCK_SIZE 256
#define RESTRICTION_SIZE 32
__global__ void AtomicOnGlobalMem(int *data, int nElem)
{
unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
for (unsigned int i = tid; i < nElem; i += blockDim.x * gridDim.x)
{
atomicAdd(data + i, 6); //pct=0 ratio=0
//atomicExch(data + i, 6); //pct=12.50 ratio=4
}
}
int main(void)
{
const int n = 2 << 24;
int *data = new int[n];
int i;
for (i = 0; i < n; i++)
{
data[i] = i % 1024 + 1;
}
int *dev_data;
cudaMalloc((void **)&dev_data, sizeof(int) * size_t(n));
cudaMemset(dev_data, 0, sizeof(int) * size_t(n));
cudaMemcpy(dev_data, data, n * sizeof(int), cudaMemcpyHostToDevice);
delete []data;
for (int i = 0; i < 1; i++)
{
dim3 blocksize(BLOCK_SIZE);
dim3 griddize((12 * 2048) / BLOCK_SIZE);
AtomicOnGlobalMem<<<griddize, blocksize>>>(dev_data, n);
cudaPeekAtLastError();
}
cudaDeviceSynchronize();
cudaFree(dev_data);
cudaDeviceReset();
return 0;
}
command:
sudo $(which ncu) --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom ./bin/case_atom
atomicAdd() output:
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.pct % 0
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.ratio sector/request 0
---------------------------------------------------------------------- --------------- ------------------------------
atomicExch() output:
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.pct % 12.50
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_atom.ratio sector/request 4
---------------------------------------------------------------------- --------------- ------------------------------
Further more, I tested other atomic operations, atomicCAS is same as atomicExch, and others are like atomicAdd.
What’s the reason for that?