Why do atomic operations on sysmem always miss?

I am trying to understand these four ncu metrics:
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_hit
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_miss
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss
So I did some tests. Here is my kernel:

__global__ void LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel(unsigned int *input) {
    atomicAdd(&input[0], 10);
}

The launch configuration is <<<1, 32>>> (1 block of 32 threads). I use cudaMalloc to test device memory and cudaHostAlloc to test sysmem. Here are my results:
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_hit result: 31
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_miss result: 1
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit result: 0
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss result: 32
I used cuobjdump to check whether there is any difference between the two tests in the SASS, but they turn out to be identical.
So I think the different results are caused by the hardware. Can anyone tell me why atomic operations on sysmem always miss and never hit?

What does this metric measure? Cache hits / misses?

If I run the following command on a system with Tesla V100:

$ ncu --query-metrics |grep lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit

I get some description:

lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit       # of LTS sectors from node GPC accessing system memory (sysmem) for atomic ALU (non-CAS) that hit

Referring to the memory hierarchy diagram that is available in Nsight Compute:

[Nsight Compute memory hierarchy diagram]

we see that the access to sysmem flows “through” the L2.

I think you may get a better/more authoritative answer on the ncu forum; however, my guess is as follows. The L2 cache on the device acts as a “proxy” for most global space accesses that would be backed by device memory. Therefore, in resolving an atomic in the L2, it would first be necessary to determine whether the atomic target is in the L2 cache (a “hit”) or not (a “miss”). That would be consistent with your device-memory result: the first atomic to the target line misses, and the remaining 31 serialized atomics then find the line already in the L2 and hit.

System memory (i.e. host memory accessible because it is pinned/page-locked) is typically not cached in L2 by default. I don’t think this is well documented, but it’s possible to ascertain with a relatively simple microbenchmarking test. As a result, I would expect a global space access that targets sysmem to “never hit”, i.e. always “miss” (in the L2).

BTW, if you think carefully about what a sysmem atomic implies, you might not want it to ever “hit” in the L2.

Also, regarding your question here, note that the case there is a bit different. You’re not using atomics there. Based on my own testing, sysmem accesses are not cached in L2 but may be cached in L1. Atomics generally “bypass” the L1 and get “resolved” in the L2. But for “ordinary” accesses to sysmem, I believe it is possible to have “hits” (in the L1).
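If you want to verify the L1 behavior for ordinary sysmem loads yourself, I believe the analogous l1tex lookup metrics for global loads will show it, something like this (I haven’t run this exact command here, so check the metric names with ncu --query-metrics on your setup, and substitute your own binary):

$ ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss.sum ./your_app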

I did think about this: the documentation says that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices. I guess that in order to get a correct atomic result, the SM should not read sysmem data from the L2 cache; if the L2 kept a cached copy, the host could change the location in host memory while the GPU keeps operating on a stale copy.

Can you show me your test code? I found that the metrics about the L2 cache and sysmem give weird results, and I want to do more tests.

Sure, here is a test case. The various numbers are chosen to be “about right” for a Tesla V100:

$ cat t1965.cu
#include <iostream>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


template <typename T>
#ifndef USE_VOLATILE
__global__ void k(T *d, size_t sz, int loops){
#else
__global__ void k(volatile T *d, size_t sz, int loops){
#endif
  T val = 1;
  for (int i = 0; i < loops; i++)
    for (size_t idx = threadIdx.x+blockDim.x*blockIdx.x; idx < sz; idx += gridDim.x*blockDim.x)
      val += d[idx];
  if (val == 0) d[threadIdx.x] = val;
}
int main(int argc, char *argv[]){

  int *d;
  int fs = 8192*80;
  size_t footprint = fs*sizeof(d[0]);
  int loops = 64;
  cudaHostAlloc(&d, footprint, cudaHostAllocDefault);
  memset(d, 1, footprint);
  unsigned long long dt = dtime_usec(0);
  k<<<80, 1024>>>(d, fs, loops);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if( err == cudaSuccess)
    std::cout << "bandwidth: " << (footprint*loops)/(float)dt << "MB/s" << std::endl;
  else
    std::cout << "cuda error: " << cudaGetErrorString(err) << std::endl;
  return 0;
}
$ nvcc -arch=sm_70 -o t1965 t1965.cu -lineinfo
$ ./t1965
bandwidth: 493448MB/s
$ nvcc -arch=sm_70 -o t1965 t1965.cu -lineinfo -DUSE_VOLATILE
$ ./t1965
bandwidth: 11909.7MB/s
$ /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-PCIE-32GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     12.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     12.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     724.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

When I don’t use the volatile decorator, the observed read bandwidth to sysmem is ~500GB/s. This obviously exceeds whatever the PCIE bus can supply, so a cache of some sort must be involved. When I add the volatile decorator to the data, the bandwidth drops to ~12GB/s which happens to be the PCIE bus bandwidth. Therefore I assume that no caching is in effect. Since the volatile keyword causes memory traffic to bypass the L1, I conclude that sysmem reads may be cached in L1 but don’t appear to be cached in L2.
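As a further cross-check (a sketch I did not run as part of this test), the volatile trick could be replaced by the __ldcg() intrinsic, which issues loads with the .cg cache operator, i.e. cached in L2 only, bypassing the L1:

template <typename T>
__global__ void k_ldcg(T *d, size_t sz, int loops){
  T val = 1;
  for (int i = 0; i < loops; i++)
    for (size_t idx = threadIdx.x+blockDim.x*blockIdx.x; idx < sz; idx += gridDim.x*blockDim.x)
      val += __ldcg(d+idx);            // .cg load: cached in L2 only, bypasses the L1
  if (val == 0) d[threadIdx.x] = val;  // same dead-store guard as the original kernel
}

Launched the same way as k above, I would expect this variant to report roughly the PCIE-limited ~12GB/s figure if the caching effect really lives in the L1.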

While doing some other tests, I found some other weird metric results.
Test 1, launch configuration <<<1, 32>>> (1 block of 32 threads):

__global__ void Kernel1(float *input, float *output) {
    output[threadIdx.x] = input[threadIdx.x];
}

void test1() {
    int thread_num = 32;
    float *d_src = nullptr;
    float *d_res = nullptr;
    cudaHostAlloc(&d_src, thread_num * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&d_res, thread_num  * sizeof(float), cudaHostAllocDefault);
    CUDA_CHECK(cudaSetDevice(0));
    // kernel
    Kernel1<<<1, thread_num>>>(d_src, d_res);
    CUDA_CHECK(cudaStreamSynchronize(0));
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaFreeHost(d_src));
    CUDA_CHECK(cudaFreeHost(d_res));
}

Collected metrics result:

 sudo ncu --metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum,lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum,lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss.sum,lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit.sum  bin/test_lts_sector_GPC 
==PROF== Connected to process 6073 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling "Kernel1(float *, float *)" - 1: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 6073
[6073] test_lts_sector_GPC@127.0.0.1
  Kernel1(float *, float *), 2022-Feb-09 19:39:30, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit.sum                     request                              0
    lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss.sum                    request                              2
    lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum                       sector                              4
    lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum                      sector                              4
    ---------------------------------------------------------------------- --------------- ------------------------------

The kernel just reads 32 floats from input and writes them to output. The first read should add 4 sector read misses, and the first write should also add 4 sector write misses (see the sector arithmetic below), so I expected:

lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss should be 8 sectors (4 read misses + 4 write misses);
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit should be 0;
lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit should be 0;
lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss should be 2.

But the ncu profiling results are different. Can you explain why there are 4 sectors lookup_hit instead of another 4 sectors lookup_miss?
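For reference, the expected numbers come from this sector arithmetic (assuming 32-byte L2 sectors and fully coalesced warp accesses):

32 threads x 4 bytes (float)      = 128 bytes per warp access
128 bytes / 32 bytes per sector   = 4 sectors per access
1 load (input) + 1 store (output) = 2 requests -> 8 sector misses expected, 0 hits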

In order to find out where the 4 sectors lookup_hit come from, I changed my kernel to only write data to output:

__global__ void Kernel1(float *input, float *output) {
    output[threadIdx.x] = threadIdx.x;
}

void Test1() {
    int thread_num = 32;
    float *d_src = nullptr;
    float *d_res = nullptr;
    cudaHostAlloc(&d_src, thread_num * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&d_res, thread_num * sizeof(float), cudaHostAllocDefault);
    CUDA_CHECK(cudaSetDevice(0));
    // kernel
    Kernel1<<<1, thread_num>>>(d_src, d_res);
    CUDA_CHECK(cudaStreamSynchronize(0));
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaFreeHost(d_src));
    CUDA_CHECK(cudaFreeHost(d_res));
}

Collected metrics result:

sudo ncu --metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum,lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum,lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss.sum,lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit.sum  bin/test_lts_sector_GPC 
==PROF== Connected to process 11346 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling "Kernel1(float *, float *)" - 1: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 11346
[11346] test_lts_sector_GPC@127.0.0.1
  Kernel1(float *, float *), 2022-Feb-09 19:49:50, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit.sum                     request                              0
    lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss.sum                    request                              1
    lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum                       sector                              4
    lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum                      sector                              0
    ---------------------------------------------------------------------- --------------- ------------------------------

The results show that there are 4 sector lookup_hits, which does explain my question about the first kernel. But why are there 4 sector hits together with 1 lookup_miss request? Can you explain why?
Is this a bug in ncu?
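One thing I want to try next is to list the per-op variants of these sysmem lookup metrics (if my GPU exposes them) and profile them separately, to see which operation the hit sectors are attributed to:

$ ncu --query-metrics | grep lts__t_sectors_srcnode_gpc_aperture_sysmem_op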

I did some more tests and found that:
sysmem and global memory reads (non-atomic) are cached in L1, but atomic operations on sysmem are not cached in L1 or L2.

Atomics on system memory really do seem not to be cached in L1. Here is my test code:

#include <cuda_runtime.h>
#include <iostream>
#include <stdexcept>
#include <string>

#define CUDA_CHECK(status)                                                                                           \
    do {                                                                                                             \
        auto ret = (status);                                                                                         \
        if (ret != 0) {                                                                                              \
            throw std::runtime_error("cuda failure: " + std::to_string(ret) + " (" + cudaGetErrorString(ret) + ")" + \
                                     " at " + __FILE__ + ":" + std::to_string(__LINE__));                            \
        }                                                                                                            \
    } while (0)

#define CUDA_FREE(device_ptr)     \
    do {                          \
        if (device_ptr) {         \
            cudaFree(device_ptr); \
        }                         \
    } while (0)

#define CUDA_FREE_HOST(device_ptr)    \
    do {                              \
        if (device_ptr) {             \
            cudaFreeHost(device_ptr); \
        }                             \
    } while (0)

__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel0(unsigned int *input,
                                                                             unsigned int *output) {
    output[threadIdx.x] = atomicAdd(&input[0], 10);
}

__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel1(unsigned int *input,
                                                                             unsigned int *output) {
    output[threadIdx.x] = atomicMax(&input[0], 10);
}

void LtsTSectorsSrcnodeGpcApertureSysmemOpAtomDotAluLookupHitMiss() {
    const int test_num = 2;
    int grid_num[test_num]{1, 1};
    int thread_num[test_num]{32, 32};

    unsigned int *d_src[test_num] = {nullptr};
    unsigned int *d_res[test_num] = {nullptr};

    for (int i = 0; i < 2; i++) {
        dim3 block(thread_num[i]);
        dim3 grid(grid_num[i]);

        // CUDA_CHECK(cudaMalloc(&d_src[i], grid_num[i] * thread_num[i] * sizeof(unsigned int)));
        // CUDA_CHECK(cudaMalloc(&d_res[i], grid_num[i] * thread_num[i] * sizeof(unsigned int)));
        CUDA_CHECK(cudaHostAlloc(&d_src[i], grid_num[i] * thread_num[i] * sizeof(unsigned int), cudaHostAllocDefault));
        CUDA_CHECK(cudaHostAlloc(&d_res[i], grid_num[i] * thread_num[i] * sizeof(unsigned int), cudaHostAllocDefault));
        CUDA_CHECK(cudaSetDevice(0));
        // kernel
        switch (i) {
            case 0: {
                LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel0<<<grid, block>>>(d_src[i], d_res[i]);
                break;
            }

            case 1: {
                LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel1<<<grid, block>>>(d_src[i], d_res[i]);
                break;
            }
            default:
                break;
        }

        CUDA_CHECK(cudaStreamSynchronize(0));
        CUDA_CHECK(cudaGetLastError());
        // CUDA_FREE(d_src[i]);
        // CUDA_FREE(d_res[i]);
        CUDA_FREE_HOST(d_src[i]);
        CUDA_FREE_HOST(d_res[i]);
    }
}
int main() {
    std::cout << "lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit "
                 "lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss test"
              << std::endl;
    CUDA_CHECK(cudaSetDevice(0));
    LtsTSectorsSrcnodeGpcApertureSysmemOpAtomDotAluLookupHitMiss();
    return 0;
}

test result:

sudo ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_atom,l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit,l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss  bin/test_lts_sector_GPC 
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss test
==PROF== Connected to process 19097 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling "LtsTSectorsSrcnodeGpcAperture..." - 1: 0%....50%....100% - 1 pass
==PROF== Profiling "LtsTSectorsSrcnodeGpcAperture..." - 2: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 19097
[19097] test_lts_sector_GPC@127.0.0.1
  LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel0(unsigned int *, unsigned int *), 2022-Feb-15 17:01:54, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.avg                                sector                           0,01
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.max                                sector                              1
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.min                                sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum                                sector                              1
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.avg                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.max                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.min                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.sum                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.avg                    sector                           0,01
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.max                    sector                              1
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.min                    sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.sum                    sector                              1
    ---------------------------------------------------------------------- --------------- ------------------------------

  LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel1(unsigned int *, unsigned int *), 2022-Feb-15 17:01:54, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.avg                                sector                           0,47
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.max                                sector                             32
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.min                                sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum                                sector                             32
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.avg                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.max                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.min                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.sum                     sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.avg                    sector                           0,47
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.max                    sector                             32
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.min                    sector                              0
    l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.sum                    sector                             32
    ---------------------------------------------------------------------- --------------- ------------------------------


The l1tex metrics for atomics are also all misses, no hits.
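In case anyone wants to reproduce this, the L2 and L1 atomic lookup metrics quoted in this thread can be collected in one ncu invocation (metric names as used above; adjust the binary path to your own build):

sudo ncu --metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit.sum,lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_miss.sum bin/test_lts_sector_GPC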