I am trying to understand these four ncu metrics:

lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_hit
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_miss
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss
So I ran some tests. Here is my kernel:
__global__ void LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel(unsigned int *input) {
    atomicAdd(&input[0], 10);  // all 32 threads atomically add to the same word
}
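The host side of my test looks roughly like this (a simplified sketch, not my exact code; the USE_SYSMEM switch is made up for illustration):

// Simplified sketch of the host-side harness (illustrative, not the exact code).
// Build with -DUSE_SYSMEM to target pinned host memory (sysmem) instead of
// device memory; with UVA, cudaHostAlloc memory is directly device-accessible.
#include <cuda_runtime.h>

__global__ void LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel(unsigned int *input) {
    atomicAdd(&input[0], 10);  // same kernel as above
}

int main() {
    unsigned int *input;
#ifdef USE_SYSMEM
    cudaHostAlloc(&input, sizeof(*input), cudaHostAllocDefault); // sysmem aperture
    *input = 0;
#else
    cudaMalloc(&input, sizeof(*input));                          // device aperture
    cudaMemset(input, 0, sizeof(*input));
#endif
    LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel<<<1, 32>>>(input);
    cudaDeviceSynchronize();
#ifdef USE_SYSMEM
    cudaFreeHost(input);
#else
    cudaFree(input);
#endif
    return 0;
}

I collected the counters with something like:
ncu --metrics lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_hit,lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_miss ./test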
The thread dimensions are (1,32), i.e. one block of 32 threads. I use cudaMalloc to test device memory and cudaHostAlloc to test sysmem. Here are my results:
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_hit result: 31
lts__t_sectors_srcnode_gpc_aperture_device_op_atom_dot_alu_lookup_miss result: 1
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit result: 0
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss result: 32
I used cuobjdump to check whether there is any difference between the two tests in the SASS, but found that they are identical.
So I think the different results are caused by the hardware. Can anyone tell me why atomic operations on sysmem always miss and never hit?
For reference, the description of one of these metrics is:
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit: # of LTS sectors from node GPC accessing system memory (sysmem) for atomic ALU (non-CAS) that hit
Referring to the memory hierarchy diagram available in Nsight Compute, we see that access to sysmem flows “through” the L2.
I think you may get a better/more authoritative answer on the ncu forum; however, my guess is as follows. The L2 cache on the device acts as a “proxy” for most global space accesses that would be backed by device memory. Therefore, in resolving an atomic in the L2, it is first necessary to determine whether the atomic target is in the L2 cache (a “hit”) or not (a “miss”).
System memory (i.e. host memory accessible because it is pinned/page-locked) is typically not cached in L2 by default. I don’t think this is well documented, but it’s possible to ascertain with a relatively simple microbenchmarking test. As a result, I would expect a global space access that targets sysmem to “never hit”, i.e. always “miss”, in the L2.
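To make that concrete, here is a toy model (plain host C++, not real hardware behavior or any ncu API) of how your four counter values could arise, assuming a device-aperture sector is allocated into L2 on first touch while a sysmem-aperture sector never is:

#include <cstdio>
#include <set>

int main() {
    const long sector = 0x1000 / 32;      // all 32 atomics land in one 32-byte sector

    for (int mode = 0; mode < 2; mode++) {
        const bool sysmem = (mode == 1);
        std::set<long> l2;                // sectors currently resident in L2
        int hit = 0, miss = 0;
        for (int t = 0; t < 32; t++) {            // same-address atomics serialize
            if (!sysmem && l2.count(sector)) {
                hit++;                            // device aperture: resident after first touch
            } else {
                miss++;
                if (!sysmem) l2.insert(sector);   // sysmem is never allocated into L2
            }
        }
        std::printf("%s: lookup_hit=%d lookup_miss=%d\n",
                    sysmem ? "sysmem" : "device", hit, miss);
    }
    return 0;
}

This prints device: lookup_hit=31 lookup_miss=1 and sysmem: lookup_hit=0 lookup_miss=32, which matches the split you measured.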
BTW, if you think carefully about what a sysmem atomic implies, you might not want it to ever “hit” in the L2.
Also, regarding your question here, note that the case there is a bit different: you’re not using atomics there. Based on my own testing, sysmem accesses are not cached in L2 but may be cached in L1. Atomics generally “bypass” the L1 and get “resolved” in the L2. But for “ordinary” accesses to sysmem, I believe it is possible to have “hits” (in the L1).
I did think about this: “atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.” I guess that in order to get the correct atomic result, the SM should not read sysmem data from the L2 cache.
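A contrived sketch of the hazard (deviceAdd, ctr, and the loop structure are made up for illustration): device atomics on mapped pinned memory are atomic among GPU threads but not with respect to the host’s plain read-modify-write, and if the GPU resolved them against a cached L2 copy, even that weaker guarantee would break.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void deviceAdd(unsigned int *ctr, int iters) {
    for (int i = 0; i < iters; i++)
        atomicAdd(ctr, 1);   // bypasses L1, resolved at/through the L2
}

int main() {
    unsigned int *ctr;
    cudaHostAlloc(&ctr, sizeof(*ctr), cudaHostAllocMapped);
    *ctr = 0;
    deviceAdd<<<1, 32>>>(ctr, 1000);
    // Host increments race with the device atomics; updates can be lost
    // because the host's += is not atomic with respect to the GPU.
    while (cudaStreamQuery(0) == cudaErrorNotReady)
        *((volatile unsigned int *)ctr) += 1;
    cudaDeviceSynchronize();
    std::printf("final: %u (can be less than 32000 plus the host increments)\n", *ctr);
    cudaFreeHost(ctr);
    return 0;
}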
Can you show me your test code? I found that the metrics about the L2 cache and sysmem give weird results, and I want to do more tests.
Sure, here is a test case. The various numbers are chosen to be “about right” for a Tesla V100:
$ cat t1965.cu
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL

// microsecond-resolution host timer
unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
#ifndef USE_VOLATILE
__global__ void k(T *d, size_t sz, int loops){
#else
__global__ void k(volatile T *d, size_t sz, int loops){
#endif
  T val = 1;
  // grid-stride loop, re-reading the whole buffer `loops` times
  for (int i = 0; i < loops; i++)
    for (size_t idx = threadIdx.x+blockDim.x*blockIdx.x; idx < sz; idx += gridDim.x*blockDim.x)
      val += d[idx];
  if (val == 0) d[threadIdx.x] = val; // prevents the reads from being optimized away
}

int main(int argc, char *argv[]){
  int *d;
  int fs = 8192*80;
  size_t footprint = fs*sizeof(d[0]);
  int loops = 64;
  cudaHostAlloc(&d, footprint, cudaHostAllocDefault); // pinned host allocation (sysmem)
  memset(d, 1, footprint);
  unsigned long long dt = dtime_usec(0);
  k<<<80, 1024>>>(d, fs, loops);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess)
    std::cout << "bandwidth: " << (footprint*loops)/(float)dt << "MB/s" << std::endl;
  else
    std::cout << "cuda error: " << cudaGetErrorString(err) << std::endl;
  return 0;
}
$ nvcc -arch=sm_70 -o t1965 t1965.cu -lineinfo
$ ./t1965
bandwidth: 493448MB/s
$ nvcc -arch=sm_70 -o t1965 t1965.cu -lineinfo -DUSE_VOLATILE
$ ./t1965
bandwidth: 11909.7MB/s
$ /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla V100-PCIE-32GB
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.7
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 724.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
When I don’t use the volatile qualifier, the observed read bandwidth to sysmem is ~500GB/s. This obviously exceeds whatever the PCIe bus can supply, so a cache of some sort must be involved. When I add the volatile qualifier to the data, the bandwidth drops to ~12GB/s, which happens to be the PCIe bus bandwidth, so I assume no caching is in effect. Since the volatile keyword causes memory traffic to bypass the L1, I conclude that sysmem reads may be cached in L1 but don’t appear to be cached in L2.
My kernel just reads 32 float values from input and writes them to output (a sketch of such a kernel appears below). 32 floats are 128 bytes, i.e. four 32-byte sectors, and each warp-wide access is one L2 request, so the first read should add 4 sector read misses and the first write another 4 sector write misses. Therefore I expect:

lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss: 8 (4 read misses + 4 write misses)
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit: 0
lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_hit: 0
lts__t_requests_srcnode_gpc_aperture_sysmem_lookup_miss: 2
But the ncu profiling results are different. Can you explain why there are 4 sector lookup_hits instead of another 4 lookup_misses?
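For reference, a minimal sketch of the kind of copy kernel I mean (copy32 and the parameter names are illustrative, not my exact code):

// Illustrative copy kernel: one warp reads 32 floats (128 B = 4 sectors)
// from input and writes 32 floats (another 4 sectors) to output.
__global__ void copy32(const float *input, float *output) {
    output[threadIdx.x] = input[threadIdx.x];
}
// launched as: copy32<<<1, 32>>>(input, output);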
In order to find out where the 4 sector lookup_hits come from, I changed my kernel so that it only writes data to output:
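(I didn’t paste the modified kernel; it is along these lines, with illustrative names:)

// Illustrative write-only kernel: one warp writes 32 floats = 4 sectors,
// with no reads from input at all.
__global__ void write32(float *output) {
    output[threadIdx.x] = 1.0f;
}
// launched as: write32<<<1, 32>>>(output);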
The results show that there are 4 sector lookup_hits, which does explain my question about the first kernel. But why are there 4 sector hits together with 1 lookup_miss request? Can you explain why? Is this a bug in ncu?