The atomicCAS function used on managed memory performs very badly after the driver is advised that the memory will be accessed by the device with:

cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

Here's the reproducer; commenting out the cuMemAdvise line changes the performance drastically.
#include <cuda.h>
#include <cstdio>

// Every thread attempts a CAS on the same managed word.
__global__ void testKernel(int* var) {
    atomicCAS(var, threadIdx.x, threadIdx.x + 1);
}

int main() {
    CUdeviceptr var;
    size_t memSize{sizeof(int)};

    cuInit(0);
    int iDevice{0};
    CUdevice dev;
    cuDeviceGet(&dev, iDevice);
    CUcontext ctx;
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    cuMemAllocManaged(&var, memSize, CU_MEM_ATTACH_GLOBAL);
    // Commenting out this line changes the performance drastically.
    cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

    int* h_var = reinterpret_cast<int*>(var);
    *h_var = 0;

    testKernel<<<64, 256>>>(h_var);
    cuCtxSynchronize();

    printf("output: %d\n", *h_var);

    cuMemFree(var);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
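Error checking is omitted above for brevity; to rule out a silently failing call, a minimal wrapper for the driver calls would look something like this (the CHECK_CU name is just illustrative, not from my actual test):

#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with the error name if a driver API call fails.
#define CHECK_CU(call)                                                   \
    do {                                                                 \
        CUresult err_ = (call);                                          \
        if (err_ != CUDA_SUCCESS) {                                      \
            const char* name_ = nullptr;                                 \
            cuGetErrorName(err_, &name_);                                \
            fprintf(stderr, "%s failed: %s\n", #call,                    \
                    name_ ? name_ : "unknown error");                    \
            exit(1);                                                     \
        }                                                                \
    } while (0)

// Usage: CHECK_CU(cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev));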
Compiled with nvcc -o test test.cu -lcuda, using CUDA 12.5 on Ubuntu 22.04, running on a GeForce RTX 3060.

Checking with nsys profile --stats=true ./test, without the cuMemAdvise I get:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- -----------------
100.0 123,170 1 123,170.0 123,170.0 123,170 123,170 0.0 testKernel(int *)
and with the cuMemAdvise I get:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- -----------------
100.0 8,158,069 1 8,158,069.0 8,158,069.0 8,158,069 8,158,069 0.0 testKernel(int *)
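The kernel can also be timed in-application with CUDA events instead of the profiler; a minimal sketch that replaces the plain launch in the reproducer above (same h_var as before, runtime API available since the file is compiled with nvcc):

// Sketch: time the kernel with CUDA events (drop-in for the plain launch above).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
testKernel<<<64, 256>>>(h_var);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);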
Checking with Nsight Compute, I see that the version with cuMemAdvise writes a lot more to device global memory (the baseline being the version without cuMemAdvise).
I checked that other atomic functions like add/sub/max only slow down by around 30%, but for some reason atomicCAS slows down by a factor of more than 50x.
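For reference, the comparison kernels are one-line variants of testKernel along these lines (the exact operands here are illustrative, not copied from my test):

// Illustrative comparison kernels: same launch configuration, only the atomic op differs.
__global__ void testAdd(int* var) {
    atomicAdd(var, 1);
}

__global__ void testMax(int* var) {
    atomicMax(var, static_cast<int>(threadIdx.x));
}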
Is it expected at all that advising the driver that managed memory will be accessed by the device slows down atomic operations on that memory from a kernel running on that same device? Why is atomicCAS affected so drastically?