atomicCAS poor performance with managed memory after cuMemAdvise

The atomicCAS function performs very badly on managed memory after the driver is advised that the memory will be accessed by the device with:

cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

Here’s the reproducer - commenting out the line above changes the performance drastically.

#include <cuda.h>
#include <cstdio>

// Every thread attempts a compare-and-swap on the same location
__global__ void testKernel(int* var) {
  atomicCAS(var, threadIdx.x, threadIdx.x + 1);
}

int main() {

  CUdeviceptr var;
  unsigned int memSize{sizeof(int)};
  cuInit(0);

  int iDevice{0};
  CUdevice dev;
  cuDeviceGet(&dev, iDevice);
  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, dev);
  cuCtxSetCurrent(ctx);

  // Allocate managed memory, then advise the driver that it will be accessed by the device.
  // Commenting out the cuMemAdvise call is what changes the performance drastically.
  cuMemAllocManaged(&var, memSize, CU_MEM_ATTACH_GLOBAL);
  cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

  int* h_var = reinterpret_cast<int*>(var);
  *h_var = 0;

  testKernel<<<64,256>>>(h_var);

  cuCtxSynchronize();

  printf("output: %d\n", *h_var);

  cuMemFree(var);

  return 0;
}

Compiled with nvcc -o test test.cu -lcuda, using CUDA 12.5 on Ubuntu 22.04, running on a GeForce RTX 3060.

Checking with nsys profile --stats=true ./test, without the cuMemAdvise I get:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)        Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------------
    100.0          123,170          1  123,170.0  123,170.0   123,170   123,170          0.0  testKernel(int *)

and with the cuMemAdvise I get:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)        Name       
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -----------------
    100.0        8,158,069          1  8,158,069.0  8,158,069.0  8,158,069  8,158,069          0.0  testKernel(int *)

Checking with Nsight Compute, I see that the version with cuMemAdvise writes a lot more to device global memory (the baseline being the version without cuMemAdvise).

I checked that other atomic functions such as atomicAdd/atomicSub/atomicMax only slow down by around 30%, but for some reason atomicCAS slows down by a factor of more than 50x.
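For reference, the comparison can be done with kernels along these lines (a rough sketch, not the exact code; same <<<64,256>>> launch on the same managed int as above):

// slows down by roughly 30% with the cuMemAdvise hint
__global__ void testAdd(int* var) {
  atomicAdd(var, 1);
}

// slows down by roughly 30% with the cuMemAdvise hint
__global__ void testMax(int* var) {
  atomicMax(var, static_cast<int>(threadIdx.x));
}

// slows down by more than 50x with the cuMemAdvise hint
__global__ void testCAS(int* var) {
  atomicCAS(var, threadIdx.x, threadIdx.x + 1);
}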

Is it expected at all that advising the driver that managed memory will be used mainly from the device slows down atomic operations on that memory from a kernel running on that same device? Why is atomicCAS affected so drastically?

I won’t be able to offer a precise explanation, because AFAIK NVIDIA does not document the precise mechanism for atomics, including system atomics. But the behavior does not surprise me.

CU_MEM_ADVISE_SET_ACCESSED_BY is the wrong hint to use here (evidently).

It is a hint that says, roughly:

“Leave the data where it is, and build a mapping so that this GPU can access that data over a relevant link (e.g. PCIE or NVLink).”

We can argue over that word-by-word, perhaps, but the description explicitly says:

This advice does not cause data migration and has no impact on the location of the data per se.

It also explicitly says its purpose is to build a “mapping” so that the referenced GPU can access it wherever it may be.

What that means is that, instead of migrating the data to the processor of interest (the GPU), the mapping created by the hint lets the GPU access the data “wherever it is”, and there is no specific requirement or impetus for the managed memory (MM) system to migrate the data.

If the data is migrated, then the atomic operation would be performed on the data in “local” device memory, pretty much like any other atomic on non-managed device memory.

If the data is not migrated, the atomic has to be done as a system atomic, which involves generating some sort of sequence of bus cycles on the relevant bus (PCIE or NVLink). This is evidently, and sensibly, going to be slower.

Without that hint, the typical behavior of the MM subsystem is to migrate the data to the processor that touches it.

With the hint, the MM subsystem has the option to not migrate the data.

You could probably avoid the slowdown, while still providing that hint, by explicitly migrating the data (cuMemPrefetchAsync) before doing the atomics.
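For example, in your reproducer, something like this before the kernel launch (an untested sketch, error checking omitted):

  int* h_var = reinterpret_cast<int*>(var);
  *h_var = 0;

  // Explicitly migrate the managed allocation to the device before launching the kernel
  cuMemPrefetchAsync(var, memSize, dev, /*hStream=*/0);
  cuCtxSynchronize();  // make sure the migration has completed

  testKernel<<<64,256>>>(h_var);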

Anyway, although the actual behavior of the MM system is not specified to this level of detail (for valid reasons), the result is that the MM system has options, and you have specifically enabled an option that results in slower behavior.

In that sense, I would say it is “expected” behavior, or at least “plausible” and “valid” behavior.

atomicCAS is evidently (or probably) a special case. I don’t know the details, but system-level atomics like atomicAdd may have a bus-driven variant that still happens in roughly a “single” bus cycle. My guess would be that the atomicCAS implementation has some sequence it needs to follow on the bus, rather than a single cycle. That is just pure guesswork.

Or maybe migration eventually happens in the non-atomicCAS case(s), resulting in a lower performance hit, but the atomicCAS case doesn’t migrate the data as quickly.

Unit 6 of this online training series gives an overview of MM.

Thank you for the prompt reply! Your explanation makes sense. I had clearly misunderstood the cuMemAdvise documentation and incorrectly thought that “implies that the data will be accessed by device” means the access would be more optimal with the hint. I was originally trying to solve a different performance problem - frequent page faults when reading managed memory on the device in small chunks - by hinting the driver to prefetch the data more eagerly onto the device. Thanks to your reply I now realise this advice explicitly does not do what I was looking for, and it can actually keep the data on the host. I’ll need to study the documentation and the training materials more carefully!
