atomicCAS poor performance with managed memory after cuMemAdvise

The atomicCAS function performs very badly on managed memory after the driver is advised that the memory will be accessed by the device with:

cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

Here’s the reproducer - commenting out the line above changes the performance drastically.

#include <cuda.h>
#include <cstdio>

// Every thread attempts a compare-and-swap on the same location
__global__ void testKernel(int* var) {
  atomicCAS(var, threadIdx.x, threadIdx.x + 1);
}

int main() {

  CUdeviceptr var;
  unsigned int memSize{sizeof(int)};
  cuInit(0);

  int iDevice{0};
  CUdevice dev;
  cuDeviceGet(&dev, iDevice);
  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, dev);
  cuCtxSetCurrent(ctx);

  // Allocate managed memory, then advise the driver that it will be accessed by the device.
  // Commenting out the cuMemAdvise call is what changes the performance drastically.
  cuMemAllocManaged(&var, memSize, CU_MEM_ATTACH_GLOBAL);
  cuMemAdvise(var, memSize, CU_MEM_ADVISE_SET_ACCESSED_BY, dev);

  int* h_var = reinterpret_cast<int*>(var);
  *h_var = 0;

  testKernel<<<64,256>>>(h_var);

  cuCtxSynchronize();

  printf("output: %d\n", *h_var);

  cuMemFree(var);

  return 0;
}

Compiled with nvcc -o test test.cu -lcuda, using CUDA 12.5 on Ubuntu 22.04, running on a GeForce RTX 3060.

Checking with nsys profile --stats=true ./test, without the cuMemAdvise I get:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)        Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------------
    100.0          123,170          1  123,170.0  123,170.0   123,170   123,170          0.0  testKernel(int *)

and with the cuMemAdvise I get:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)        Name       
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -----------------
    100.0        8,158,069          1  8,158,069.0  8,158,069.0  8,158,069  8,158,069          0.0  testKernel(int *)

Checking with Nsight Compute, I see that the version with cuMemAdvise writes a lot more to device global memory (the baseline being the version without cuMemAdvise).

I checked that other atomic functions such as atomicAdd/atomicSub/atomicMax only slow down by around 30%, but for some reason atomicCAS slows down by a factor of more than 50x.
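For reference, the comparison can be done with kernels along these lines (a rough sketch, not the exact code; same <<<64,256>>> launch on the same managed int as above):

// slows down by roughly 30% with the cuMemAdvise hint
__global__ void testAdd(int* var) {
  atomicAdd(var, 1);
}

// slows down by roughly 30% with the cuMemAdvise hint
__global__ void testMax(int* var) {
  atomicMax(var, static_cast<int>(threadIdx.x));
}

// slows down by more than 50x with the cuMemAdvise hint
__global__ void testCAS(int* var) {
  atomicCAS(var, threadIdx.x, threadIdx.x + 1);
}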

Is it expected at all that advising the driver that managed memory will be used mainly from the device slows down atomic operations on that memory from a kernel running on that same device? Why is atomicCAS affected so drastically?

I won’t be able to offer a precise explanation, because AFAIK NVIDIA does not document the precise mechanism for atomics, including system atomics. But the behavior does not surprise me.

CU_MEM_ADVISE_SET_ACCESSED_BY is the wrong hint to use here (evidently).

It is a hint that says, roughly:

“Leave the data where it is, and build a mapping so that this GPU can access that data over a relevant link (e.g. PCIE or NVLink).”

We can argue over that word-by-word, perhaps, but the description explicitly says:

This advice does not cause data migration and has no impact on the location of the data per se.

It also explicitly says its purpose is to build a “mapping” so that the referenced GPU can access it wherever it may be.

What that means is that, instead of migrating the data to the processor of interest (the GPU), the mapping created by the hint lets the GPU access the data “wherever it is”, and there is no specific requirement or impetus for the managed memory (MM) system to migrate the data.

If the data is migrated, then the atomic operation would be performed on the data in “local” device memory, pretty much like any other atomic on non-managed device memory.

If the data is not migrated, the atomic has to be done as a system atomic, which involves generating some sort of sequence of bus cycles on the relevant bus (PCIE or NVLink). This is evidently, and sensibly, going to be slower.

Without that hint, the typical behavior of the MM subsystem is to migrate the data to the processor that touches it.

With the hint, the MM subsystem has the option to not migrate the data.

You could probably avoid the slowdown, while still providing that hint, by explicitly migrating the data (cuMemPrefetchAsync) before doing the atomics.
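For example, in your reproducer, something like this before the kernel launch (an untested sketch, error checking omitted):

  int* h_var = reinterpret_cast<int*>(var);
  *h_var = 0;

  // Explicitly migrate the managed allocation to the device before launching the kernel
  cuMemPrefetchAsync(var, memSize, dev, /*hStream=*/0);
  cuCtxSynchronize();  // make sure the migration has completed

  testKernel<<<64,256>>>(h_var);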

Anyway, although the actual behavior of the MM system is not specified to this level of detail (for valid reasons), the result is that the MM system has options, and you have specifically enabled an option that results in slower behavior.

In that sense, I would say it is “expected” behavior, or at least “plausible” and “valid” behavior.

atomicCAS is evidently (or probably) a special case. I don’t know the details, but system-level atomics like atomicAdd may have a bus-driven variant that still happens in roughly a “single” bus cycle. My guess would be that the atomicCAS implementation has some sequence it needs to follow on the bus, rather than a single cycle. That is just pure guesswork.

Or maybe migration eventually happens in the non-atomicCAS case(s), resulting in a lower performance hit, but the atomicCAS case doesn’t migrate the data as quickly.

Unit 6 of this online training series gives an overview of MM.

Thank you for the prompt reply! Your explanation makes sense. I had clearly misunderstood the cuMemAdvise documentation and incorrectly thought that “implies that the data will be accessed by device” means the access would be more optimal with the hint. I was originally trying to solve a different performance problem - frequent page faults when reading managed memory on the device in small chunks - by hinting the driver to prefetch the data more eagerly onto the device. Thanks to your reply I now realise this advice explicitly does not do what I was looking for, and it can actually keep the data on the host. I’ll need to study the documentation and the training materials more carefully!
