Is atomicCAS() still atomic for a managed variable (a variable using unified memory)?

I have run into a problem where atomicCAS() does not seem to be atomic. The abstract problem is described below.

Say two GPUs are going to update the variable “var” with their gpuId (0 and 1). The variable uses unified memory and is initialized to 999. They update it with:

atomicCAS(&var, 999, gpuId);

Note that “var” is given the advice cudaMemAdviseSetAccessedBy for one of the GPUs.

Problem: “var” gets updated twice.

Why wouldn’t var be updated twice if two GPUs are updating it?

  1. If the first GPU (gpuId 0) updates var atomically, var becomes 0, and the second GPU’s CAS will fail since var no longer equals 999.
  2. If the second GPU (gpuId 1) updates var atomically, var becomes 1, and the first GPU’s CAS will fail since var no longer equals 999.

Is that correct?

They will both do an atomic operation. One will pass the compare test, one will fail it.
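
For reference, atomicCAS(address, compare, val) performs the following read-modify-write as one indivisible step and returns the old value (equivalent pseudocode, not actual device code):

old = *address;
if (old == compare) { *address = val; }
return old;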

That is also what I expect :-). I’ve created a small demo of the problem.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdint.h>
#include <stdio.h>

__managed__ uint32_t var;
__managed__ bool updated[2];

#define CUDA_CHECK_RETURN(value) {											\
	cudaError_t _m_cudaStat = value;										\
	if (_m_cudaStat != cudaSuccess) {										\
		fprintf(stderr, "Error %s at line %d in file %s\n",					\
				cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__);		\
		exit(1);															\
	} }

__global__ void update(uint32_t gpuId)
{
	// atomicCAS returns the old value, so a return of 999 means this
	// GPU's compare-and-swap succeeded and var now holds its gpuId.
	if (atomicCAS(&var, 999, gpuId) == 999)
	{
		updated[gpuId] = true;
	}
}

__global__ void init()
{
	var = 999;
	updated[0] = false;
	updated[1] = false;
}

int main(void)
{
	// Match the scenario above: var prefers GPU 0 and is also mapped
	// for access by GPU 1.
	CUDA_CHECK_RETURN(cudaMemAdvise(&var, sizeof(uint32_t), cudaMemAdviseSetAccessedBy, 1));
	CUDA_CHECK_RETURN(cudaMemAdvise(&var, sizeof(uint32_t), cudaMemAdviseSetPreferredLocation, 0));

	for (int i = 0; i < 600; ++i)
	{
		cudaSetDevice(0);
		init<<<1,1>>>();
		// Make sure init has finished before the two GPUs race on var.
		CUDA_CHECK_RETURN(cudaDeviceSynchronize());
		update<<<1,1>>>(0);
		cudaSetDevice(1);
		update<<<1,1>>>(1);

		// Synchronize both devices before the host reads the flags.
		CUDA_CHECK_RETURN(cudaDeviceSynchronize());
		cudaSetDevice(0);
		CUDA_CHECK_RETURN(cudaDeviceSynchronize());

		// Both flags set means both compare-and-swaps "succeeded":
		// the atomicity violation we are looking for.
		if (updated[0] && updated[1]) {
			printf("GPU0: %d, GPU1: %d, iteration: %d\n", updated[0], updated[1], i);
		}
	}
	return 0;
}

I compiled the code with

nvcc -arch=sm_60 -lcudart -lcuda -o cas cas.cu

And my system info:
OS: Linux Ubuntu 16.04 64-bit
GPUs: 2x GTX 1080
CUDA & toolkit version: 9.1

If both GPUs have updated var, a line is printed. The chance of a printf is much lower if you comment out the two cudaMemAdvise calls. Each launch of the application prints a varying number of lines; without the two advise calls, you may need to launch the application more than 20 times to see a single printf.

Can you see where the problem is? Thanks.

You should be using atomicCAS_system to guarantee atomicity between GPUs:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

I would also recommend compiling with -arch=sm_61 to match your GPU architecture (the GTX 1080 is compute capability 6.1), but that is aside from the issue here.
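
For illustration, here is how the kernel from the repro above might look with the system-wide variant (a sketch reusing the same __managed__ declarations; atomicCAS_system requires compute capability 6.0 or higher, which the GTX 1080 satisfies):

__global__ void update(uint32_t gpuId)
{
	// The _system suffix widens the atomic scope to the whole system,
	// so the compare-and-swap is atomic across all GPUs (and the CPU),
	// not just within the issuing device.
	if (atomicCAS_system(&var, 999, gpuId) == 999)
	{
		updated[gpuId] = true;
	}
}

With that change, the double-update should no longer appear.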

Thank you very much! That is exactly what I need.