__global__ void mykernel(int *addr) {
    atomicAdd_system(addr, 10); // only available on devices with compute capability 6.0 or higher
}
void foo() {
    int *addr;
    cudaMallocManaged(&addr, 4);
    *addr = 0;
    mykernel<<<1,1,0>>>(addr);
    __sync_fetch_and_add(addr, 10); // CPU atomic operation
}
This code gives me a segmentation fault on the last line, __sync_fetch_and_add(addr, 10);, when it tries to modify the unified memory at addr. I compiled it with -O2 -arch=sm_86 and am running on an RTX 3070 Ti with CUDA 12.1.
The fault goes away if I add a cudaDeviceSynchronize(); before accessing addr, but I need to do some async operations. Is there another flag I need to set to get this to work?
I guess not. It seems that __sync_fetch_and_add() is GNU-specific.
Anyway, I have no trouble running your code (on Linux). I suspect a system setup issue. Try using proper CUDA error checking. Not sure what that is? Google “proper CUDA error checking”, take the first hit, and apply it to your code.
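For readers who have not seen the pattern, here is a minimal sketch of what error checking could look like when applied to the code above. The macro name CUDA_CHECK and the function run_checked are hypothetical names chosen for illustration, not part of the CUDA API. The sketch synchronizes before the host touches the managed memory, so it should also run where concurrent access is unavailable. It assumes compilation with -arch=sm_60 or higher, since atomicAdd_system needs it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call's location and error string, then exit.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void mykernel(int *addr) {
    atomicAdd_system(addr, 10);
}

// Hypothetical wrapper around the original foo(), with every API call checked.
int run_checked() {
    int *addr = nullptr;
    CUDA_CHECK(cudaMallocManaged(&addr, sizeof(int)));
    *addr = 0;
    mykernel<<<1, 1>>>(addr);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran
    int result = *addr;                   // host access only after the GPU is idle
    CUDA_CHECK(cudaFree(addr));
    return result;
}
```

Calling run_checked() from main should either return the kernel's result or stop with a message naming the first failing call.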
If no errors are returned (apart from seg fault) then query the cudaDevAttrConcurrentManagedAccess property and print it out.
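A query along these lines should do it. This is only a sketch: the function name concurrent_managed_access is made up for illustration, while cudaDeviceGetAttribute and cudaDevAttrConcurrentManagedAccess are the runtime API names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device supports concurrent CPU/GPU
// access to managed memory, 0 if it does not, and -1 if the query itself fails.
int concurrent_managed_access(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("cudaDevAttrConcurrentManagedAccess = %d\n", value);
    return value;
}
```

Pass the device ordinal you are running on, for example concurrent_managed_access(0) on a single-GPU machine.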
I am running on WSL2.
I queried the attribute and it returns 0, so it looks like concurrent access is not supported. I also queried directManagedMemAccessFromHost, which is also 0.
Is that because I am running in WSL2, or because I am using a laptop GPU? The compute capability of the 3070 Ti is above 6.0, so I am confused about why it is not supported.
Yes. WSL2, even though it is a “Linux-like” environment, currently has the same underlying Unified Memory access limitation as Windows. Concurrent managed access is not possible. The limitation is documented here:
Unified Memory - Full Managed Memory Support is not available on Windows native and therefore WSL 2 will not support it for the foreseeable future.
Concurrent CPU/GPU access is not supported. CUDA queries will say whether it is supported or not and applications are expected to check this.
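One way an application could act on that query is sketched below. It is not the only option. The function name add_from_both_sides is hypothetical, the code assumes a GCC-compatible host compiler for __sync_fetch_and_add, and it assumes compilation with -arch=sm_60 or higher. Where the attribute is 0, the host waits for the GPU to go idle before touching the managed allocation, which gives up the overlap but avoids the fault.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(int *addr) {
    atomicAdd_system(addr, 10);
}

// Hypothetical function: adds 10 from the GPU and 10 from the CPU.
// Returns the final value, or -1 if a CUDA call fails.
int add_from_both_sides() {
    int device = 0;
    int concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess,
                               device) != cudaSuccess) return -1;

    int *addr = nullptr;
    if (cudaMallocManaged(&addr, sizeof(int)) != cudaSuccess) return -1;
    *addr = 0;  // safe: no kernel is running yet

    mykernel<<<1, 1>>>(addr);
    if (!concurrent) {
        // No concurrent managed access (e.g. Windows, WSL2): the host must not
        // touch managed memory while the GPU is active, so wait first.
        if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(addr); return -1; }
    }
    __sync_fetch_and_add(addr, 10);  // CPU atomic operation (GNU builtin)

    // Make sure the kernel has finished before reading the result.
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(addr); return -1; }
    int result = *addr;
    cudaFree(addr);
    return result;
}
```

On a platform with concurrent managed access the two atomic adds may overlap. On one without it they run one after the other. The final value should be the same either way.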