Crash when performing atomic subtraction on page-locked memory

Hello,

I’m using JetPack 3.1 on a TX2 to develop a GPU-accelerated application. I noticed that the GCC atomic builtin __atomic_sub_fetch cannot be used on memory that was allocated with cudaHostAlloc. The following test case always causes the application to hang:

#include <cuda.h>

int main()
{
    int* x;

    cudaHostAlloc(&x, 64, cudaHostAllocMapped);

    __atomic_sub_fetch(&x[0], 1, __ATOMIC_RELAXED);

    //never gets here
    return 0;
}

After executing this code, the following messages appear in the kernel log (dmesg):

[ 1156.502590] CPU0: SError detected, daif=140, spsr=0x60000000, mpidr=80000100, esr=bf000002
...

This is merely a bug report. Please let me know if this bug report should be posted elsewhere.

Thanks.

Hi,

Thanks for your feedback.
To help us narrow down the issue, could you check whether this error also occurs with unified memory?

Thanks.

Example code:

#include <cuda.h>
#include <stdio.h>

int main()
{
    int* x;

#if HOST_ALLOC
    if (cudaHostAlloc(&x, 64, cudaHostAllocMapped) != cudaSuccess) return -1;
#else
    if (cudaMallocManaged(&x, 64, cudaMemAttachGlobal) != cudaSuccess) return -1;
#endif

    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    x[0] = 123;

    __atomic_sub_fetch(&x[0], 1, __ATOMIC_RELAXED);

    printf("value: %d\n", x[0]);

    // never reached when built with HOST_ALLOC=1
    return 0;
}

This code works properly when built and run with:

nvcc -gencode arch=compute_62,code=sm_62 test.cu -DHOST_ALLOC=0 && ./a.out

It hangs when built and run with:

nvcc -gencode arch=compute_62,code=sm_62 test.cu -DHOST_ALLOC=1 && ./a.out

Hi,

Thanks for your report.

We can reproduce this issue and have passed the information to our internal team.
We will update this thread once we have a conclusion.

Thanks.

Hi,

This is a known issue.
CPU atomics on ARM are implemented as load-linked/store-conditional instructions, and those won’t work on WC/UC (write-combined/uncached) memory.

We will check if this issue is documented.
Thanks.

Hi,

Thanks for getting back to me. We will work around this limitation.

Thanks again,

Hi, I hit a similar error when I used cudaHostAlloc with a custom allocator and shared_ptr. Could you give a link to the document that describes this problem?

Hi, ghoogewerf, did you manage to solve the problem?

Hi Survial53,

Yes, we have managed to work around this limitation. I can’t share too many details, because the fix was incorporated into proprietary libraries. We use array/image objects with reference counting, implemented via atomics. The gist of it is that our array/image objects use a different memory allocation for their data versus their headers. The data allocation can be CUDA device memory or managed memory, but the header allocation is always done via the host heap (new/malloc).

I can’t share more than this. Hope it helps.

Thanks so much for the fast answer. This technique is very similar to OpenCV’s cv::Mat implementation. It looks like the most reliable approach.