Accessing cudaMallocManaged memory after a kernel launch causes a crash

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>

#define CUDA_RUNTIME_CHECK(call)                                                           \
  {                                                                                        \
    cudaError_t res = (call);                                                              \
    if (res != cudaSuccess) {                                                              \
      std::cout << "CUDA Runtime API " << cudaGetErrorName(res) << ": "                    \
                << cudaGetErrorString(res) << " [" << __FILE__ << ":" << __LINE__ << ']'   \
                << std::endl;                                                              \
      exit(1);                                                                             \
    }                                                                                      \
  }
__global__ void kernel_empty_test() {}

int main() {
  void* brr = nullptr;
  CUDA_RUNTIME_CHECK(cudaMallocManaged(&brr, 88));

  std::cout << brr << std::endl;

  memset(brr, 0, 88);

  // Remove this line, then the program won't crash.
  kernel_empty_test<<<1, 1, 0>>>();

  const int32_t* data = static_cast<const int32_t*>(brr);
  std::cout << data[0] << std::endl;
  return 0;
}

nvcc -o rua && ./rua
[1]    28250 segmentation fault (core dumped)  ./rua

Linux 5.10.104-tegra
Inside docker image: l4t-ml:r35.2.1-py3
nvcc 11.4


Could you try adding a synchronization call after launching the kernel?
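Concretely, the suggestion looks like this (a sketch based on the repro above, with the error checking simplified to early returns; not tested on your exact setup):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

__global__ void kernel_empty_test() {}

int main() {
  void* brr = nullptr;
  if (cudaMallocManaged(&brr, 88) != cudaSuccess) return 1;
  memset(brr, 0, 88);

  kernel_empty_test<<<1, 1, 0>>>();
  // The added line: block until all outstanding GPU work is done
  // before touching managed memory from the CPU again.
  if (cudaDeviceSynchronize() != cudaSuccess) return 1;

  const int32_t* data = static_cast<const int32_t*>(brr);
  printf("%d\n", data[0]);  // should no longer fault on Jetson
  cudaFree(brr);
  return 0;
}
```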



Adding a synchronization before accessing the data works, but the program still crashes if I only add a synchronization after the memory access. Do you think that's expected?

I wanted to launch a kernel and then do some CPU computation on unified memory, so the CPU and GPU computations could run in parallel. I think the objects accessed by the CPU won't be used by the GPU, so I shouldn't need to sync and the work could proceed in parallel. Did I misunderstand something?

Thanks for your help!


Jetson doesn't support concurrent access to managed memory, so you will need to make sure the GPU tasks are done before accessing it from the CPU.

But in your use case the kernel is actually doing nothing, so this does look strange.
We need to discuss this with our internal team and will let you know what we find.
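In the meantime, one pattern that can give you the overlap you describe on Jetson (a sketch, not a definitive recipe: the stream and the plain host scratch buffer are illustrative assumptions) is to keep the CPU-side work on ordinary host memory while the kernel is in flight, and only touch the managed buffer after synchronizing:

```cuda
#include <cstdint>
#include <cstdio>

__global__ void kernel_empty_test() {}

int main() {
  void* brr = nullptr;
  if (cudaMallocManaged(&brr, 88) != cudaSuccess) return 1;

  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

  kernel_empty_test<<<1, 1, 0, stream>>>();  // GPU work on its own stream

  // CPU work on plain (non-managed) host memory can run concurrently...
  int32_t host_scratch[22] = {0};
  for (int i = 0; i < 22; ++i) host_scratch[i] = i;

  // ...but the managed buffer must wait for the GPU to finish.
  if (cudaStreamSynchronize(stream) != cudaSuccess) return 1;
  static_cast<int32_t*>(brr)[0] = host_scratch[0];

  printf("%d\n", static_cast<int32_t*>(brr)[0]);
  cudaStreamDestroy(stream);
  cudaFree(brr);
  return 0;
}
```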



We got some feedback from our internal team.
This issue can be fixed by attaching the buffer to the host with the cudaMemAttachHost flag.

@@ -15,6 +15,7 @@ __global__ void kernel_empty_test() {}
 int main() {
   void* brr = nullptr;
   CUDA_RUNTIME_CHECK(cudaMallocManaged(&brr, 88));
+  CUDA_RUNTIME_CHECK(cudaStreamAttachMemAsync(NULL, brr, 0, cudaMemAttachHost));
   std::cout << brr << std::endl;

The CUDA driver does not know whether a GPU kernel will access a given piece of managed memory, so it has to assume all managed memory might be used and applies the necessary protection, which causes the segmentation fault here.
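Putting the whole fix together (a sketch of the patched repro with simplified error handling; the explicit synchronize after the attach is an extra precaution, since cudaStreamAttachMemAsync is asynchronous):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

__global__ void kernel_empty_test() {}

int main() {
  void* brr = nullptr;
  if (cudaMallocManaged(&brr, 88) != cudaSuccess) return 1;

  // Attach the buffer to the host: the driver now knows kernels will not
  // access it, so CPU access stays valid even while GPU work is outstanding.
  if (cudaStreamAttachMemAsync(NULL, brr, 0, cudaMemAttachHost) != cudaSuccess) return 1;
  if (cudaStreamSynchronize(NULL) != cudaSuccess) return 1;  // let the attach take effect

  memset(brr, 0, 88);
  kernel_empty_test<<<1, 1, 0>>>();

  // No longer faults: the buffer is host-attached.
  const int32_t* data = static_cast<const int32_t*>(brr);
  printf("%d\n", data[0]);
  cudaFree(brr);
  return 0;
}
```

Note that while the buffer is host-attached, kernels must not dereference it; if you later need GPU access, re-attach it (e.g. with cudaMemAttachGlobal) and synchronize first.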


