Unified memory and concurrent C++ objects

Hello everyone,

Here’s a very simple piece of code in which concurrent C++ objects each launch a kernel that modifies a private array, synchronize, and then print the result. In each object, the private array is allocated with cudaMallocManaged. I’m compiling this code with:

/usr/local/cuda-8.0/bin/nvcc -std=c++11 -gencode arch=compute_62,code=sm_62 example.cu

and running it on a Jetson TX2.

I could not make this run on my Jetson TX1 for a good reason:

“Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.”

But according to this same quote, the code should work on my TX2 (compute capability 6.2), yet I get segmentation faults there as well.

What am I missing?

Thanks in advance!

#include <cstdio>   // printf
#include <thread>
#include <vector>

#define SIZE 100
#define N_THREADS 10

__global__ void kernel(int *x) {
  int i = threadIdx.x;
  x[i] = x[i] + i;
}

class Foo 
{
 public:
  Foo(int id) 
  {
    cudaMallocManaged(&_bar, SIZE * sizeof(int));
    for (int i=0; i < SIZE; i++)
      _bar[i] = id; 
  }

  ~Foo()
  {
    cudaFree(_bar);  // release the managed allocation
  }

  void work()
  {
    kernel<<<1, SIZE>>>(_bar);
    cudaDeviceSynchronize();
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }

 private:
  int *_bar;
}; 

int main()
{
  int i;
  std::thread t[N_THREADS];
  std::vector<Foo*> f = {}; 
  
  for (i=0; i < N_THREADS; i++)
  {
    f.push_back(new Foo(i));  
  }

  for (i=0; i < N_THREADS; i++)
  {
    t[i] = std::thread([&](int n)
      {   
      f[n]->work();
      }, i); 
  }

  for (i=0; i < N_THREADS; i++)
  {
    t[i].join();
    delete f[i];
  }

  return 0;
}

Hi,

We suspect this error is not caused by concurrent access to managed memory.
We can reproduce the segmentation fault even when the kernel performs no reads or writes:

__global__ void kernel() {
 // int i = threadIdx.x;
//  x[i] = x[i] + i;
}

class Foo
{
  ...
  void work()
  {
    kernel<<<1, SIZE>>>();
    cudaDeviceSynchronize();
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }
  ...
};

We are discussing this internally and will get back to you later.
Thanks.

Hi,

Sorry for the late reply.

After checking, we also see strange behavior when reading unified memory.
The failure rate is much lower when the program is run under cuda-memcheck.

We have passed this issue to our internal CUDA team and will update this topic once we have further information.

Thanks.

Great!

I hope you guys can work this out without hurting performance, because unified memory is way faster than zero-copy memory (at least in our application).

Hi,

Is this the same question as my topic https://devtalk.nvidia.com/default/topic/1028582/why-can-t-i-use-unified-memory-in-two-cpu-thread-/ ?

Thanks!

Hi, ClancyLian

Could you share the source from topic 1028582 so we can check it further?
https://devtalk.nvidia.com/default/topic/1028582/why-can-t-i-use-unified-memory-in-two-cpu-thread-/

Thanks.

Yes ClancyLian, I haven’t seen your code, but it sounds very much like the same issue.

Hi, AastaLLL and mescarra

It is the same question! When I set

N_THREADS 1

the segmentation faults no longer happen.

Also, because the calculation in your kernel is not complex, the error may not occur with only 2 or 3 threads, or…

Thanks.

Hi,

Thanks for the confirmation.

This issue has already been passed to our internal team for checking.
We will post an update here if there are any findings.

Update:

The source shared in comment #1 is incorrect.

It doesn’t attach the managed memory to any stream, so all managed memory in the app is global. All threads launch, synchronize, and read. If one thread launches a kernel while another is reading on the CPU, the program is bound to fail, because the kernel launch makes all global managed memory inaccessible from the CPU.

All these usages are well documented in the unified memory programming guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Most importantly, if an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a managed variable; hence, the simple-case rule that the CPU may not touch the data while any kernel is running.

To use unified memory from multiple CPU threads, you should use multiple streams and attach each allocation to a stream with cudaStreamAttachMemAsync. The per-thread default streams feature would also simplify the programming.
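Based on that guidance, here is a minimal sketch of how the Foo class from the first post could be restructured so that each object owns its own stream and attaches its managed allocation to it. This is an illustrative sketch, not tested code: it reuses the names from the original example, omits error checking, and assumes a compute capability >= 6.0 device such as the TX2.

```cuda
#include <cstdio>

#define SIZE 100

__global__ void kernel(int *x) {
  int i = threadIdx.x;
  x[i] = x[i] + i;
}

class Foo
{
 public:
  Foo(int id)
  {
    cudaStreamCreate(&_stream);
    cudaMallocManaged(&_bar, SIZE * sizeof(int));
    // Restrict visibility of _bar to this object's stream only, so a
    // kernel launch in another thread's stream no longer makes it
    // inaccessible from the CPU.
    cudaStreamAttachMemAsync(_stream, _bar, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(_stream);  // wait for the attach to take effect
    for (int i = 0; i < SIZE; i++)
      _bar[i] = id;
  }

  ~Foo()
  {
    cudaFree(_bar);
    cudaStreamDestroy(_stream);
  }

  void work()
  {
    // Launch into this object's own stream instead of the default stream.
    kernel<<<1, SIZE, 0, _stream>>>(_bar);
    cudaStreamSynchronize(_stream);
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }

 private:
  int *_bar;
  cudaStream_t _stream;
};
```

With this structure, synchronizing only on the object's own stream means one CPU thread can read its attached allocation while other threads are still running kernels in their streams. Alternatively, compiling with per-thread default streams (nvcc --default-stream per-thread) achieves a similar separation without managing streams by hand.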

Thanks