Unified memory with CUDA 8

Hello, developers!
I am trying to test the new unified memory features in CUDA 8. My system has a GTX 1070 and Windows 7. I followed the instructions at https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/, but I am not sure whether it actually works on my system. According to that page, it looks like I no longer need to use cudaMallocManaged() for memory allocation in CUDA 8, but is there any other setup I need to do?

You’ll note in the link you provided that it indicates that the “new feature” (demand-paged system memory) requires operating system support.

This feature is not available on Windows (any version) and requires specific Linux kernels with specific patches. I won’t be able to give a detailed recipe here.

Hi all,

@txbob, so once we have OS support, can we use demand-paged system memory on non-Pascal cards?

At the moment, I have OS support on a POWER8 machine (Ubuntu 16.04 and a Tesla K40). I am trying to run the following example, but I am getting an error.

#include <cstdlib>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  // system allocator, not cudaMallocManaged
  char *data = (char *)malloc(10 * sizeof(char));
  mykernel<<<1024, 1024>>>(data);
  // no synchronize here
  data[0] = 'c';
  free(data);
  return 0;
}

My previous comment was not really accurate.

CUDA 8 demand-paged unified memory (UM), including memory oversubscription, is supported as-is, as long as you are using a supported environment for UM in CUDA 8. This requires a Pascal GPU but does not require any specific OS patches or the like.

The above behavior will require the use of the managed allocator, e.g. cudaMallocManaged.
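For reference, here is a minimal sketch of that managed-allocator path (the kernel name, grid size, and allocation size are illustrative, and error checking is omitted for brevity; on a Pascal GPU, pages of the managed allocation migrate on demand between host and device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 'g';
}

int main() {
  char *data;
  size_t n = 10;
  // Managed allocator: the same pointer is usable from host and device.
  cudaMallocManaged(&data, n);
  touch<<<1, 32>>>(data, n);
  // Synchronize before touching the allocation from the host.
  cudaDeviceSynchronize();
  printf("%c\n", data[1]);
  cudaFree(data);
  return 0;
}
```

On Pascal, the same pattern also permits allocations larger than GPU memory (oversubscription), since pages are faulted in as the kernel touches them.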

What the OS patch (or new Linux kernel) will enable (on Linux only, of course) is the possibility of the above behavior even if a system allocator is used, e.g. malloc.

(kewl)

This particular wrinkle (use of system allocator/kernel patch) is not something I have a full recipe for yet.

There is no support for demand-paging or any of the above additions to UM when a pre-Pascal GPU is used.

Sorry for previous misinformation/lack of clarity.

Hi @txbob, thank you for the useful information. It is all very enlightening for me. I think at the moment many things about CUDA UM are more or less unclear, but I’m pretty sure users will figure it out soon :)

In terms of CUDA 8.0 and Kepler devices: as I understand it, the code I provided above is not going to work, then. However, AFAIK, CUDA 8.0 has another feature that allows access to the same pointer from device and host simultaneously without requiring cudaDeviceSynchronize. So I am also wondering: when we use cudaMallocManaged with CUDA 8.0 on Kepler devices, do we still need synchronization? I tried to illustrate this with the simple multi-threaded example below.

#include <omp.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *data;
  cudaMallocManaged(&data, 10);

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    if (tid % 2 == 0) {
      mykernel<<<1024, 1024>>>(data);
      // no synchronize here
    } else {
      data[0] = 'c';
    }
  }
  cudaFree(data);
  return 0;
}

CUDA 8 doesn’t fundamentally change UM behavior or limitations on Kepler devices.

Synchronization is required in the case you have shown, so I would expect a seg fault when you touch data[0] in host code after the kernel launch without synchronization.

This is documented in the programming guide in the UM section if you care to look.
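To make the Kepler-valid pattern concrete, here is a minimal corrected sketch of the example above: the host only touches the managed allocation after cudaDeviceSynchronize() returns (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *data;
  cudaMallocManaged(&data, 10);
  mykernel<<<1024, 1024>>>(data);
  // On pre-Pascal devices the host must not touch managed memory
  // between kernel launch and synchronization.
  cudaDeviceSynchronize();
  data[0] = 'c';
  cudaFree(data);
  return 0;
}
```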

Furthermore, there are various CUDA sample codes that demonstrate proper use of UM.

Hi,

I have two Quadro K420s. Can I try unified memory on Windows 10 (updated)?

Can I convert a host pointer (perhaps one only registered with CUDA for pinning) into a unified memory region, on both Windows 10 and Ubuntu 16.04, without using a CUDA-side allocator? (For example, to CUDA-accelerate a library that provides its own host pointers.)

Thank you.

UM doesn’t allow you to convert host pointers (already allocated) to managed pointers.
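The closest supported alternative for an existing host allocation is pinning and mapping it, as the question hints at. A sketch, assuming a device that supports mapped host memory (note this is zero-copy access over the bus, not unified memory, so every device access pays PCIe latency):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *host = (char *)malloc(10);
  // Pin the existing host allocation and map it into the device address space.
  cudaHostRegister(host, 10, cudaHostRegisterMapped);
  char *dev;
  cudaHostGetDevicePointer((void **)&dev, host, 0);
  mykernel<<<1, 1>>>(dev);   // kernel writes host memory directly (zero-copy)
  cudaDeviceSynchronize();
  cudaHostUnregister(host);
  free(host);
  return 0;
}
```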

Also, UM on pre-Pascal devices does not work in a demand-paging mode.