Unified memory with CUDA 8

Hello, developers!
I am trying to test the new unified memory features in CUDA 8. My system has a GTX 1070 and Windows 7. I followed the instructions at https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/, but I am not sure whether it actually works on my system. According to that page, it looks like I no longer need to use cudaMallocManaged() for memory allocation in CUDA 8, but is there any other setup I need to do?

You’ll note in the link you provided that it indicates that the “new feature” (demand-paged system memory) requires operating system support.

This feature is not available on Windows (any version) and requires specific Linux kernels with specific patches. I won’t be able to give a detailed recipe here.

Hi all,

@txbob, so once we have OS support, can we use demand-paged system memory on non-Pascal cards?

At the moment, I have OS support on a POWER8 machine (Ubuntu 16.04 and a Tesla K40). I am trying to run the following example, but I am getting an error.

#include <cstdlib>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  // system allocator, not cudaMallocManaged
  char *data = (char *)malloc(10 * sizeof(char));
  mykernel<<<1024, 1024>>>(data);
  // no synchronize here
  data[0] = 'c';
  free(data);
  return 0;
}

My previous comment was not really accurate.

CUDA 8 demand-paged unified memory (UM), including memory oversubscription, is supported as-is, as long as you are using a supported environment for UM in CUDA 8. This requires a Pascal GPU but does not require any specific OS patches or the like.

The above behavior will require the use of the managed allocator, e.g. cudaMallocManaged.
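For reference, here is a minimal sketch of that managed-allocator path (the kernel name, grid size, and allocation size are illustrative, and error checking is omitted for brevity; on a Pascal GPU, pages of the managed allocation migrate on demand between host and device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 'g';
}

int main() {
  char *data;
  size_t n = 10;
  // Managed allocator: the same pointer is usable from host and device.
  cudaMallocManaged(&data, n);
  touch<<<1, 32>>>(data, n);
  // Synchronize before touching the allocation from the host.
  cudaDeviceSynchronize();
  printf("%c\n", data[1]);
  cudaFree(data);
  return 0;
}
```

On Pascal, the same pattern also permits allocations larger than GPU memory (oversubscription), since pages are faulted in as the kernel touches them.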

What the OS patch (or new Linux kernel) will enable (on Linux only, of course) is the possibility of the above behavior even if a system allocator is used, e.g. malloc.

(kewl)

This particular wrinkle (use of system allocator/kernel patch) is not something I have a full recipe for yet.

There is no support for demand-paging or any of the above additions to UM when a pre-Pascal GPU is used.

Sorry for previous misinformation/lack of clarity.

Hi @txbob, thank you for the useful information. It is all very enlightening for me. I think at the moment many things about CUDA UM are more or less unclear, but I’m pretty sure users will figure it out soon :)

In terms of CUDA 8.0 and Kepler devices: as I understand it, the code I provided above is not going to work, then. However, AFAIK, CUDA 8.0 has another feature that allows access to the same pointer from device and host simultaneously without requiring cudaDeviceSynchronize. So I am also wondering: when we use cudaMallocManaged with CUDA 8.0 on Kepler devices, do we still need synchronization? I tried to illustrate this with the simple multi-threaded example below.

#include <omp.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *data;
  cudaMallocManaged(&data, 10);

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    if (tid % 2 == 0) {
      mykernel<<<1024, 1024>>>(data);
      // no synchronize here
    } else {
      data[0] = 'c';
    }
  }
  cudaFree(data);
  return 0;
}

CUDA 8 doesn’t fundamentally change UM behavior or limitations on Kepler devices.

Synchronization is required in the case you have shown, so I would expect a seg fault when you touch data[0] in host code after the kernel launch without synchronization.

This is documented in the programming guide in the UM section if you care to look.
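To make the Kepler-valid pattern concrete, here is a minimal corrected sketch of the example above: the host only touches the managed allocation after cudaDeviceSynchronize() returns (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *data;
  cudaMallocManaged(&data, 10);
  mykernel<<<1024, 1024>>>(data);
  // On pre-Pascal devices the host must not touch managed memory
  // between kernel launch and synchronization.
  cudaDeviceSynchronize();
  data[0] = 'c';
  cudaFree(data);
  return 0;
}
```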

Furthermore, there are various CUDA sample codes that demonstrate proper use of UM.

Hi,

I have two Quadro K420s. Can I try unified memory on Windows 10 (updated)?

Can I convert a host pointer (perhaps one only registered with CUDA for pinning) into a unified memory region, on both Windows 10 and Ubuntu 16.04, without using a CUDA-side allocator? (For example, to CUDA-accelerate a library that provides its own host pointers.)

Thank you.

UM doesn’t allow you to convert host pointers (already allocated) to managed pointers.
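The closest supported alternative for an existing host allocation is pinning and mapping it, as the question hints at. A sketch, assuming a device that supports mapped host memory (note this is zero-copy access over the bus, not unified memory, so every device access pays PCIe latency):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void mykernel(char *data) {
  data[1] = 'g';
}

int main() {
  char *host = (char *)malloc(10);
  // Pin the existing host allocation and map it into the device address space.
  cudaHostRegister(host, 10, cudaHostRegisterMapped);
  char *dev;
  cudaHostGetDevicePointer((void **)&dev, host, 0);
  mykernel<<<1, 1>>>(dev);   // kernel writes host memory directly (zero-copy)
  cudaDeviceSynchronize();
  cudaHostUnregister(host);
  free(host);
  return 0;
}
```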

Also, UM on pre-Pascal devices does not work in a demand-paging mode.