cudaMallocManaged with cudaMemAttachHost

user54598 · October 10, 2022, 9:08am

As the CUDA documentation says, “It is not permitted for the CPU to access any managed allocations or variables while the GPU is active for devices with concurrentManagedAccess property set to 0. On these systems concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU”.
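
The property mentioned in the quote can be checked at run time. The following is a minimal sketch, assuming device 0 is the GPU of interest; the helper name `concurrent_managed_access` is made up for illustration and is not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the device's concurrentManagedAccess
// attribute (0 or 1), or -1 if the query fails (e.g. no CUDA device).
int concurrent_managed_access(int device) {
  int value = 0;
  cudaError_t err = cudaDeviceGetAttribute(
      &value, cudaDevAttrConcurrentManagedAccess, device);
  return (err == cudaSuccess) ? value : -1;
}

int main() {
  int v = concurrent_managed_access(0);  // assumption: device 0
  if (v < 0) {
    printf("attribute query failed\n");
    return 1;
  }
  printf("concurrentManagedAccess = %d\n", v);
  return 0;
}
```

A value of 0 means the restriction in the quote applies to that device.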

I have a simple test program in which memory is allocated with cudaMallocManaged and the cudaMemAttachHost flag. The CPU and GPU access the managed memory concurrently, and the code runs without error.
So why does this test code not cause a segmentation fault? Is it correct to use cudaMallocManaged like this?

#include <stdio.h>
#include <stdlib.h>   // rand, exit
#include <thread>
#include <unistd.h>   // sleep

// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


const int DSIZE = 4096;
const int block_size = 256;  // CUDA maximum is 1024
// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < ds)
    C[idx] = A[idx] + B[idx];
}

float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

void loop1 () {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  while(1) {
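    // Note: the third launch parameter is the dynamic shared memory size in bytes,
    // not a stream, so this kernel runs in the default stream and `stream` is unused.
    vadd<<<(DSIZE+block_size-1)/block_size, block_size,1>>>(h_A, h_B, h_C, DSIZE);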
    cudaCheckErrors("kernel launch failure");
    printf("A[0] = %f\n", h_A[0]);
    printf("B[0] = %f\n", h_B[0]);
    printf("C[0] = %f\n", h_C[0]);
  }
}

int main(){

  cudaMallocManaged(&h_A,DSIZE*sizeof(float),cudaMemAttachHost);
  cudaMallocManaged(&h_B,DSIZE*sizeof(float),cudaMemAttachHost);
  cudaMallocManaged(&h_C,DSIZE*sizeof(float),cudaMemAttachHost);

  cudaMallocManaged(&d_A,DSIZE*sizeof(float),cudaMemAttachHost);
  cudaMallocManaged(&d_B,DSIZE*sizeof(float),cudaMemAttachHost);
  
  cudaCheckErrors("cudaMallocManaged fail");
  for (int i = 0; i < DSIZE; i++){
    h_A[i] = rand()/(float)RAND_MAX;
    h_B[i] = rand()/(float)RAND_MAX;
    h_C[i] = 0;
  }
  
  std::thread thr(loop1);
  sleep(1);
  while(1) {
    d_A[0] ++;
    d_B[0] ++;
    printf("---- A[0] = %f\n", h_A[0]);
    printf("---- B[0] = %f\n", h_B[0]);
    printf("---- C[0] = %f\n", h_C[0]);
    printf("---- A[0] = %f\n", d_A[0]);
    printf("---- B[0] = %f\n", d_B[0]);
  }

  while(1) {
	  sleep(1000);
  }
  return 0;
}

neel_patel · October 13, 2022, 12:58am

Hi,
The reason you see behavior that differs from the usual “seg fault” is that passing the cudaMemAttachHost flag to cudaMallocManaged changes the default behavior of the allocation. By default, a managed allocation is automatically migratable. With this flag, it is not. Because the allocation is not migratable, you don’t get a seg fault (it is still accessible from host code), but it is not accessible from device code until you do something specific, even if you launch a kernel. You can read more about this in the runtime API description of cudaMallocManaged. So the code is illegal in the sense that it does not allow the allocation to migrate to the GPU before attempting to use it on the GPU. If it did allow that migration, you would see a seg fault.

Here is the relevant quote: “If cudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess; an explicit call to cudaStreamAttachMemAsync will be required to enable access on such devices.” Your code never makes that call. And it’s not simply a matter of making that call properly; you would also have to add the cudaDeviceSynchronize() that we normally expect to see, to make the code legal.
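
To illustrate the pattern described in the quote, here is a rough sketch of one way the attach and synchronize steps could be arranged. It is not the original poster's code: the `CHECK` macro and the `run_vadd_attached` function are hypothetical names, the data values are chosen only so the result is easy to verify, and cudaStreamSynchronize on the attached stream is used as an assumption in place of cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: return -1 from the enclosing function on any CUDA error.
// (For brevity, this sketch does not free resources on the error path.)
#define CHECK(call)                                                     \
  do {                                                                  \
    cudaError_t err_ = (call);                                          \
    if (err_ != cudaSuccess) {                                          \
      fprintf(stderr, "%s failed: %s\n", #call,                         \
              cudaGetErrorString(err_));                                \
      return -1;                                                        \
    }                                                                   \
  } while (0)

// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int n) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) C[idx] = A[idx] + B[idx];
}

// Returns the number of mismatching elements (0 if all match),
// or -1 if a CUDA call fails.
int run_vadd_attached(int n) {
  const int block = 256;
  float *A, *B, *C;
  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // Allocations start out attached to the host.
  CHECK(cudaMallocManaged(&A, n * sizeof(float), cudaMemAttachHost));
  CHECK(cudaMallocManaged(&B, n * sizeof(float), cudaMemAttachHost));
  CHECK(cudaMallocManaged(&C, n * sizeof(float), cudaMemAttachHost));

  // Host initialisation while no kernel is using the memory.
  for (int i = 0; i < n; i++) {
    A[i] = (float)i;
    B[i] = 2.0f * i;
    C[i] = 0.0f;
  }

  // Explicitly attach each allocation to the stream before device use.
  // The length argument is 0, meaning the whole allocation.
  CHECK(cudaStreamAttachMemAsync(stream, A, 0, cudaMemAttachSingle));
  CHECK(cudaStreamAttachMemAsync(stream, B, 0, cudaMemAttachSingle));
  CHECK(cudaStreamAttachMemAsync(stream, C, 0, cudaMemAttachSingle));

  // Launch in the same stream the memory is attached to.
  vadd<<<(n + block - 1) / block, block, 0, stream>>>(A, B, C, n);
  CHECK(cudaGetLastError());

  // Wait for the stream to finish before the host touches the data again.
  CHECK(cudaStreamSynchronize(stream));

  int bad = 0;
  for (int i = 0; i < n; i++)
    if (C[i] != 3.0f * i) ++bad;

  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  cudaStreamDestroy(stream);
  return bad;
}

int main() {
  int r = run_vadd_attached(4096);
  printf("result code: %d\n", r);
  return r == 0 ? 0 : 1;
}
```

Unlike the test program in the question, the host here only reads or writes the allocations when no work that uses them is pending on the GPU.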

system · November 2, 2022, 4:32am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
