cudaMallocManaged with cudaMemAttachHost

As the CUDA documentation says: “It is not permitted for the CPU to access any managed allocations or variables while the GPU is active for devices with the concurrentManagedAccess property set to 0. On these systems concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.”
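Whether a given device falls under this restriction can be checked at runtime with cudaDeviceGetAttribute; a minimal sketch (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int concurrent = 0;
  // Query whether device 0 supports concurrent CPU/GPU access to managed memory
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
  printf("concurrentManagedAccess = %d\n", concurrent);
  // 0: the CPU must not touch managed memory while the GPU is active
  // 1: concurrent access is supported (e.g. Linux on Pascal and newer)
  return 0;
}
```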

I have a simple test program: memory is allocated with cudaMallocManaged and the cudaMemAttachHost flag, the CPU and GPU access the managed memory concurrently, and the code appears to work fine.
So why doesn't this test code cause a segmentation fault? Is it correct to use cudaMallocManaged like that?

#include <stdio.h>
#include <thread>
#include <unistd.h>

// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

const int DSIZE = 4096;
const int block_size = 256;  // CUDA maximum is 1024
// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < ds)
    C[idx] = A[idx] + B[idx];
}

float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

void loop1 () {
  while(1) {
    vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(h_A, h_B, h_C, DSIZE);
    cudaCheckErrors("kernel launch failure");
    printf("A[0] = %f\n", h_A[0]);
    printf("B[0] = %f\n", h_B[0]);
    printf("C[0] = %f\n", h_C[0]);
    sleep(1);
  }
}

int main(){

  cudaMallocManaged(&h_A, DSIZE*sizeof(float), cudaMemAttachHost);
  cudaMallocManaged(&h_B, DSIZE*sizeof(float), cudaMemAttachHost);
  cudaMallocManaged(&h_C, DSIZE*sizeof(float), cudaMemAttachHost);
  cudaCheckErrors("cudaMallocManaged fail");
  // managed memory has a single pointer; the d_* names alias the same allocations
  d_A = h_A; d_B = h_B; d_C = h_C;
  for (int i = 0; i < DSIZE; i++){
    h_A[i] = rand()/(float)RAND_MAX;
    h_B[i] = rand()/(float)RAND_MAX;
    h_C[i] = 0;}
  std::thread thr(loop1);
  while(1) {
    d_A[0] ++;
    d_B[0] ++;
    printf("---- A[0] = %f\n", h_A[0]);
    printf("---- B[0] = %f\n", h_B[0]);
    printf("---- C[0] = %f\n", h_C[0]);
    printf("---- A[0] = %f\n", d_A[0]);
    printf("---- B[0] = %f\n", d_B[0]);
    sleep(1);
  }
  return 0;
}

Hi,
It seems the reason you're getting behavior different from the usual "seg fault" behavior is that passing the cudaMemAttachHost flag to cudaMallocManaged changes the default behavior of an allocation. The default behavior is (automatically) migratable. But when you pass that particular flag, the allocation is not migratable. Since it is not migratable, you don't get a seg fault (it is still accessible from host code), but it is not accessible from device code until you do something specific, even if you launch a kernel. You can read more about that in the runtime API description of cudaMallocManaged. So the code is illegal in the sense that it has not allowed the allocation to migrate to the GPU before attempting to use that allocation on the GPU. If it did allow that migration, you would witness a seg fault.

Here is the relevant quote: “If cudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess; an explicit call to cudaStreamAttachMemAsync will be required to enable access on such devices.” Your code never makes that call. And it's not simply a matter of making that call (properly); you would also have to add in the cudaDeviceSynchronize() that we normally expect to see, to make the code legal.
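A minimal sketch of what that corrected flow could look like, assuming the h_A/h_B/h_C allocations from the test code above (the stream name is illustrative, and this omits the concurrent host thread entirely):

```cuda
// Sketch: enabling device access to a cudaMemAttachHost allocation on a
// device with concurrentManagedAccess == 0.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Re-attach the allocations globally so the device may access them
cudaStreamAttachMemAsync(stream, h_A, 0, cudaMemAttachGlobal);
cudaStreamAttachMemAsync(stream, h_B, 0, cudaMemAttachGlobal);
cudaStreamAttachMemAsync(stream, h_C, 0, cudaMemAttachGlobal);
cudaStreamSynchronize(stream);  // attachment takes effect after synchronization

vadd<<<(DSIZE+block_size-1)/block_size, block_size, 0, stream>>>(h_A, h_B, h_C, DSIZE);
cudaCheckErrors("kernel launch failure");
cudaDeviceSynchronize();        // GPU must be idle before the CPU touches the memory again
printf("C[0] = %f\n", h_C[0]);
```

Note that even after the attach, the CPU still must not access the memory while the kernel is running on such a device, which is why the cudaDeviceSynchronize() before the host read is essential.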
