Segmentation fault when calling virtual function on host

Here is a simple example:

#include <stdio.h>

class Base {
public:
	virtual void disp() {
		printf("base\n");
	}
};

class Derived : public Base {
public:
	void disp() {
		printf("derived\n");
	}
};

int main() {
	Derived *p_derived;
	Base *p_base;

	cudaMallocManaged(&p_derived, sizeof(Derived));

	p_derived->disp();
	p_base = p_derived;
	p_base->disp();

	cudaFree(p_derived);

	return 0;
}

This example does not even involve a __global__ function; it is purely on the host side. Yet it gives a segmentation fault. If I replace the line “cudaMallocManaged(&p_derived, sizeof(Derived));” with “p_derived = new Derived();”, then everything is OK. According to the CUDA documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-opt-in), there should be no difference between using “new” and “cudaMallocManaged”.

The CUDA installed on my server is CUDA 10.0.130. The device compute capability is 7.0. The OS is CentOS Linux release 7.6.1810, and the kernel version is 3.10.0-957.el7.x86_64. The administrators who installed CUDA on my server are not very familiar with CUDA, so they have no idea either.

Can anyone please help? Thank you.

Your cudaMallocManaged call is allocating empty space, where the size of that space is equal to the size of the class. That is not the same as calling new on the class: calling new will invoke the class constructor, whereas cudaMallocManaged does nothing of the sort.

The doc link you provided doesn’t suggest that new and cudaMallocManaged are equivalent. If you think it does, let’s discuss the specific section that you think states this: quote the text please.

The virtual function pointer table is part of the object itself, so it must be initialized for proper use.

One way to “fix” your code would be something like this:

#include <stdio.h>
#include <cstring>   // for memcpy

class Base {
public:
        virtual void disp() {
                printf("base\n");
        }
};

class Derived : public Base {
public:
        void disp() {
                printf("derived\n");
        }
};

int main() {
        Derived *p_derived;
        Base *p_base;
        Derived x;   // a properly constructed host object (vtable pointer set)
        cudaMallocManaged(&p_derived, sizeof(Derived));
        memcpy(p_derived, &x, sizeof(Derived));   // copy the constructed object, vtable pointer included
        p_derived->disp();
        p_base = p_derived;
        p_base->disp();

        cudaFree(p_derived);

        return 0;
}
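
Another option, if you’d rather not keep a separate host object around, is placement new, which runs the constructor directly on the managed allocation and thereby initializes the vtable in place. A sketch (note that placement new requires a matching explicit destructor call before cudaFree):

#include <stdio.h>
#include <new>       // for placement new

class Base {
public:
        virtual void disp() {
                printf("base\n");
        }
};

class Derived : public Base {
public:
        void disp() {
                printf("derived\n");
        }
};

int main() {
        Derived *p_derived;
        cudaMallocManaged(&p_derived, sizeof(Derived));
        new (p_derived) Derived();   // construct in place: sets the vtable pointer
        p_derived->disp();
        Base *p_base = p_derived;
        p_base->disp();
        p_derived->~Derived();       // explicit destructor call to match placement new
        cudaFree(p_derived);
        return 0;
}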

Although it’s not the focus of your question, you may wish to be aware that objects with virtual functions cannot be used reliably in device code if they were allocated/initialized in host code and then passed from host to device:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#virtual-functions

“If an object is created in host code, invoking a virtual function for that object in device code has undefined behavior.”

Thank you very much for your quick reply! I was aware that cudaMallocManaged differs from “new” in that the data members of the class object are not initialized, but I did not realize that this causes the segfault when a virtual function is called, even on the host only.

The excerpt from “Programming Guide :: CUDA Toolkit Documentation” that confuses me is this:

“Starting with CUDA 8.0 and on supporting systems with devices of compute capability 6.x, memory allocated with the default OS allocator (e.g. malloc or new) can be accessed from both GPU code and CPU code using the same pointer. On these systems, Unified Memory is the default: there is no need to use a special allocator or the creation of a specially managed memory pool.”

Now I see that I misunderstood it. Thank you for pointing out.

Since we have reached this topic, I have another question regarding the difference between cudaMallocManaged and new. Here is another code snippet:

#include <stdio.h>
#include <iostream>
using namespace std;

class Simpclass {
private:
	int m = 8;

public:
	__device__ __host__ int get_num() {
		return m;
	}
};

__global__
void simp_kernel_func(Simpclass *p) {
	printf("kernel func called\n");
	printf("m = %d\n", p->get_num());
	printf("m should have been displayed above\n");
	return;
}

int main() {
	Simpclass *p;
	p = new Simpclass();

	simp_kernel_func<<<1, 1>>>(p);
	cudaDeviceSynchronize();

	cout << "after kernel: " << p->get_num() << endl;
	delete p;
	return 0;
}

I was hoping the result would be something like this:

kernel func called
m = 8
m should have been displayed above
after kernel: 8

However, what I got is:

kernel func called
after kernel: 8

Apparently, the kernel function terminated early without any warning message. If I replace “new” with “cudaMallocManaged” and “delete” with “cudaFree”, I get this result:

kernel func called
m = 0
m should have been displayed above
after kernel: 0

This result is expected, since cudaMallocManaged does not initialize the data member (i.e. int m = 8;) of the class object. But I do not understand why the kernel function aborts early when “new” is used to allocate the unified memory.

Can you please help? Thank you.

P.S. Thank you for reminding me of the CUDA restrictions on virtual functions. So if I pass a pointer to a class object to the kernel function, but do not invoke the virtual function on device, is that still allowed?

In fact that excerpt, with particular attention to “on supporting systems”, applies most directly to systems with the linux HMM kernel/patch. That is probably not your system. It also has some relevance to IBM Power9/Volta systems, which use an alternate hardware technique to provide a similar capability.

So unless you fall into one of those two categories, it does not apply to you. (Furthermore it does not equate new and cudaMallocManaged).

Would you like a warning message? If so, use proper CUDA error checking and/or run your code with cuda-memcheck. Errors are being reported through the API; you are ignoring them. Not sure what “proper CUDA error checking” is? Google it, take the first hit, start reading, and apply it to your code.
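
A minimal sketch of such a check (the macro name here is just a common convention, not part of the CUDA API):

#include <stdio.h>

// Wraps a runtime API call and reports any failure with file/line info.
#define checkCuda(call)                                              \
        do {                                                         \
                cudaError_t err = (call);                            \
                if (err != cudaSuccess) {                            \
                        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                                cudaGetErrorString(err),             \
                                __FILE__, __LINE__);                 \
                }                                                    \
        } while (0)

// Usage:
//   checkCuda(cudaMallocManaged(&p, sizeof(Simpclass)));
//   simp_kernel_func<<<1, 1>>>(p);
//   checkCuda(cudaGetLastError());        // catches kernel launch errors
//   checkCuda(cudaDeviceSynchronize());   // catches errors during kernel execution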

The fundamental problem with your code is basic. You have allocated a host pointer:

Simpclass *p;
p = new Simpclass();

p points to an area in host memory. That pointer, as well as anything it points to, is not usable in CUDA device code in any way. That is a really fundamental CUDA programming concept: host pointers cannot be dereferenced in device code, and device pointers cannot be dereferenced in host code.

So it’s no surprise that when you attempt to dereference a host pointer in device code:

printf("m = %d\n", p->get_num());
                    ^
                    dereferencing a host pointer in device code

the kernel execution fails.

new (ordinarily) does not allocate unified memory. Unless you do something special, it is a host-based allocator; CUDA doesn’t change that.

new can be used to allocate unified memory, but to do so you need to overload the new operator (an ordinary C++ overload). Such an overload can be used for classes you define; it cannot be used to change the behavior of new for POD (language built-in) data types.

https://devblogs.nvidia.com/unified-memory-in-cuda-6/
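
The gist of that post, as a sketch: a mix-in base class whose overloaded new/delete place derived objects in managed memory, so the allocation is managed and the constructor still runs:

#include <cstddef>

class Managed {
public:
        void *operator new(std::size_t len) {
                void *ptr;
                cudaMallocManaged(&ptr, len);
                cudaDeviceSynchronize();
                return ptr;
        }
        void operator delete(void *ptr) {
                cudaDeviceSynchronize();
                cudaFree(ptr);
        }
};

If your Simpclass inherited from Managed, then new Simpclass() would both allocate managed memory and run the constructor (so your kernel would see m = 8), and delete would route to cudaFree.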

So I see that on my platform, “new” allocates host memory, while “cudaMallocManaged” allocates unified memory. I cannot treat them equally. Thank you so much for your help!

Hi Robert,

I noticed an interesting fact with the example you gave me previously:

#include <stdio.h>
#include <cstring>   // for memcpy

class Base {
public:
        virtual void disp() {
                printf("base\n");
        }
};

class Derived : public Base {
public:
        void disp() {
                printf("derived\n");
        }
};

int main() {
        Derived *p_derived;
        Base *p_base;
        Derived x;
        cudaMallocManaged(&p_derived, sizeof(Derived));
        memcpy(p_derived, &x, sizeof(Derived));
        p_derived->disp();
        p_base = p_derived;
        p_base->disp();

        cudaFree(p_derived);

        return 0;
}

If I replace the line “Derived x;” with

Derived *p_x;
p_x = new Derived();

and the line “memcpy(p_derived, &x, sizeof(Derived));” with

memcpy(p_derived, p_x, sizeof(Derived));

respectively, the program also runs OK, even under cuda-memcheck. But apparently, here I “new” an object without ever “delete”-ing it. Moreover, if I do include

delete p_x;

later in the code, there is still no error when running with cuda-memcheck.

I came to notice this when working on my real code, which is too large to share on this forum. There, I experimented with “new”-ing a host object and using memcpy to initialize the variable in unified memory, and then deleted the newly created object in the hope of freeing up some memory. However, I got a segmentation fault. If I did not delete the object, the program worked fine.

I guess it might be because memcpy somehow connects some CPU memory with GPU memory. But that seems contrary to my understanding of “memory copy”, which should simply copy, not connect.

Can you please enlighten me as to what actually happens in the background by CUDA? Thank you.

I would expect these cases to be equivalent. You are indicating they behave in an equivalent fashion. I’m not sure what there is to explain.

This:

Derived x;

and this:

p_x = new Derived();

both call the constructor. The constructed objects are equivalent for the purpose of this discussion. Therefore the behavior using these constructed objects is the same.

The concepts we are discussing now are just C++ concepts. Nothing in this particular response of mine has anything to do with CUDA.

Hi, thank you for your quick reply. Sorry if I did not make my question clear. What confuses me is that, in theory, if one “new”s an object, then he/she must “delete” it explicitly, otherwise there will be a memory leak. However, my simple example above seems to show that the “delete” operation in a CUDA program is optional, since either way the program runs OK. Moreover, even cuda-memcheck does not report any error when there is only a “new” operation and no “delete” operation.

Moreover, when I converted some legacy CPU code to CUDA, I worked with a complex class A, and I initialized the variable in unified memory like this:

A *p_UnifiedMemory, *p_CPU;
cudaMallocManaged(&p_UnifiedMemory, sizeof(A));
p_CPU = new A();
memcpy(p_UnifiedMemory, p_CPU, sizeof(A));

However, after that, if I wanted to free up the CPU memory by:

delete p_CPU;

I would get a segmentation fault. So here, I am bewildered by two things:

  1. A “new” operation should come with a “delete” operation later, so there ought to be nothing wrong with “delete p_CPU”.

  2. The “memcpy()” function only makes an extra copy from the source variable to the destination variable, so after memcpy it should be OK to delete the source, and the destination should not be affected. Then where does the segfault come from? My guess is that the memcpy function in a CUDA program does NOT perform a complete data copy for the complex class A. Instead, it leaves some data in the source and some data in the destination inter-connected. However, I have not found such info in the CUDA documents.

You can step through your program using a debugger to find the error.

One problem could arise if A has pointer(s) to host memory. The pointers are being copied, but cannot be used in device code since they point to host memory.
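
For example (a hypothetical sketch, since I haven’t seen your class A), if A owns a host-allocated buffer, memcpy produces only a shallow copy:

// Hypothetical class, standing in for your actual A:
class A {
        int *data;                        // points to host memory
public:
        A()  { data = new int[100]; }     // constructor allocates on the host
        ~A() { delete [] data; }          // destructor frees that host allocation
};

// After
//     memcpy(p_UnifiedMemory, p_CPU, sizeof(A));
// both objects hold the SAME data pointer (a shallow copy):
// - device code cannot dereference it, because it points to host memory
// - delete p_CPU runs ~A(), freeing the buffer, which leaves
//   p_UnifiedMemory->data dangling; destroying *p_UnifiedMemory later
//   would free the same pointer a second time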

cuda-memcheck with its default tool doesn’t report any category of error involving a missing host-memory delete, or host memory leaks. (The --leak-check full option does report leaks in two categories, cudaMalloc and in-kernel use of malloc, but it doesn’t indicate anything about managed memory, i.e. cudaMallocManaged, or about ordinary host memory allocations via new or malloc, which is what is in view here.)

A program is not necessarily incorrect if a new operation does not have a corresponding delete.

If I intend to use an object for the entire life of the program, I need not delete it at any point, for correctness. There may be C++ style arguments that say you should delete it anyway right before exiting, but that is not required for correctness.

The behavior of various operating systems, including linux and windows, is that all ordinary allocations (malloc, new, std::vector, etc.) performed by a program (process) are released when the program (process) exits.

Again, this has nothing to do with CUDA, but CUDA behaves the same way, as it adheres to C++ and is running on either linux or windows.

For device allocations, CUDA also frees all device allocations when the host process owning those allocations exits.

Regarding the rest, I refer you to the comment above. I wouldn’t be able to debug an incomplete snippet of code for you. It would be important to know things like what the destructor for A looks like, as well as the scopes of variables. Basically: a short, complete code. Without that I often won’t even try.