Buggy (?) compiler behavior with polymorphism in GPU device code

Dear forum,

I am currently about to develop a larger library for multi-GPU clusters. In order to allow users to add custom functionality to the library, I started to use CUDA device-code polymorphism. By doing that, I ran into “Hardware Stack Overflow” issues. To be able to present my problem in this forum, I created a “minimalistic” example (sorry, it really is the smallest possible one, even though it does not look like it), which currently runs on “Piz Daint”. That is, I use a combination of the Cray C++ compiler and CUDA 8.0. Moreover, the GPU is a Tesla P100 and I compile the code for compute capability 6.0.

The situation is as follows:

I am well aware of the fact that you cannot pass objects of classes with virtual functions to a kernel. Therefore, the only approach currently available to me is to follow the example in

https://stackoverflow.com/questions/26812913/how-to-implement-device-side-cuda-virtual-functions

at the end of the post.

In the following, I present my own code, which is only a very slight modification of that example. I have five files (plus some headers and error-checking code which I do not include here):

main.cpp:

#include <cuda_runtime.h>
#include <stdlib.h>
        
#include "error_check.h"
#include "function_launcher.h"
#include "child.h"

int main(int argc, char* argv[])
{
        struct parent_class** c_d_p;
        cudaMalloc((void***)&c_d_p, sizeof(struct parent_class*));
 
        create_child_object(c_d_p);

        do_something(c_d_p);
        
        destroy_child_object(c_d_p);
        
        cudaFree(c_d_p);
}

It basically allocates the pointer, creates the object on the device, calls the virtual function and destroys the object on the device.

parent.h:

class parent_class
{
        public:
                virtual __device__ double doit() =0;
};

This is a pure virtual (abstract) parent class.

child.h:

#include "parent.h"

class child_class : public parent_class
{               
        public:

                __device__ double doit()
                {
                        return 1.0;
                }

};

extern void create_child_object(struct parent_class** c);

extern void destroy_child_object(struct parent_class** c);

This is the child class implementing “doit”, plus declarations for the allocation and destruction of the device object.

child.cu:

#include <stdio.h>
#include "error_check.h"

#include "parent.h"
#include "function_launcher.h"
#include "child.h"

__global__ void create_child_object_kernel(struct parent_class** c)
{
        (*c) = new child_class();
}

void create_child_object(struct parent_class** c)
{
        create_child_object_kernel<<<1,1>>>(c);
        cudaDeviceSynchronize();
        checkCUDAError("create_child_class_object");
}

__global__ void destroy_child_object_kernel(struct parent_class** c)
{
        delete *c;
}

void destroy_child_object(struct parent_class** c)
{
        destroy_child_object_kernel<<<1,1>>>(c);
        cudaDeviceSynchronize();
        checkCUDAError("destroy_child_class_object");
}

This is the implementation of the construction / destruction code for the device object.

function_launcher.cu:

#include <stdio.h>
#include "error_check.h"
#include "function_launcher.h"
#include "child.h"

__global__ void do_something_kernel(parent_class** p)
{
        double r= (*p)->doit();
}

void do_something(parent_class** p)
{
        do_something_kernel<<<1, 1>>>(p);
        cudaDeviceSynchronize();
        checkCUDAError("do_something_kernel");

}

// __global__ void create_child_object_kernel(struct parent_class** c)
// {       
//         (*c) = new child_class();
// }
//         
// void create_child_object(struct parent_class** c)
// {       
//         create_child_object_kernel<<<1,1>>>(c);
//         cudaThreadSynchronize();
//         checkCUDAError("create_child_object");
// }

This function is intended to be part of a generic binary library. Therefore, it is placed in a separate file. (Please ignore the commented-out code for now…)

If I compile the above code (plus some header stuff), it compiles perfectly. However, when launching it, I get a hardware stack problem. cuda-memcheck, for example, says:

========= Hardware Stack Overflow
=========     at 0x002a1450 in do_something_kernel(parent_class**)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:do_something_kernel(parent_class**) (do_something_kernel(parent_class**) : 0x90)
=========     Device Frame:do_something_kernel(parent_class**) (do_something_kernel(parent_class**) : 0x90)

I hunted it down to the point that the device code apparently wants to launch the “doit” function, but then jumps to some undefined place. (At an earlier point in my tests I also got an “Illegal Instruction” error when compiling with “-g”. However, this one disappeared while I was reducing my code to the “minimalistic” example above.)

However, if I move the “create_child_object” function plus the corresponding kernel into function_launcher.cu (i.e. commenting it in there and commenting it out in the other file), then everything works perfectly fine.

My point is that the compiler only seems to handle the polymorphism correctly if the “new child_class()” is in the same source file (translation unit) as the call to the virtual function. However, this does not seem to be a documented compiler limitation.

Is this a bug or am I doing something wrong?

It would be great if someone could give me a hint on this very, very technical question.
(… and thanks already for reading up to this point…)

Best regards,

Peter

Are you compiling and linking with relocatable device code enabled?
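For reference, with plain nvcc the build might look roughly like the following sketch (filenames taken from the post; the exact invocation on Piz Daint through the Cray compiler wrapper will differ, so treat the precise command lines as an assumption):

```shell
# Compile each .cu file with relocatable device code enabled (-rdc=true),
# so device symbols (here: the child_class vtable and its virtual doit())
# can be resolved across translation units at device-link time.
nvcc -rdc=true -arch=sm_60 -c child.cu -o child.o
nvcc -rdc=true -arch=sm_60 -c function_launcher.cu -o function_launcher.o

# The host-only file needs no device code settings.
nvcc -c main.cpp -o main.o

# When nvcc drives the final link and -rdc was used, it performs the
# device-link step (nvlink) automatically before the host link.
nvcc -arch=sm_60 child.o function_launcher.o main.o -o app
```

Without -rdc, each .cu file is compiled under the whole-program assumption, so a virtual call in one translation unit cannot reach a vtable defined in another.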

You just fixed my problem!

I wasn’t aware of this relocation business (even though I should have been…).

Many, many thanks!!!

Best Peter