Buggy (?) compiler behavior with polymorphism in GPU device code

Dear forum,

I am currently about to develop a larger library for multi-GPU clusters. In order to allow users to add custom functionality to the library, I started to use CUDA device-code polymorphism. By doing that, I ran into “Hardware Stack Overflow” issues. To be able to present my problem in this forum, I created a “minimalistic” example (sorry, it really is the smallest possible one, even though it does not look like it), which currently runs on “Piz Daint”. That is, I use a combination of the Cray C++ compiler and CUDA 8.0. Moreover, the GPU is a Tesla P100 and I compile the code for compute capability 6.0.

The situation is as follows:

I am well aware of the fact that you cannot pass objects of classes with virtual functions to a kernel. Therefore, the only approach currently available to me is to follow the example in

https://stackoverflow.com/questions/26812913/how-to-implement-device-side-cuda-virtual-functions

at the end of the post.

In the following, I present my own code, which is only a very slight modification of that example. I have five files (plus some headers and error-checking code which I do not include here):

main.cpp:

#include <cuda_runtime.h>
#include <stdlib.h>
        
#include "error_check.h"
#include "function_launcher.h"
#include "child.h"

int main(int argc, char* argv[])
{
        struct parent_class** c_d_p;
        cudaMalloc((void***)&c_d_p, sizeof(struct parent_class*));
 
        create_child_object(c_d_p);

        do_something(c_d_p);
        
        destroy_child_object(c_d_p);
        
        cudaFree(c_d_p);
}

It basically allocates the pointer, creates the object on the device, calls the virtual function and destroys the object on the device.

parent.h:

class parent_class
{
        public:
                virtual __device__ double doit() =0;
};

This is a pure virtual (abstract) parent class.

child.h:

#include "parent.h"

class child_class : public parent_class
{               
        public:

                __device__ double doit()
                {
                        return 1.0;
                }

};

extern void create_child_object(struct parent_class** c);

extern void destroy_child_object(struct parent_class** c);

This is the child class implementing “doit”, plus declarations for the allocation and destruction of the device object.

child.cu:

#include <stdio.h>
#include "error_check.h"

#include "parent.h"
#include "function_launcher.h"
#include "child.h"

__global__ void create_child_object_kernel(struct parent_class** c)
{
        (*c) = new child_class();
}

void create_child_object(struct parent_class** c)
{
        create_child_object_kernel<<<1,1>>>(c);
        cudaDeviceSynchronize();
        checkCUDAError("create_child_class_object");
}

__global__ void destroy_child_object_kernel(struct parent_class** c)
{
        delete *c;
}

void destroy_child_object(struct parent_class** c)
{
        destroy_child_object_kernel<<<1,1>>>(c);
        cudaDeviceSynchronize();
        checkCUDAError("destroy_child_class_object");
}

This is the implementation of the construction / destruction code for the device object.

function_launcher.cu:

#include <stdio.h>
#include "error_check.h"
#include "function_launcher.h"
#include "child.h"

__global__ void do_something_kernel(parent_class** p)
{
        double r= (*p)->doit();
}

void do_something(parent_class** p)
{
        do_something_kernel<<<1, 1>>>(p);
        cudaDeviceSynchronize();
        checkCUDAError("do_something_kernel");

}

// __global__ void create_child_object_kernel(struct parent_class** c)
// {       
//         (*c) = new child_class();
// }
//         
// void create_child_object(struct parent_class** c)
// {       
//         create_child_object_kernel<<<1,1>>>(c);
//         cudaThreadSynchronize();
//         checkCUDAError("create_child_object");
// }

This function is intended to be part of a generic binary library. Therefore, it is placed in a separate file. (Please ignore the commented-out code for now…)

If I compile the above code (plus some header stuff), it compiles perfectly. However, when launching it, I get a hardware stack problem. cuda-memcheck, for example, says:

========= Hardware Stack Overflow
=========     at 0x002a1450 in do_something_kernel(parent_class**)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:do_something_kernel(parent_class**) (do_something_kernel(parent_class**) : 0x90)
=========     Device Frame:do_something_kernel(parent_class**) (do_something_kernel(parent_class**) : 0x90)

I hunted it down to the point that the device code apparently wants to launch the “doit” function, but then jumps to some undefined place. (At an earlier point in my tests I also got an “Illegal Instruction” error when compiling with “-g”. However, this one disappeared while I was reducing my code to the “minimalistic” example above.)

However, if I move the “create_child_object” function plus the corresponding kernel into function_launcher.cu (i.e. commenting it in there and commenting it out in the other file), then everything works perfectly fine.

My point is that the compiler only seems to handle the polymorphism correctly if the “new child_class()” is in the same source file (translation unit) as the call to the virtual function. However, this does not seem to be a documented compiler limitation.

Is this a bug or am I doing something wrong?

It would be great if someone could give me a hint on this very, very technical question.
(… and thanks already for reading up to this point…)

Best regards,

Peter

Are you compiling and linking with relocatable device code enabled?
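For reference, with plain nvcc the build might look roughly like the following sketch (filenames taken from the post; the exact invocation on Piz Daint through the Cray compiler wrapper will differ, so treat the precise command lines as an assumption):

```shell
# Compile each .cu file with relocatable device code enabled (-rdc=true),
# so device symbols (here: the child_class vtable and its virtual doit())
# can be resolved across translation units at device-link time.
nvcc -rdc=true -arch=sm_60 -c child.cu -o child.o
nvcc -rdc=true -arch=sm_60 -c function_launcher.cu -o function_launcher.o

# The host-only file needs no device code settings.
nvcc -c main.cpp -o main.o

# When nvcc drives the final link and -rdc was used, it performs the
# device-link step (nvlink) automatically before the host link.
nvcc -arch=sm_60 child.o function_launcher.o main.o -o app
```

Without -rdc, each .cu file is compiled under the whole-program assumption, so a virtual call in one translation unit cannot reach a vtable defined in another.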

You just fixed my problem!

I wasn’t aware of this relocation business (even though I should have been…).

Many, many thanks!!!

Best Peter