How to move a pointer created in a kernel using "new" to the host? (for polymorphism)

Hi all,

I understand that polymorphism works in CUDA as long as objects with virtual methods are created with the “new” keyword while running on the device. However, I have had trouble figuring out a clean way to get pointers produced by “new” on the device back to the host, so that I can copy data to them directly with cudaMemcpy.

As an example, I’ve tried to allocate an array of three doubles in a kernel through a very simple pointer wrapper struct (which hides the somewhat confusing pointer-to-pointer notation). I then copy that struct back to the host to learn the address it holds (which was set by a call to “new” on the device), so that I can cudaMemcpy data directly into that length-3 array.

Here is my attempt to do that:

#include <cstdio>
#include <vector>

// Small wrapper for a pointer to reduce level of confusion of
// using pointers to pointers
template<typename T>
struct PtrWrapper
{
  T* ptr;
  __device__ void alloc(size_t howmany);
  __device__ void free();
};
template<typename T>
__device__ void PtrWrapper<T>::alloc(size_t howmany) {
  ptr = new T[howmany];
}
template<typename T>
__device__ void PtrWrapper<T>::free() {
  delete[] ptr;
}

template<typename T>
__global__ void runAlloc(PtrWrapper<T>* wrap, size_t howmany) {
  wrap->alloc(howmany);
}

template<typename T>
__global__ void runFree(PtrWrapper<T>* wrap) {
  wrap->free();
}

template<typename T>
__global__ void printStuff(PtrWrapper<T>* wrap) {
  int tid = threadIdx.x;
  if (tid < 3)
  {
    printf("%f\n",wrap->ptr[tid]);
  }
}

int main() {
  // Host data to print from GPU as a test...
  std::vector<double> asdf({1.0, 2.0, 3.0});

  // Create pointer wrapper on device, and have it allocate
  // an array of doubles of length 3
  PtrWrapper<double>* wrap;
  cudaMalloc(&wrap, sizeof(PtrWrapper<double>));
  runAlloc<<<1,1>>>(wrap, asdf.size());

  // Copy pointer wrapper back to host so the location of the array it
  // manages is known, thus allowing cudaMemcpy to its ptr.
  PtrWrapper<double> h;
  cudaMemcpy(&h, wrap, sizeof(PtrWrapper<double>), cudaMemcpyDeviceToHost);

  // Copy host array to device
  cudaMemcpy(h.ptr, asdf.data(), sizeof(double) * asdf.size(), cudaMemcpyHostToDevice);
  cudaDeviceSynchronize();

  // Print out data on device
  printStuff<<<1,32>>>(wrap);

  // Finish up
  cudaDeviceSynchronize();
  runFree<<<1,1>>>(wrap);
  cudaFree(wrap);
}

I expect 1, 2, and 3 to be printed, but instead I see 0, 0, 0. I have no clue why this is happening, since I thought the data had been copied into the array I allocated.

Is anyone able to explain this behavior, or, alternatively, suggest a better method for moving pointers defined using “new” in a kernel back to the host?

Lastly, yes, I do understand that polymorphism may give pretty bad performance, at least according to the wiki post from Kitware below. I’m just trying to establish some baseline CUDA port performance for a code system that leverages polymorphism quite heavily.

http://m.vtk.org/index.php/Virtual_Methods_in_the_Execution_Environment#Virtual_Methods_in_CUDA

First of all, I don’t see any virtual methods in your code. So, if it were me, I would not use the word polymorphism. If you actually had virtual methods, and were concerned about the associated function pointer table, it would not be a trivial matter to modify that function table upon passing an object across the device/host boundary.

Instead, your question seems to just revolve around “ordinary” pointers contained in objects.

If you used proper CUDA error checking, and/or ran your code with cuda-memcheck, you would get a report something like this:

Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemcpy.

We now have a clue. One of the pointers is invalid. The invalid pointer is the device pointer. Device pointers created using in-kernel new or malloc (or in-kernel cudaMalloc()) may not participate in host CUDA API calls:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interoperability-host-memory-api
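
For reference, here is a minimal error-checking sketch (the gpuErrchk name is just a common convention, not anything your code must use) that would have surfaced the failure at the offending cudaMemcpy call:

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; report and abort on failure.
#define gpuErrchk(call)                                          \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
              cudaGetErrorString(err_), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// Usage on the failing copy from your code:
//   gpuErrchk(cudaMemcpy(h.ptr, asdf.data(),
//                        sizeof(double) * asdf.size(),
//                        cudaMemcpyHostToDevice));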

There is no trivial way to resolve this. It’s rather difficult to copy data from a pointer created by in-kernel new or malloc to the host. While this isn’t precisely the case you have, it may give some indication of the hoops you have to jump through:

https://stackoverflow.com/questions/59127547/cuda-dynamically-reallocate-more-global-memory-in-kernel/59127868#59127868
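
To sketch the flavor of that workaround (my paraphrase, not the linked code verbatim): a device-heap pointer is perfectly usable inside kernels, so you can launch a kernel that copies the data into a staging buffer allocated with host-side cudaMalloc, then cudaMemcpy from the staging buffer:

// Copy from the device-heap allocation into a host-API-visible buffer.
__global__ void stage(double* staging, PtrWrapper<double>* wrap, size_t n) {
  for (size_t i = 0; i < n; ++i)
    staging[i] = wrap->ptr[i];
}

// Host side:
//   double* staging;
//   cudaMalloc(&staging, n * sizeof(double));  // host-API-usable pointer
//   stage<<<1,1>>>(staging, wrap, n);
//   cudaMemcpy(host_buf, staging, n * sizeof(double),
//              cudaMemcpyDeviceToHost);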

You can certainly do polymorphism in CUDA device code without requiring the use of in-kernel new or its equivalents.

https://stackoverflow.com/questions/22988244/polymorphism-and-derived-classes-in-cuda-cuda-thrust/23476510#23476510

Hi Robert,

So, let me clarify. My intention here was to test an example of allocating objects using “new” on the device, not necessarily having those objects use polymorphism. I thought this would be a good thing to try, since polymorphic device objects only work if they were allocated on the device (if I understand that correctly?). So, when using polymorphism in practice, I would take an approach similar to this, where PtrWrapper would instead be something along the lines of a factory class that creates new instances of polymorphic objects on the device and returns pointers to them.

Either way, the advice you provided completely answers my question. Thanks! I was unaware that pointers made with device-side malloc don’t make sense on the host. The examples you provided are exactly what I needed. I just wasn’t using the right Google keywords :D

I believe it might be clearer to say the object needs to be constructed on the device. The point of this distinction may become evident in a moment.

No doubt that is workable, and probably mimics a design pattern you are familiar with in host code. What’s not workable (easily) is copying data from such objects directly to host memory.

An alternative that I tried to indicate in my last linked article above is to pre-allocate space for the objects you will need using a host-side API (e.g. cudaMalloc in host code), then construct the objects on the device. Looking at it again, that may not be obvious from the link. The thrust::device_vector<Rectangle> foo(N) call first allocates space (using cudaMalloc under the hood, in host code) for a vector of N Rectangle objects. The same call then launches a CUDA kernel, again under the hood, which runs the Rectangle constructor on each of those objects.
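
A minimal sketch of that pattern (assuming a Rectangle with a __host__ __device__ default constructor; see the linked answer for the full polymorphic version):

#include <thrust/device_vector.h>
#include <cstdio>

struct Rectangle {
  float w, h;
  __host__ __device__ Rectangle() : w(1.0f), h(2.0f) {}
  __host__ __device__ float area() const { return w * h; }
};

__global__ void print_first_area(const Rectangle* r) {
  printf("area = %f\n", r->area());
}

int main() {
  // Allocation (cudaMalloc) and element construction both happen
  // under the hood of this one call.
  thrust::device_vector<Rectangle> foo(4);
  print_first_area<<<1,1>>>(thrust::raw_pointer_cast(foo.data()));
  cudaDeviceSynchronize();
}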

I acknowledge this has a variety of limitations and probably doesn’t fit your desired design pattern. I’m pointing it out anyway so future readers have some options to consider.

This might also be one of those odd situations where placement new might have an interesting use-case.

OK, thank you for clarifying your example there.

What I’m really after, at the end of the day, is heterogeneous polymorphism: something like a thrust::device_vector<Polygon*>. Can you think of a good way to have a factory push objects back into such a container? My thought was to write kernels called make_triangle and make_rectangle that take a pointer to a pointer to a Polygon, construct the triangle/rectangle, and return that pointer. But then constructing the polygon with “new” won’t be communicable back to the host, due to the memory incompatibility you mentioned.

So, I suppose this leaves only the option of pre-allocating buffers for both Rectangles and Triangles on the device from the host, using placement new to construct the Rectangles and Triangles on the device in the correct buffer, and finally returning pointers to the newly constructed objects to the host. From those, a heterogeneous list of pointers could be built. Maybe? Yet to test it… I can’t imagine any other way to achieve heterogeneous polymorphism.

So, yes, placement new is the key to achieving this simply, IMO. Posting an MWE in case anybody else gets stuck on this.

#include <cstdio>
#include <new>  // placement new

//-------------------------------------------------
// These three classes are textbook polymorphism.
//-------------------------------------------------

class Base
{
  public:
    __device__ virtual void doThing() = 0;
};

class Derived1 : public Base
{
  double firstvalue;
  double secondvalue;
public:
  __device__ Derived1(double x, double y) : firstvalue(x), secondvalue(y) {}
  __device__ virtual void doThing() override {
    printf("Greetings! I am Derived1, and I contain the values\n"
        " %f and %f.\n", firstvalue, secondvalue);
  }
};

class Derived2 : public Base
{
  char firstchar;
  char secondchar;
public:
  __device__ Derived2(char x, char y) : firstchar(x), secondchar(y) {}
  __device__ virtual void doThing() override {
    printf("Waddup! I am Derived2, with the two characters\n"
        " %c and %c.\n", firstchar, secondchar);
  }
};

//---------------------------------------------------------
// This kernel will (hopefully) call a polymorphic method.
//---------------------------------------------------------

__global__ void call_virtual_method(Base* b) {
  b->doThing();
}

// Construct objects in pre-allocated device memory with placement new,
// so each object's vtable pointer refers to the device-side table.
__global__ void call_der1_constructor(Derived1* d, double x, double y) {
  new(d) Derived1(x, y);
}
__global__ void call_der2_constructor(Derived2* d, char x, char y) {
  new(d) Derived2(x, y);
}

int main() {

  // These will be pointers to objects on the device.
  Derived1* der1;
  Derived2* der2;

  cudaMalloc(&der1, sizeof(Derived1));
  cudaMalloc(&der2, sizeof(Derived2));

  call_der1_constructor<<<1,1>>>(der1, 3.0, 5.0);
  call_der2_constructor<<<1,1>>>(der2, 'c', 'g');

  // Upcasting on the host only changes the static type; these are
  // still ordinary device addresses.
  Base* b1 = der1;
  Base* b2 = der2;

  call_virtual_method<<<1,1>>>(b1);
  call_virtual_method<<<1,1>>>(b2);

  // Make sure the kernels (and their printf output) complete before exit.
  cudaDeviceSynchronize();

  cudaFree(der1);
  cudaFree(der2);
}
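
For anyone who lands here later: extending the MWE into the heterogeneous list discussed above is straightforward, since the upcast pointers are ordinary device addresses that the host can collect itself. A sketch, reusing the classes and constructor kernels from the MWE (call_all is just an illustrative name):

// Walk a heterogeneous device-side list of Base* with virtual dispatch.
__global__ void call_all(Base* const* list, size_t n) {
  for (size_t i = 0; i < n; ++i)
    list[i]->doThing();
}

// In main(), after the constructor kernels have run:
//   Base* h_list[2] = { der1, der2 };   // host-side upcasts
//   Base** d_list;
//   cudaMalloc(&d_list, sizeof(h_list));
//   cudaMemcpy(d_list, h_list, sizeof(h_list), cudaMemcpyHostToDevice);
//   call_all<<<1,1>>>(d_list, 2);
//   cudaDeviceSynchronize();
//   cudaFree(d_list);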