Hi all,
I understand that polymorphism works in CUDA as long as objects with virtual methods are constructed on the device, e.g. with “new” inside a kernel. What I’ve had trouble with is finding a clean way to get a pointer produced by a device-side “new” back to the host, so that I can copy data to it directly with cudaMemcpy.
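To be concrete about what I mean, here is the kind of device-side construction I have in mind (just a sketch; Base, Derived, and makeObject are placeholder names, not code from my actual project):

struct Base {
    __device__ virtual double value() const { return 0.0; }
    __device__ virtual ~Base() {}
};

struct Derived : public Base {
    __device__ double value() const override { return 1.0; }
};

__global__ void makeObject(Base** out) {
    // Construct the object on the device so its vtable is usable in device code
    *out = new Derived();
}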
As an example, I’ve tried to allocate an array of three doubles in a kernel using a very simple pointer-wrapper struct (to hide the somewhat confusing pointer-to-pointer notation), copy that struct back to the host to recover the pointer it holds (which was set by a device-side “new”), and then cudaMemcpy data directly into that length-3 array.
Here is my attempt to do that:
#include <cstdio>
#include <vector>

// Small wrapper for a pointer, to reduce the confusion of
// working with pointers to pointers
template<typename T>
struct PtrWrapper
{
    T* ptr;
    __device__ void alloc(size_t howmany);
    __device__ void free();
};

template<typename T>
__device__ void PtrWrapper<T>::alloc(size_t howmany) {
    ptr = new T[howmany];
}

template<typename T>
__device__ void PtrWrapper<T>::free() {
    delete[] ptr;
}

template<typename T>
__global__ void runAlloc(PtrWrapper<T>* wrap, size_t howmany) {
    wrap->alloc(howmany);
}

template<typename T>
__global__ void runFree(PtrWrapper<T>* wrap) {
    wrap->free();
}

template<typename T>
__global__ void printStuff(PtrWrapper<T>* wrap) {
    int tid = threadIdx.x;
    if (tid < 3)
    {
        printf("%f\n", wrap->ptr[tid]);
    }
}

int main() {
    // Host data to print from the GPU as a test...
    std::vector<double> asdf({1.0, 2.0, 3.0});

    // Create the pointer wrapper on the device, and have it allocate
    // an array of doubles of length 3
    PtrWrapper<double>* wrap;
    cudaMalloc(&wrap, sizeof(PtrWrapper<double>));
    runAlloc<<<1,1>>>(wrap, asdf.size());

    // Copy the pointer wrapper back to the host so the location of the
    // array it manages is known, thus allowing a cudaMemcpy to its ptr.
    PtrWrapper<double> h;
    cudaMemcpy(&h, wrap, sizeof(PtrWrapper<double>), cudaMemcpyDeviceToHost);

    // Copy the host array to the device
    cudaMemcpy(h.ptr, asdf.data(), sizeof(double) * asdf.size(), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    // Print out the data on the device
    printStuff<<<1,32>>>(wrap);

    // Finish up
    cudaDeviceSynchronize();
    runFree<<<1,1>>>(wrap);
    cudaFree(wrap);
}
I expect 1, 2, 3 to be printed, but instead I see 0, 0, 0. I don’t understand why, because I thought the data would have been copied into the array I allocated on the device.
Is anyone able to explain this behavior, or, alternatively, suggest a better method for moving pointers defined using “new” in a kernel back to the host?
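In case it helps clarify what I’m after: the only workaround I can think of is to cudaMalloc the data buffer from the host and copy that pointer into the wrapper, reusing the variables from the example above, roughly like this (untested sketch, which rather defeats the point of device-side “new”):

// Allocate the buffer from the host instead of inside the kernel
PtrWrapper<double> h;
cudaMalloc(&h.ptr, sizeof(double) * asdf.size());
// Store the host-allocated device pointer inside the device-side wrapper
cudaMemcpy(wrap, &h, sizeof(PtrWrapper<double>), cudaMemcpyHostToDevice);
// Now the host knows the buffer location and can copy data into it
cudaMemcpy(h.ptr, asdf.data(), sizeof(double) * asdf.size(), cudaMemcpyHostToDevice);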
Lastly, yes, I do understand that polymorphism can hurt performance quite badly, at least according to the Kitware wiki post below. I’m just trying to establish a baseline for a CUDA port of a code base that leverages polymorphism heavily.
http://m.vtk.org/index.php/Virtual_Methods_in_the_Execution_Environment#Virtual_Methods_in_CUDA