Clean way of copying a struct with pointers to the GPU

I have a simple struct that contains other structs that contain pointers to some data (note that I’m only including the relevant stuff):

template<class T>
struct MyArray {
    T* data;
    int elementCount;
};

struct Wrapper {
    MyArray<float> arrayA;
    MyArray<float> arrayB;
};

First I initialize my Wrapper struct (host) and then copy it over to my device:

Wrapper wrapperHost;
// initialize the arrays (fill data arrays with '5.0f') and other related stuff
COMPUTE_SAFE(cudaMemcpyToSymbol(wrapperDevice, &wrapperHost, sizeof(Wrapper))); // Works fine

Device-side code looks like this:

__device__ Wrapper wrapperDevice; 

__global__ void TestKernel(){
    printf("%.2f", wrapperDevice.arrayA.data[0]); // Target result: 5.00
}

When I run my kernel I get a memory access violation error, which is expected since the data pointers still point to host memory (I think). So my question is: how can I copy the Wrapper struct, together with its arrays, over to the device?

Thanks in advance! Also note that this was written in-browser, so there’s a good chance there are typos/minor errors in the code examples.

This is a frequently asked question. It involves a “deep copy”.

There are at least dozens of questions like it already on various forums.

Here is one example.

Here is one possible approach using your code:

$ cat t2102.cu
#include <cstdio>
#define COMPUTE_SAFE(x) x
template<class T>
struct MyArray {
    T* data;
    int elementCount;
};

struct Wrapper {
    MyArray<float> arrayA;
    MyArray<float> arrayB;
};

__device__ Wrapper wrapperDevice;

__global__ void TestKernel(){
    printf("%.2f\n", wrapperDevice.arrayA.data[0]); // Target result: 5.00
}

int main(){

  Wrapper wrapperHost;
  // initialize the arrays (fill data arrays with '5.0f') and other related stuff
  const int ds = 1;
  wrapperHost.arrayA.data = new float[ds];
  wrapperHost.arrayA.data[0] = 5.0f;
  wrapperHost.arrayA.elementCount = ds;
  // build a host-side "deep copy" struct whose pointers will refer to device allocations
  Wrapper wrapperDeep;
  // allocate device storage for arrayA's data and copy the host data into it
  cudaMalloc(&(wrapperDeep.arrayA.data), wrapperHost.arrayA.elementCount*sizeof(wrapperHost.arrayA.data[0]));
  wrapperDeep.arrayA.elementCount = wrapperHost.arrayA.elementCount;
  cudaMemcpy(wrapperDeep.arrayA.data, wrapperHost.arrayA.data, wrapperHost.arrayA.elementCount*sizeof(wrapperHost.arrayA.data[0]), cudaMemcpyHostToDevice);
  // copy the struct (now holding device pointers) to the __device__ symbol
  COMPUTE_SAFE(cudaMemcpyToSymbol(wrapperDevice, &wrapperDeep, sizeof(Wrapper)));
  TestKernel<<<1,1>>>();
  cudaDeviceSynchronize();
}

$ nvcc -o t2102 t2102.cu
$ compute-sanitizer ./t2102
========= COMPUTE-SANITIZER
5.00
========= ERROR SUMMARY: 0 errors
$

It should be evident that I’ve only done the “deep copy work” for arrayA. You would need to handle it similarly for arrayB.
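For reference, the arrayB part would follow the exact same pattern (a sketch, assuming wrapperHost.arrayB has been initialized like arrayA):

  // same deep-copy steps for arrayB (sketch; assumes arrayB was initialized like arrayA)
  cudaMalloc(&(wrapperDeep.arrayB.data), wrapperHost.arrayB.elementCount*sizeof(wrapperHost.arrayB.data[0]));
  wrapperDeep.arrayB.elementCount = wrapperHost.arrayB.elementCount;
  cudaMemcpy(wrapperDeep.arrayB.data, wrapperHost.arrayB.data, wrapperHost.arrayB.elementCount*sizeof(wrapperHost.arrayB.data[0]), cudaMemcpyHostToDevice);
  // do the single cudaMemcpyToSymbol only after both arrays have been set up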


Perfect, this is what I was looking for! I guess I just searched for the wrong terms :D. Thanks!

An alternative that does not require deep copies is to use Managed memory for any data structure you need to access on the GPU. Then the CPU and GPU share the same address space, and the driver handles copying of data chunks between host and device on an as-needed basis. Deep copying won’t be needed as pointers remain valid.

This solution may require the use of custom allocators for non-POD objects.
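For example, a minimal sketch of how that could look with the structs and kernel from the post above (an illustration, not exact code; arrayB would be handled the same way):

  Wrapper wrapperHost;
  const int ds = 1;
  // managed allocation: the same pointer is valid on both host and device
  cudaMallocManaged(&(wrapperHost.arrayA.data), ds * sizeof(float));
  wrapperHost.arrayA.elementCount = ds;
  wrapperHost.arrayA.data[0] = 5.0f;   // filled directly from the host
  // no deep copy of the data is needed; copying the struct to the symbol is enough
  cudaMemcpyToSymbol(wrapperDevice, &wrapperHost, sizeof(Wrapper));
  TestKernel<<<1,1>>>();
  cudaDeviceSynchronize();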

Christian

Thanks Christian, I’ll check it out!

hey, I know this isn’t really related to the question, but would you mind showing me a way of copying the data back from the symbol to host? I’ve been trying to do it myself, but I’m having trouble getting it working.

This is one of my attempts at making it work (I’m also copying only 1 array for now):

Wrapper out;
cudaMemcpy(&out.arrayA.data, deep.arrayA.data, host.arrayA.elementCount * sizeof(host.arrayA.data[0]), cudaMemcpyDeviceToHost);
COMPUTE_SAFE(cudaMemcpyFromSymbol(&out, wrapperDevice, sizeof(void*), 0, cudaMemcpyDeviceToHost));
printf("%.2f, out.arrayA.data[0])

But this causes an Access violation reading location exception.
Thanks in advance.

Wrapper out;

does provide storage for pointers, but doesn’t provide any allocation for them. There is no allocated space to copy an array of data to. Furthermore:

cudaMemcpy(&out.arrayA.data,

that doesn’t make any sense. You’re attempting to copy an array of data to the address where an (unallocated) pointer is stored.

First provide a proper host allocation/storage area for the output data, then copy the device data into it:

Wrapper out;
out.arrayA.elementCount = host.arrayA.elementCount;
out.arrayA.data = new float[out.arrayA.elementCount];   // host storage to receive the results
cudaMemcpy(out.arrayA.data, deep.arrayA.data, out.arrayA.elementCount * sizeof(out.arrayA.data[0]), cudaMemcpyDeviceToHost);
for (int i = 0; i < out.arrayA.elementCount; i++) std::cout << out.arrayA.data[i] << std::endl;
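If you no longer had the deep struct available on the host, you could also fetch the device pointers back from the symbol first and then copy through them, e.g. (a sketch; deepFromSymbol is just an illustrative name):

Wrapper deepFromSymbol;   // will receive the device pointers stored in the symbol
cudaMemcpyFromSymbol(&deepFromSymbol, wrapperDevice, sizeof(Wrapper));
cudaMemcpy(out.arrayA.data, deepFromSymbol.arrayA.data, out.arrayA.elementCount * sizeof(float), cudaMemcpyDeviceToHost);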

I don’t know how I missed that. Thanks a bunch.
