Deep copy complex object from device to host

michelettiluka · October 28, 2017, 3:09pm

Hello, I’m new to CUDA programming and I’m trying to figure out how to copy a complex object from the device to the host.
I have an object CUDAPopulation that needs to do some computation on the device.

CUDAPopulation.h:

class CUDAPopulation {
public:
    ...
    CUDAGenome **individuals;

    CUDAPopulation(unsigned int popSize, unsigned int genNum, Objective obj = MAXIMIZE);
    ...

private:
    // Performs fitness-proportionate selection to get an individual from the population.
    __device__ CUDAGenome *select();
    // Scales the individuals' scores to fitnesses.
    __device__ void scale();
    // Implements ascending odd-even transposition sort on the individuals of the population.
    __device__ void sort();
};

// Perform an avaluation on the elements of the given population.
__global__ void evaluate(CUDAPopulation *pop);

// Perform an evolution step (selection, crossover, mutation, replacement) on the given population.
__global__ void step(CUDAPopulation *pop);

I allocate the object on the device like so:

...
    // Create the population object on the host.
    CUDAPopulation *population = new CUDAPopulation(popSize, genNumber, CUDAPopulation::MINIMIZE);

    // Allocate the population on the device.
    CUDAPopulation *d_pop;
    cudaMalloc(&d_pop, sizeof(CUDAPopulation));
    cudaMemcpy(d_pop, population, sizeof(CUDAPopulation), cudaMemcpyHostToDevice);
    ...

Then I dinamically allocate individuals calling a kernel that helps me implement polymorphism in CUDA code and run the computation:

dim3 gridSize(popSize);
    dim3 blockSize(checksNumber);
    createCUDAPathGenomePopulation<<<gridSize, 1>>>(d_pop, d_checks, checksNumber);

    for (unsigned int i = 0; i < genNumber; i++) {
        evaluate<<<gridSize, blockSize>>>(d_pop);
        cudaDeviceSynchronize();
        step<<<gridSize, blockSize>>>(d_pop);
        cudaDeviceSynchronize();
    }

createCUDAPathGenomePopulation just initializes the individuals’ field in CUDAPopulation with a subclass of CUDAGenome called CUDAPathGenome.
Everything is ok so far, the problem comes when I try to copy the CUDAPathGenomes to the host after the computation to check the results:

...
    // Copy the population object.
    CUDAPopulation *pop = (CUDAPopulation *) malloc(sizeof(CUDAPopulation));
    cudaMemcpy(pop, d_pop, sizeof(CUDAPopulation), cudaMemcpyDeviceToHost);
    cudaCheckError();

    // Copy the address of the populations' individuals.
    CUDAGenome **tmpIndividuals = (CUDAGenome **) malloc(popSize * sizeof(CUDAGenome *));
    cudaMemcpy(tmpIndividuals, pop->individuals, popSize * sizeof(CUDAGenome *), cudaMemcpyDeviceToHost);
    cudaCheckError();

    // Copy the population's individuals.
    CUDAPathGenome **individuals = (CUDAPathGenome **) malloc(popSize * sizeof(CUDAPathGenome *));
    for (unsigned int i = 0; i < popSize; i++) {
        individuals[i] = (CUDAPathGenome *) malloc(sizeof(CUDAPathGenome));
        cudaMemcpy(individuals[i], tmpIndividuals[i], sizeof(CUDAPathGenome), cudaMemcpyDeviceToHost);
        cudaCheckError();
    }
    ...

CUDAPathGenome internally has some dynamically allocated arrays that get copied later.
When I call “cudaMemcpy(individuals[i], tmpIndividuals[i], sizeof(CUDAPathGenome), cudaMemcpyDeviceToHost);”, on line 16, I get an “invalid argument” error and the program stops (that’s what cudaCheckError() does).
I can’t figure out why I get the error and what I am doing wrong.
I’m using CUDA 9.0 on a 920m GPU (3.5 compute capability).
Thanks

Robert_Crovella · October 28, 2017, 3:25pm

pointers to regions dynamically allocated on the device (i.e. allocated in device code using e.g. new or malloc) cannot participate in host-issued cudaMemcpy operations.

You haven’t shown a complete case for inspection, but I’m guessing based on your description you have some in-kernel new or malloc operations going on in createCUDAPathGenomePopulation to create the individuals. Pointers to those individuals obtained in that fashion cannot be used in a host-issued cudaMemcpy operation.

[url]Programming Guide :: CUDA Toolkit Documentation

If you attempted to do so, “invalid argument” is the error I would expect.

michelettiluka · October 28, 2017, 3:39pm

Yes, it’s exactly like that, so i guess i need to give up on polymorphism and directly create a population of CUDAPathGenome, since I cannot have objects with virtual functions passed to a kernel.
Thank you for the fast and precise answer.

rokonuzzamanreza · January 18, 2024, 8:20am

I faced same problem.

struct IntensityVal
{
    int width;
    int height;
    int bpp;
    BYTE* pImgBuf; //Store buffer of a RGB image
};
std::vector<IntensityVal>IntensityMat;

I have this struct and a vector of struct type.

How can I pass the vector in GPU?

njuffa · January 18, 2024, 8:42am

You have a vector of buffer descriptors and you have the actual buffers. And initially both are on the host, and you want to re-create the same data structure on the device, correct? If so, you could do the following:

(1) Create a host-side temporary copy of the vector of buffer descriptors
(2) Iterate over the temporary copy of the vector of buffer descriptors. For each buffer descriptor:
(2a) Create a device allocation based on the size specified by the descriptor, e.g. with cudaMalloc. Record the address of the device buffer just created.
(2b) Copy the buffer content from the host pointer stored in the descriptor to the device pointer returned in step 2a
(2c) Replace the pointer value in the descriptor with the device pointer you just copied the data to
(3) Create a device allocation for a vector of buffer descriptors; record the pointer returned.
(4) Copy the temporary vector of buffer descriptors from the host to the device vector allocated in step 3.
(5) Discard the host-side temporary vector of buffer descriptors that was created in step 1
(6) Call your kernel, passing the address of the device-side vector of buffer descriptors recorded in step 3.

Topic		Replies	Views
Copy pointer-based objects from host to device CUDA Programming and Performance	2	5091	October 7, 2008
How to copy Device Struct with pointers to Host? CUDA Programming and Performance	10	8498	July 8, 2014
copy data which is allocated in device from device to host CUDA Programming and Performance	2	4694	December 2, 2011
Nested Arrays cudaMemcpy of nested arrays CUDA Programming and Performance	4	5937	January 12, 2010
Complex structs in CUDA CUDA Programming and Performance	5	971	May 7, 2024
Copying struct from host to device using cudaMemcpy CUDA Programming and Performance	1	7715	May 13, 2011
CopyDeviceToHost:Invalid Argument CUDA Programming and Performance	0	41	September 17, 2024
structures, pointers, and cudaMalloc/Memcpy CUDA Programming and Performance	1	2857	August 3, 2011
cudamemcpy objects from host to device/GPU CUDA Programming and Performance	0	1647	April 18, 2011
Struct CUDA Programming and Performance	11	1755	June 30, 2016

Deep copy complex object from device to host

Related topics