Deep copy complex object from device to host

Hello, I’m new to CUDA programming and I’m trying to figure out how to copy a complex object from the device to the host.
I have an object CUDAPopulation that needs to do some computation on the device.

CUDAPopulation.h:

class CUDAPopulation {
public:
    ...
    CUDAGenome **individuals;

    CUDAPopulation(unsigned int popSize, unsigned int genNum, Objective obj = MAXIMIZE);
    ...

private:
    // Performs fitness-proportionate selection to get an individual from the population.
    __device__ CUDAGenome *select();
    // Scales the individuals' scores to fitnesses.
    __device__ void scale();
    // Implements ascending odd-even transposition sort on the individuals of the population.
    __device__ void sort();
};

// Perform an avaluation on the elements of the given population.
__global__ void evaluate(CUDAPopulation *pop);

// Perform an evolution step (selection, crossover, mutation, replacement) on the given population.
__global__ void step(CUDAPopulation *pop);

I allocate the object on the device like so:

...
    // Create the population object on the host.
    CUDAPopulation *population = new CUDAPopulation(popSize, genNumber, CUDAPopulation::MINIMIZE);

    // Allocate the population on the device.
    CUDAPopulation *d_pop;
    cudaMalloc(&d_pop, sizeof(CUDAPopulation));
    cudaMemcpy(d_pop, population, sizeof(CUDAPopulation), cudaMemcpyHostToDevice);
    ...

Then I dinamically allocate individuals calling a kernel that helps me implement polymorphism in CUDA code and run the computation:

dim3 gridSize(popSize);
    dim3 blockSize(checksNumber);
    createCUDAPathGenomePopulation<<<gridSize, 1>>>(d_pop, d_checks, checksNumber);

    for (unsigned int i = 0; i < genNumber; i++) {
        evaluate<<<gridSize, blockSize>>>(d_pop);
        cudaDeviceSynchronize();
        step<<<gridSize, blockSize>>>(d_pop);
        cudaDeviceSynchronize();
    }

createCUDAPathGenomePopulation just initializes the individuals’ field in CUDAPopulation with a subclass of CUDAGenome called CUDAPathGenome.
Everything is ok so far, the problem comes when I try to copy the CUDAPathGenomes to the host after the computation to check the results:

...
    // Copy the population object.
    CUDAPopulation *pop = (CUDAPopulation *) malloc(sizeof(CUDAPopulation));
    cudaMemcpy(pop, d_pop, sizeof(CUDAPopulation), cudaMemcpyDeviceToHost);
    cudaCheckError();

    // Copy the address of the populations' individuals.
    CUDAGenome **tmpIndividuals = (CUDAGenome **) malloc(popSize * sizeof(CUDAGenome *));
    cudaMemcpy(tmpIndividuals, pop->individuals, popSize * sizeof(CUDAGenome *), cudaMemcpyDeviceToHost);
    cudaCheckError();

    // Copy the population's individuals.
    CUDAPathGenome **individuals = (CUDAPathGenome **) malloc(popSize * sizeof(CUDAPathGenome *));
    for (unsigned int i = 0; i < popSize; i++) {
        individuals[i] = (CUDAPathGenome *) malloc(sizeof(CUDAPathGenome));
        cudaMemcpy(individuals[i], tmpIndividuals[i], sizeof(CUDAPathGenome), cudaMemcpyDeviceToHost);
        cudaCheckError();
    }
    ...

CUDAPathGenome internally has some dynamically allocated arrays that get copied later.
When I call “cudaMemcpy(individuals[i], tmpIndividuals[i], sizeof(CUDAPathGenome), cudaMemcpyDeviceToHost);”, on line 16, I get an “invalid argument” error and the program stops (that’s what cudaCheckError() does).
I can’t figure out why I get the error and what I am doing wrong.
I’m using CUDA 9.0 on a 920m GPU (3.5 compute capability).
Thanks

pointers to regions dynamically allocated on the device (i.e. allocated in device code using e.g. new or malloc) cannot participate in host-issued cudaMemcpy operations.

You haven’t shown a complete case for inspection, but I’m guessing based on your description you have some in-kernel new or malloc operations going on in createCUDAPathGenomePopulation to create the individuals. Pointers to those individuals obtained in that fashion cannot be used in a host-issued cudaMemcpy operation.

[url]Programming Guide :: CUDA Toolkit Documentation

If you attempted to do so, “invalid argument” is the error I would expect.

Yes, it’s exactly like that, so i guess i need to give up on polymorphism and directly create a population of CUDAPathGenome, since I cannot have objects with virtual functions passed to a kernel.
Thank you for the fast and precise answer.

I faced same problem.

struct IntensityVal
{
    int width;
    int height;
    int bpp;
    BYTE* pImgBuf; //Store buffer of a RGB image
};
std::vector<IntensityVal>IntensityMat;

I have this struct and a vector of struct type.

How can I pass the vector in GPU?

You have a vector of buffer descriptors and you have the actual buffers. And initially both are on the host, and you want to re-create the same data structure on the device, correct? If so, you could do the following:

(1) Create a host-side temporary copy of the vector of buffer descriptors
(2) Iterate over the temporary copy of the vector of buffer descriptors. For each buffer descriptor:
(2a) Create a device allocation based on the size specified by the descriptor, e.g. with cudaMalloc. Record the address of the device buffer just created.
(2b) Copy the buffer content from the host pointer stored in the descriptor to the device pointer returned in step 2a
(2c) Replace the pointer value in the descriptor with the device pointer you just copied the data to
(3) Create a device allocation for a vector of buffer descriptors; record the pointer returned.
(4) Copy the temporary vector of buffer descriptors from the host to the device vector allocated in step 3.
(5) Discard the host-side temporary vector of buffer descriptors that was created in step 1
(6) Call your kernel, passing the address of the device-side vector of buffer descriptors recorded in step 3.