How to copy an array of structures to the GPU?

Hi,

I’ve got the following problem:

I have an array of structures

struct neighbours
{
	unsigned int		number;			// number of neighbours contributing to the gridding
	unsigned long int	* indices;		// array of length "number" containing the indices of the neighbours
};

where each element of the structure array contains an array “indices” with “number” entries, so the size of the “indices” array is different for each element.

Here is the code with which I allocate the structure (on the CPU):

struct neighbours * neigh;

neigh = new struct neighbours [ SIZE ];

// some function to calculate "number" is called
GetNeighbourNumber ( neigh, ... );

// after we know the number of neighbours, we can allocate the memory for neigh[k].indices
for ( int k = 0; k < SIZE; k++ )
{
	neigh[k].indices = new unsigned long int [ neigh[k].number ];
}

// and fill the "indices" array in some other function
GetNeighbourIndex ( neigh, ... );

Now I want to use this structure for some calculations on the GPU. How can I transfer it to the GPU? I guess I could use cudaMemcpy, but to do so I need to know how much memory to allocate for the structure. To my knowledge, the memory a structure occupies is not simply the sum of the memory needed for every value it contains.
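For example, as far as I can tell, sizeof only sees the pointer member, not the array it points to:

// sizeof counts only the members themselves; on a typical 64-bit system this is
// 4 bytes (number) + 4 bytes padding + 8 bytes (pointer) = 16 bytes, no matter how large "number" is.
size_t struct_bytes = sizeof(struct neighbours);

// The indices data has to be counted separately, per element:
size_t data_bytes = sizeof(unsigned long int) * neigh[0].number;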

Can you give me some advice? I’d appreciate any help!

OK, I tried it in a rather inelegant way:

// first allocate memory for the structure & copy it (already containing the number of neighbours)
struct neighbours	* d_neigh;

cudaMalloc ( (void**) &d_neigh, ( sizeof(neighbours) * SIZE ) );
cudaMemcpy ( d_neigh, h_neigh, ( sizeof(neighbours) * SIZE ), cudaMemcpyHostToDevice );

// then allocate memory for each indices array & copy it
// (don't know if there's a more elegant method... there ought to be...)
for ( k = 0; k < SIZE; k++ )
{
	cudaMalloc( (void**) &(d_neigh[k].indices), ( sizeof(unsigned long int) * h_neigh[k].number ) );
	cudaMemcpy( d_neigh[k].indices, h_neigh[k].indices, ( sizeof(unsigned long int) * h_neigh[k].number ), cudaMemcpyHostToDevice );
}

If I copy the d_neigh[k].indices arrays back to the host with cudaMemcpy and look at the entries, this actually works.

But when I try to use the values in a device function, I can access the values saved in d_neigh[k].number, but not the ones in the array d_neigh[k].indices. During compilation I get the warning

Warning: Cannot tell what pointer points to, assuming global memory space

and during execution it throws an exception.

So I assume I’m making some pointer error: does d_neigh[k].indices end up pointing into host memory rather than device memory? How can I fix this? Other threads in this forum didn’t really help me understand…

What the code in your second post does can’t work, because you cannot dereference or access device memory directly from the host. When you compute this address

&(d_neigh[k].indices)

you are doing pointer arithmetic on a device pointer, and cudaMalloc then writes the newly allocated pointer through that address from host code, which is illegal. Also, because there is no error checking on any of the API calls, you don’t actually know what is working and what is failing. Once d_neigh is allocated as a device pointer, it is illegal to try to access any of its elements directly in host code.
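As an aside, a minimal error-checking wrapper (just a sketch) would make it obvious which call fails:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                      \
	do {                                                                  \
		cudaError_t err__ = (call);                                   \
		if (err__ != cudaSuccess) {                                   \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",       \
			        cudaGetErrorString(err__), __FILE__, __LINE__); \
			exit(EXIT_FAILURE);                                   \
		}                                                             \
	} while (0)

// usage: CUDA_CHECK( cudaMalloc( (void**) &d_neigh, sizeof(struct neighbours) * SIZE ) );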

What you need to do is construct the device structure in host memory first, then copy that to the device. Like this:

struct neighbours	* d_neigh, * l_neigh;

l_neigh = new struct neighbours [ SIZE ];

// Assemble the device structure in host memory first
for ( k = 0; k < SIZE; k++ )
{
	unsigned long int * indices;

	cudaMalloc( (void**) &indices, ( sizeof(unsigned long int) * h_neigh[k].number ) );
	cudaMemcpy( indices, h_neigh[k].indices, ( sizeof(unsigned long int) * h_neigh[k].number ), cudaMemcpyHostToDevice );

	l_neigh[k].number  = h_neigh[k].number;
	l_neigh[k].indices = indices;	// a device pointer, stored in the host-side copy
}

// Then copy that host memory version to device memory
cudaMalloc ( (void**) &d_neigh, ( sizeof(struct neighbours) * SIZE ) );
cudaMemcpy ( d_neigh, l_neigh, ( sizeof(struct neighbours) * SIZE ), cudaMemcpyHostToDevice );
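Note that the per-element device pointers now exist only in l_neigh, so keep it around (or copy d_neigh back) if you want to free everything later, roughly:

for ( k = 0; k < SIZE; k++ )
	cudaFree( l_neigh[k].indices );	// device pointers recorded in the host-side copy
cudaFree( d_neigh );
delete [] l_neigh;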

If I were you, I’d code up a very primitive memory allocator for this task. Once you’ve invoked GetNeighbourNumber ( neigh, ... ), you know the total amount of memory needed for the indices of all the neighbours structures put together. So I would allocate that many unsigned longs as a single host array, which could even be a global variable (if you’re sure you need only one array of neighbours in your code). Then I would replace the member *indices with an integer or size_t member called offset, which effectively points into that global array. With this modification, copying everything to the GPU becomes trivial: copy the global array with one cudaMemcpy, then the array of neighbours with another. If you need to keep compatibility with the old host code, you can keep the redundant member *indices (a host pointer) in addition to the new offset member, but be sure to use that redundant pointer on the host only.

Very roughly, the code could look like this:

// NEW CODE: global host-based store for all indices, across all neighbours.
unsigned long int *g_indices;

struct neighbours
{
	unsigned int		number;		// number of neighbours contributing to the gridding
	unsigned int		offset;		// the offset of this element's indices within the g_indices array
	// OPTIONAL, COMPATIBILITY ONLY: redundant host pointer into g_indices.
	unsigned long int	*indices;
};

void f() {
	struct neighbours	*d_neigh, *l_neigh;

	// NEW CODE: the array of all the indices put together on the device.
	unsigned long int	*d_indices;

	l_neigh = new struct neighbours [ SIZE ];

	// some function to calculate "number" is called
	GetNeighbourNumber ( l_neigh, ... );

	// NEW CODE: calculate the total number of indices required.
	unsigned int num_indices = 0;
	for (int i = 0; i < SIZE; ++i)
		num_indices += l_neigh[i].number;

	// NEW CODE: allocate memory for all the indices on the host.
	g_indices = (unsigned long int *) malloc( sizeof(unsigned long int) * num_indices );

	// NEW CODE: populate the offsets within the host neighbours; optionally initialize the indices pointers.
	unsigned int current_offset = 0;
	for (int i = 0; i < SIZE; ++i) {
		l_neigh[i].offset = current_offset;
		current_offset += l_neigh[i].number;
		// OPTIONAL: only if host pointers to the indices are required for compatibility reasons.
		l_neigh[i].indices = g_indices + l_neigh[i].offset;
	}

	// and fill the GLOBAL "indices" array in some other function
	GetNeighbourIndex ( l_neigh, ... );

	// NEW CODE: allocate d_neigh and d_indices on the device, same sizes as on the host.
	cudaMalloc( (void**) &d_neigh,   sizeof(struct neighbours) * SIZE );
	cudaMalloc( (void**) &d_indices, sizeof(unsigned long int) * num_indices );

	// NEW CODE: copy l_neigh into d_neigh and g_indices into d_indices.
	cudaMemcpy( d_neigh,   l_neigh,   sizeof(struct neighbours) * SIZE,        cudaMemcpyHostToDevice );
	cudaMemcpy( d_indices, g_indices, sizeof(unsigned long int) * num_indices, cudaMemcpyHostToDevice );

	// NEW CODE: launch the kernel, passing d_neigh and d_indices to it as parameters.
	// Within the kernel, access index i of neighbours n as d_indices[d_neigh[n].offset + i].
	// DO NOT use the optional indices pointer within the kernel!

	free(g_indices); g_indices = 0;
}
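For illustration, a kernel over this layout could look like the following sketch (the kernel name and the per-index work are placeholders):

__global__ void process( const struct neighbours *d_neigh,
                         const unsigned long int *d_indices,
                         int size )
{
	int n = blockIdx.x * blockDim.x + threadIdx.x;
	if ( n >= size )
		return;

	// Walk this element's slice of the flat indices array.
	for ( unsigned int i = 0; i < d_neigh[n].number; ++i )
	{
		unsigned long int idx = d_indices[ d_neigh[n].offset + i ];
		// ... do something with idx ...
	}
}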

You might consider declaring d_indices as a global variable on the device (a __device__ or __constant__ pointer, set once with cudaMemcpyToSymbol), rather than a kernel parameter, so that it mirrors the global nature of the g_indices array on the host.
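Roughly like this (the symbol name is only an example):

// Device-global pointer, set once from the host after d_indices is allocated and filled.
__device__ unsigned long int *dev_indices;

// Host side:
cudaMemcpyToSymbol( dev_indices, &d_indices, sizeof(d_indices) );

// Kernels can then read dev_indices directly instead of taking it as a parameter.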

Needless to say, all of the above assumes that unsigned long int has the same size and representation on the device and the host.

I don’t know how good CUDA currently is at allocating and copying a large number of small arrays from host to device, which is what your original solution relied upon. The solution above uses only two copies rather than SIZE of them, which should make the code more robust and better performing.

Hope this helps.

Thank you both very much for your good advice, it was really helpful!!!

I prefer the one-array-containing-all-indices solution, as allocating and copying to device memory in a loop really was prohibitively slow! Now it’s working and much, much faster.

How can I find out if this is the case? I would use CUDA’s built-in types, but I get an error when trying to use a “uint1” value as an index…

I don’t have experience with this type. Be careful with it: expect a whole bunch of performance issues with memory coalescing on the device.
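As for finding out whether unsigned long int really matches: one simple sketch is to compare sizeof on both sides (the names here are just examples):

__global__ void type_size( size_t *out )
{
	*out = sizeof(unsigned long int);	// the size as the device sees it
}

// Host side:
size_t h_size = 0, *d_size;
cudaMalloc( (void**) &d_size, sizeof(size_t) );
type_size<<<1, 1>>>( d_size );
cudaMemcpy( &h_size, d_size, sizeof(size_t), cudaMemcpyDeviceToHost );
// h_size should equal sizeof(unsigned long int) on the host.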

How do I copy the results back to host memory? I tried something like this:

struct neighbours *output;	// some output host pointer
struct neighbours *output_d;	// some output device pointer

// say output is similar to h_neigh and I want the kernel to do some computation
// on values in h_neigh and copy the results back to output

for each k
{
	float *p_neigh;

	cudaMalloc ( (void**) &p_neigh, ( sizeof(unsigned long int) * h_neigh[k].number ) );

	kernel<<<grid, block>>>( l_neigh[k], p_neigh, output_d[k] );

	// I don’t think I can do
	// cudaMemcpy( output[k].indices, output_d[k].indices, ( sizeof(unsigned long int) * h_neigh[k].number ), cudaMemcpyDeviceToHost );

	cudaMemcpy( output[k].indices, p_neigh, ( sizeof(unsigned long int) * h_neigh[k].number ), cudaMemcpyDeviceToHost );
}

kernel( l_neigh[k], p_neigh, output_d[k] )
{
	// some computation; put the output in output_d[k].indices
	p_neigh = output_d[k].indices;
}

Why is this wrong? It’s not working. I create a device pointer p_neigh, and in the kernel this is made to point to the output values; then I copy p_neigh back to output.

please help asap!
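The assignment p_neigh = output_d[k].indices; inside the kernel only changes the kernel’s local copy of the parameter, so the buffer you later cudaMemcpy from is never written. Write the results through the pointer instead; a rough sketch with a placeholder computation (note that p_neigh should be unsigned long int *, not float *):

__global__ void kernel( struct neighbours n, unsigned long int *p_out )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < n.number )
		p_out[i] = n.indices[i] + 1;	// placeholder computation; n.indices must be a device pointer here
}

Then the cudaMemcpy from p_neigh back to output[k].indices works as in your loop.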