How do you copy entire struct's with pointers? Problem with copying structures containing pointe

Hi all,

I am having a problem with CUDA regarding structures with pointers. I have been able to successfully allocate memory, the problem comes when you want to copy data within a struct on the host to the device.

For example:

typedef struct

{

float* ptr;

}A;

If the float was a unified type you could write:

SomeStructure.Ptr.CopyToDevice();

or

SomeStructure.Ptr.CopyFromDevice();

That’s how I implemented it anyway External Image

But I am losing interest in cuda for the moment because of slow gt 520 gpu External Image

What do you mean by a unified type? I don’t quite understand what you meant by the two functions CopyToDevice() and CopyFromDevice().

Can you explain?

Kisty

Here is an example of what I want to do.

#include <stdio.h>

#include <assert.h>

#include <cuda.h>

#include <cuda_runtime.h>

typedef struct

{

	float* ptr;

}A;

	

/**

 * Kernel to link the structure on the device

 */

__global__ void device_kernel_link_struct(A* struct_d, float* array_d)

{

	struct_d->ptr = array_d;

}

/**

 * kernel to get array pointer inside an A struct

 */

__global__ void device_kernel_get_array(float** array_ptr_d, A* struct_d)

{

	*array_ptr_d = struct_d->ptr;

}

int main()

{

	A* myStruct_h;

	A* myStruct_d;

	float* myArray_d;

	float** myArray_ptr_d; 

	float** myArray_ptr_fromDevice_h;

	int size = 20;

	

	/* To allocate structure on host */

	myStruct_h = (A*) malloc(sizeof(A));

	myStruct_h->ptr = (float*) malloc(sizeof(float) * size);

	int i; 

	for (i = 0; i < size; i++)

	{

		myStruct_h->ptr[i] = 1;

		//printf("ptr[%d] = ");

		assert(myStruct_h->ptr[i] == 1);

	}

	

	printf("myStruct_h->ptr = %p\n", myStruct_h->ptr);

	/* To allocate structure on device */

	cudaMalloc((void**) &myStruct_d, sizeof(A));

	cudaMalloc((void**) &myArray_d, sizeof(float) * size);

	//Use a kernel to link the two together

	device_kernel_link_struct<<<1,1>>>(myStruct_d, myArray_d);

	

	/* Copy an A structure to the device*/

	// Allocate memory for the pointer to a pointer

	cudaMalloc((void**) &myArray_ptr_d, sizeof(float*));

	// Get device pointer to ptr inside myStruct_d via a kernel?

	device_kernel_get_array<<<1,1>>>(myArray_ptr_d, myStruct_d);

	

	// Check the pointer

	printf("myArray_ptr_d = %p\n", myArray_ptr_d);

	myArray_ptr_fromDevice_h = (float**) malloc(sizeof(float*));

	cudaMemcpy(myArray_ptr_fromDevice_h, myArray_ptr_d, sizeof(float*), cudaMemcpyDeviceToHost);

	printf("myArray_ptr_d = %p\n", myArray_ptr_d);

	printf("myArray_ptr_fromDevice_h = %p\n", myArray_ptr_fromDevice_h);

	printf("*myArray_ptr_fromDevice_h = %p\n", *myArray_ptr_fromDevice_h);

	printf("\n");

	// Copy the host A array to device

	cudaMemcpy(*myArray_ptr_d, myStruct_h->ptr, sizeof(float) * size, cudaMemcpyHostToDevice);

	

	

	return 0;

}

In reality I have a structure pointing to an array of structures and a structure which contains the dimensions of the array.

A mechanism to extract a pointer in a structure on the device would be most ideal.

This is how I would do it but is there an easier way?

Kisty

Unified types would allow you to describe your data structure just once.

The unified types allocate cuda and host memory at the same time.

Providing this functionality on the host side can/could be done with c++ advanced features.

The cuda side can just have it’s own data structure as normal.

For now if you do not have unified types available the only thing you can do is write double code as follows:

MyDataStructure
{
int MyCudaField;
int MyHostField

int *MyCudaPointer;
int MyHostPointer;
}

And so forth…

Now you just need code to copy back and forth between these two.

CopyToDevice( MyHostField, MyCudaField );
CopyFromDevice( MyHostField, MyCudaField );

CopyToDevice( MyHostPointer, MyCudaPointer );
CopyFromDevice( MyHostPointer, MyCudaPointer );

// however these two statements above are not necessary since you should do:

AllocOnHost( MyHostPointer, MySize );
AllocOnCuda( MyCudaPointer, MySize );

This way your data structure is created, allocated and initialized the same way on host and device side.

Now you just need to pass pointers.

This could be done in a pointer passing structure as well just to make sure limit is not hit.

ParameterStructure
{
CudaPointer *Parameter;
}

Parameter[0] = &MyStructure;
Parameter[1] = &MyStructure.MyCudaPointer; // same as .MyCudaArray;

Then pass this parameter to kernel

KernelLaunch( ParameterStructure );

Finally inside cuda kernel initialize data structure one last time just to be sure… if you need pointers like that

MyStructure = Parameter[0]
MyStructure.MyCudaPointer = Parameter[1];

^ All of these fields would need to be pointers.

This step could be left out if you sure data structure on device has same layout as on host… as it was allocated…

Then only thing necessary is copy cuda pointers to device.

So if you structure contains cuda data types/pointers only this might work, in reality it will probably not work, since some kernel parameters need to be host side, though this could be worked around somewhat by passing cuda parameters only…

This should give you some idea…

If this is too difficult for you to understand then here is an alternative easier solution but requires more programming effort for you every time:

LEt’s see if I understand correctly first:

You have host structure with a host pointer which points to another host array.

You have device structure with a device pointer which points to another device array.

You want the device pointer to be initialized properly.

What you would need to do is apply pointer arithmetic on the cuda pointer which is the device structure pointer.

According to others cuda pointer arithmetic is possible on cuda pointers on the host side.

This requires to know the offset of the fields compared to the base address of the structures.

However if the size of the fields of the structures are the same on host and device side then all that is necessary is to lay the structure on top of each other.

So what you can do is quite simply:

You describe you data structure just once like so:

MyDataStructure
{
int MyField;
int *MyArray;
};

Then you allocate the structure on host side.
Then you allocate the structure on device side.

Then you allocate the array on host side.
Then you allocate the array on cuda side.

Ok you do it a little bit different host side first cuda side then but that don’t matter.

The point is where do you store these pointers returned from the cuda malloc calls.

All you need to do is this:

TypecastCudaPointerToStructureToMyDataStructure( CudaPointerToStructure ).MyArray = CudaArrayPointer;

So long story short your code needs to look something like:

myStruct_h( myStruct_d )->ptr = myArray_d;

The only little problem is with ptr… it’s pointing to a host data type.

But this doesn’t matter if the pointers are of the same size.

So you should first look into if your kernel is using 32 bit and your host code 32 bit, or if it’s both 64 bit or mixed bit.

If it’s the same pointer size, then this technique will work.

You can change all your pointer types to void * or some other more “unified looking type”.

How to do these typecasts exactly you should be able to figure that out yourself… not gonna write that entire boring code but it’s something like this:

If using a nice pointer type:
PDataStructure( my structt )

Else you’ll have to use those nasty asterixes:

(*myStruct_h)(my_struct_d)->ptr = etc;

I believe Skybuck is referring to unified virtual addresses when he writes “unified types”.
If you are using CUDA 4.0 on compute capability 2.x hardware, this indeed simplifies matters considerable, as you don’t need to convert host pointers to device pointers anymore.

However, from a performance point of view, pointers generally lead to bad memory access patterns and are best avoided. With CUDA, the best approach usually is to flatten data structures and just operate on arrays.

Not really, unified types can also be build manually so it doesn’t necessarily require cuda 4.0 External Image