cudaMalloc and Structs and Pointers problem

Suppose I have 2 structs:

[codebox]struct a { int temp; };

struct b { struct a *c; };

__global__ void akernel(struct b *X)
{
	// kernel body
}

// now, a C function to call my kernel:
void run()
{
	struct b *X_H;
	struct b *X_D;

	// allocate host memory
	// allocate device memory

	// copy memory from host to device

	// run kernel
	akernel<<<grid,threads>>>(X_D);

	// copy results back

	free(host memories);
}[/codebox]



I get a segmentation fault at that line, but only when trying to run on the device. I can't figure out what I'm doing wrong, but I'd really like to use nested structures with dynamically allocated memory (allocated on the host first, of course, not on the GPU). I have a lot of b's, and the size of the array that c points to might be different each time I run the program.



Additional Information:

GPU: GTX 275

Running Ubuntu 9.04 64-bit, CPU: Intel E8400

Cuda 2.3

Question: I get a segmentation fault at that line, but only when trying to run on the device

Ans: you probably have a memory fault in "cudaMalloc((void**)&X_D->c, sizeforc)". Try

cutilSafeCall( cudaMalloc((void**)&X_D->c, sizeforc) );

and cutilSafeCall will then show the error message.
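cutilSafeCall comes from the cutil utility library shipped with the SDK samples; if you don't want to link against cutil, a minimal hand-rolled check does the same job. This is just a sketch of my own, not part of the original post:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Minimal stand-in for cutilSafeCall: print the error string and abort.
#define CHECK_CUDA(call)                                          \
	do {                                                      \
		cudaError_t err = (call);                         \
		if (err != cudaSuccess) {                         \
			fprintf(stderr, "CUDA error at %s:%d: %s\n", \
			        __FILE__, __LINE__, cudaGetErrorString(err)); \
			exit(1);                                  \
		}                                                 \
	} while (0)

// usage: CHECK_CUDA( cudaMalloc((void**)&X_D_host->c, sizeforc) );[/codebox]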

First, after "cudaMalloc((void**)&X_D, sizeof(struct b));", X_D contains an address in device memory; this is O.K.

However, you CANNOT use X_D->c, because dereferencing X_D means reading device memory, which cannot be accessed directly from host code.

You can try the following code to allocate "struct b" on the device:

[codebox]struct b *X_D_host;

X_D_host = (struct b *)malloc( sizeof(struct b) );

// use X_D_host to stage the device address that will become X_D->c; this is
// legal since X_D_host is a host variable and X_D_host->c is also a host
// variable; in fact (void*)&(X_D_host->c) == (void*)X_D_host, because c is
// the first member of struct b.
// after this call, host variable X_D_host->c contains an address in device memory
cutilSafeCall( cudaMalloc((void**)&X_D_host->c, sizeforc) );

// copy the "struct b" on the host to the "struct b" on the device:
// the content of X_D_host->c is copied into X_D->c, and since X_D_host->c
// contains an address in device memory, X_D->c now contains that device
// address too
cudaMalloc( (void**)&X_D, sizeof(struct b) );
CUDA_SAFE_CALL( cudaMemcpy( X_D, X_D_host, sizeof(struct b), cudaMemcpyHostToDevice) );[/codebox]

Question: how to copy data to X_D->c ?

Ans: you cannot access X_D->c directly, but you can use X_D_host->c instead:

[codebox]// suppose array_c contains the data and numOfc = number of elements in array_c;
// then copy array_c to the device memory pointed to by X_D_host->c, not X_D->c
CUDA_SAFE_CALL( cudaMemcpy( X_D_host->c, array_c, sizeof(struct a)*numOfc, cudaMemcpyHostToDevice) );[/codebox]

Question: how to free device memory?

[codebox]cudaFree(X_D_host->c);
cudaFree(X_D);      // don't forget the device copy of the struct itself
free(X_D_host);[/codebox]

ps: building a linked list in device memory is tedious, but it can still be done by applying this staging pattern recursively, as above.
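Putting the pieces together, here is a complete sketch of the whole flow. The snippets above are assembled as-is; main(), the initialization of array_c, and the kernel body are my own illustration, and error checking is omitted for brevity:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct a { int temp; };
struct b { struct a *c; };

__global__ void akernel(struct b *X)
{
	// each thread touches one element of the nested array
	int i = threadIdx.x;
	X->c[i].temp += 1;
}

int main(void)
{
	const int numOfc = 8;
	const size_t sizeforc = sizeof(struct a) * numOfc;

	// host-side data for the nested array
	struct a *array_c = (struct a *)malloc(sizeforc);
	for (int i = 0; i < numOfc; ++i) array_c[i].temp = i;

	// staging copy of struct b: lives on the host but holds a device pointer
	struct b *X_D_host = (struct b *)malloc(sizeof(struct b));
	cudaMalloc((void**)&X_D_host->c, sizeforc);
	cudaMemcpy(X_D_host->c, array_c, sizeforc, cudaMemcpyHostToDevice);

	// device copy of struct b
	struct b *X_D;
	cudaMalloc((void**)&X_D, sizeof(struct b));
	cudaMemcpy(X_D, X_D_host, sizeof(struct b), cudaMemcpyHostToDevice);

	akernel<<<1, numOfc>>>(X_D);

	// copy results back through the staged device pointer
	cudaMemcpy(array_c, X_D_host->c, sizeforc, cudaMemcpyDeviceToHost);
	printf("first element after kernel: %d\n", array_c[0].temp);

	cudaFree(X_D_host->c);
	cudaFree(X_D);
	free(X_D_host);
	free(array_c);
	return 0;
}[/codebox]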


Thanks! I was headed down that path after posting, but after about 4 hours I still didn't get it to work until I carefully followed your code. Works great now; a lot of running around though haha…


This is a very useful post. Thanks very much. I just have a follow up question.

I have followed exactly the prescription above and can transfer my structure DEVICE onto the GPU with one of its arrays populated.


[codebox]DEVICE *X_D_host;

X_D_host = (DEVICE *)malloc( sizeof(DEVICE) );

cutilSafeCall( cudaMalloc((void**)&X_D_host->c1x, sizeforc) );

cudaMalloc( (void**)&X_D, sizeof(DEVICE) );

CUDA_SAFE_CALL( cudaMemcpy( X_D, X_D_host, sizeof(DEVICE), cudaMemcpyHostToDevice) );

integrals_2e_kernel<<<dimBlock,dimGrid>>>(d_number, new_number, X_D);[/codebox]

This all works well. Below is a sample kernel just to see how the array is used on the device:

[codebox]__global__ void integrals_2e_kernel(double* d_number, double* new_number, DEVICE* device)
{
	*new_number = *d_number + d_F[4] + device->c1x[6];
}[/codebox]


At compile time I get the following warning:

Warning: Cannot tell what pointer points to, assuming global memory space

Question: Is it right to assume that, using this method, the arrays X_D_host->… will end up in global memory space, and that I can then copy them into shared memory inside the kernel?

Also, do I need to add any extra definitions to avoid unexpected behaviour when the compiler cannot tell what a pointer points to?
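For reference, copying such a global-memory array into shared memory inside a kernel usually looks like the following sketch (the kernel name, buffer size, and n are illustrative, not from the posts above):

[codebox]__global__ void kernel_with_shared(DEVICE* device, int n)
{
	// shared staging buffer; size fixed at compile time here for simplicity
	__shared__ double s_c1x[256];

	int tid = threadIdx.x;
	if (tid < n)
		s_c1x[tid] = device->c1x[tid];   // global -> shared
	__syncthreads();                         // wait until the copy is complete

	// ... work on s_c1x instead of device->c1x ...
}[/codebox]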

Thanks again for this very instructive post.

Hi! I am a newbie to CUDA programming. I'm trying to make a C program use GPU memory.

[codebox]#define BIAS_SIZE 100

struct BIAS_ENTRY {        /* Bias array entries */
	char inner_tbl;    // left element
	char inner_tbm;    // middle element
	char inner_tbr;    // right element
};

struct DHGN {              /* DHGN node data structure */
	char left;         // left message
	char right;        // right message
	char middle;       // bottom message
	int idx;           // index
	struct BIAS_ENTRY bias_element[BIAS_SIZE];  // Bias entries
};[/codebox]

—The code above is the header file.

[codebox]int ccl(char hleft, char hmiddle, char hright, int neuron)
{
	int a, data;

	for (a = 0; a < 100; a++) {
		if ((hleft == node[neuron].bias_element[a].inner_tbl) &&
		    (hmiddle == node[neuron].bias_element[a].inner_tbm) &&
		    (hright == node[neuron].bias_element[a].inner_tbr)) {
			data = a + 1;
		}
		if ((node[neuron].bias_element[a].inner_tbl == '\0') &&
		    (node[neuron].bias_element[a].inner_tbm == '\0') &&
		    (node[neuron].bias_element[a].inner_tbr == '\0')) {
			node[neuron].bias_element[a].inner_tbl = hleft;
			node[neuron].bias_element[a].inner_tbm = hmiddle;
			node[neuron].bias_element[a].inner_tbr = hright;
			data = a + 1;
		}
	}
	return data;
}[/codebox]

–The above is the function that I want to run in the kernel.

So, how do I allocate the nested structure in GPU memory?

The array of BIAS_ENTRY structs ("bias_element") is a fixed-size array, so the whole struct DHGN occupies one contiguous block of memory. If you are looking for the simplest solution, it's enough to allocate the appropriate amount of device memory with

struct DHGN *dev_dhgn;

cudaMalloc((void**)&dev_dhgn, sizeof(struct DHGN) * number_of_DHGN_nodes);

(dev_dhgn should be an ordinary host pointer here; taking the address of a __device__ variable from host code does not work) and then just copy the DHGN node array from host memory to the device memory pointed to by dev_dhgn with the cudaMemcpy function. But if you really care about speed and optimality, I advise you to consider using SoA (Structure of Arrays) instead of AoS (Array of Structures), organising the DHGN node array in device memory in a more clever way, and leveraging memory optimisation techniques such as coalescing, the global memory cache (cc 2.0+), or the texture cache.
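A minimal sketch of that simplest approach (the host array name, the fill step, and the node count are illustrative, not from the original posts):

[codebox]int number_of_DHGN_nodes = 64;
struct DHGN *host_nodes;
struct DHGN *dev_dhgn;

host_nodes = (struct DHGN *)malloc(sizeof(struct DHGN) * number_of_DHGN_nodes);
// ... fill host_nodes, including each bias_element array ...

cudaMalloc((void**)&dev_dhgn, sizeof(struct DHGN) * number_of_DHGN_nodes);
cudaMemcpy(dev_dhgn, host_nodes,
           sizeof(struct DHGN) * number_of_DHGN_nodes,
           cudaMemcpyHostToDevice);

// pass dev_dhgn to the kernel; no per-member allocation is needed because
// bias_element is a fixed-size array embedded in the struct[/codebox]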

How about the kernel function? Normally a kernel must be declared void, but in my case I need to return the data value. And after allocating the DHGN array in GPU memory, can the elements of the DHGN struct be used directly in the kernel, or do I need to allocate them separately as well?

As for the first part of your question, the simplest way to "return" a single variable from a kernel is to declare it as a global __device__ variable; it is then, of course, visible from the kernel, and you can read it back on the host with cudaMemcpyFromSymbol. Another, more elegant way is to allocate device memory for the return value and pass its address as an argument to the kernel. After the kernel has executed, you just copy it from device to host memory.
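The second way described above might look like this sketch (the kernel body, launch configuration, and variable names are illustrative):

[codebox]__global__ void ccl_kernel(struct DHGN *nodes, int *d_result)
{
	// ... compute data as in ccl() ...
	int data = 0;        // placeholder for the real computation
	*d_result = data;    // write the "return value" to device memory
}

// host side:
int h_result;
int *d_result;
cudaMalloc((void**)&d_result, sizeof(int));
ccl_kernel<<<1,1>>>(dev_dhgn, d_result);
cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_result);[/codebox]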

As for the second one, I'm not sure I get what you mean, but as far as the bias_element array is concerned, you don't have to allocate device memory for it separately, because it's fixed-size and, as a result, the whole struct DHGN occupies one contiguous block of memory.