Memory allocation from a device function?

I am doing a project involving CUDA for my Master’s thesis and have run into a problem that I hope someone can help me with. I would like to declare a pointer and call something similar to malloc() on it from inside a device function. I have tried simply declaring a local variable and passing its address as an argument to a function call, but that seems to cause problems. Basically, what I want is a chunk of memory that is visible to only one thread and that can be passed between device functions by pointer, so that the data can be manipulated inside the function call. Each thread would get its own chunk, holding different data.

Right now, to get around the problem, I am creating an array of the variable type on the host and calling cudaMalloc() to create an array with one element per thread. I then index the array from the device using the threadID. This solution feels like a bit of a hack to me and I would prefer to create the memory as needed.
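A minimal sketch of the workaround described above, assuming a one-dimensional launch. The kernel name foo, the host helper runWorkaround, and the block size of 128 are hypothetical choices made for illustration; only aStruct and bar come from the example in this post.

```cuda
#include <cuda_runtime.h>

struct aStruct {
  int a;
  int b;
};

// Writes through the pointer, just as the host version of bar() does.
__device__ void bar(aStruct *s)
{
  s->a = 1;
  s->b = 2;
}

// Each thread touches only its own element of the preallocated array.
__global__ void foo(aStruct *scratch, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    bar(&scratch[tid]);
  }
}

// Hypothetical host helper: allocates one element per thread, runs the
// kernel, and copies element `which` back so it can be inspected.
aStruct runWorkaround(int n, int which)
{
  aStruct result = {0, 0};
  aStruct *scratch = nullptr;
  if (cudaMalloc((void **)&scratch, n * sizeof(aStruct)) != cudaSuccess) {
    return result;
  }
  int threads = 128;                        // assumed block size
  int blocks = (n + threads - 1) / threads; // enough blocks to cover n
  foo<<<blocks, threads>>>(scratch, n);
  cudaMemcpy(&result, &scratch[which], sizeof(aStruct),
             cudaMemcpyDeviceToHost);
  cudaFree(scratch);
  return result;
}
```

The array lives in global memory, so the per-thread privacy here is only a convention: nothing stops one thread from indexing another thread's element.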

Here is a host code example of what I would like to accomplish:

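#include <stdlib.h>

struct aStruct {
  int a;
  int b;
};

/* Fills in the struct through the pointer it is given. */
void bar(struct aStruct * myStruct)
{
  myStruct->a = 1;
  myStruct->b = 2;
}

void foo()
{
  struct aStruct * myStruct = (struct aStruct*)malloc(sizeof(struct aStruct));
  if (myStruct == NULL) return;
  bar(myStruct);
  free(myStruct);
}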

On a related note, I would eventually like to be able to create data structures such as linked lists and trees on the GPU. I understand that this may be difficult due to the fact that the order of execution of the threads is not guaranteed, but I would like to know whether it is possible and, if so, what the preferred method is.

Thanks

You can use local memory; however, dynamic allocation is not (in my understanding) allowed.
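As a rough sketch of the local-memory approach, a struct declared inside the kernel is private to each thread, and its address can be handed to other device functions. The kernel name foo, the output array out, and the host helper runLocal are hypothetical; aStruct and bar follow the question's example.

```cuda
#include <cuda_runtime.h>

struct aStruct {
  int a;
  int b;
};

// Manipulates the caller's struct through the pointer.
__device__ void bar(aStruct *s)
{
  s->a = 1;
  s->b = 2;
}

__global__ void foo(int *out, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= n) return;

  // Private to this thread; the compiler keeps it in registers or
  // spills it to per-thread local memory.
  aStruct myStruct;
  bar(&myStruct);

  // Publish something derived from the private data.
  out[tid] = myStruct.a + myStruct.b;
}

// Hypothetical host helper: runs the kernel over n threads and returns
// the value written by thread `which`, or -1 if allocation fails.
int runLocal(int n, int which)
{
  int result = -1;
  int *out = nullptr;
  if (cudaMalloc((void **)&out, n * sizeof(int)) != cudaSuccess) {
    return result;
  }
  int threads = 128;                        // assumed block size
  int blocks = (n + threads - 1) / threads;
  foo<<<blocks, threads>>>(out, n);
  cudaMemcpy(&result, &out[which], sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(out);
  return result;
}
```

The size of the struct has to be known at compile time, which is the limitation being described: this gives per-thread storage, not a per-thread malloc().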

Building data structures like this will be very hard on the GPU. First, you need to have enough memory already allocated, because dynamic allocation isn’t allowed in device functions. Second, as you mention, each thread needs to know what the other threads are doing.

What I’ve suggested to a colleague for his situation is to run only one block with one thread to build the data structure. It’s likely that he’d be better off building it on the CPU and copying it over, though.
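A minimal sketch of the one-block, one-thread idea, assuming the nodes come from a pool that the host allocates up front with cudaMalloc(). The Node type, the kernels buildList and sumList, and the host helper runBuild are all hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

struct Node {
  int value;
  Node *next;
};

// Launched as <<<1, 1>>>: a single thread links the preallocated nodes
// into a list, so no coordination between threads is needed.
__global__ void buildList(Node *pool, int count, Node **head)
{
  Node *list = nullptr;
  for (int i = 0; i < count; ++i) {
    pool[i].value = i;
    pool[i].next = list;  // push onto the front of the list
    list = &pool[i];
  }
  *head = list;
}

// Also single-threaded: walks the list and sums the stored values.
__global__ void sumList(Node *const *head, int *sum)
{
  int total = 0;
  for (const Node *n = *head; n != nullptr; n = n->next) {
    total += n->value;
  }
  *sum = total;
}

// Hypothetical host helper: preallocates the pool, builds the list on
// the device, and returns the sum of its values (-1 on failure).
int runBuild(int count)
{
  Node *pool = nullptr;
  Node **head = nullptr;
  int *dSum = nullptr;
  int sum = -1;
  if (cudaMalloc((void **)&pool, count * sizeof(Node)) == cudaSuccess &&
      cudaMalloc((void **)&head, sizeof(Node *)) == cudaSuccess &&
      cudaMalloc((void **)&dSum, sizeof(int)) == cudaSuccess) {
    buildList<<<1, 1>>>(pool, count, head);
    sumList<<<1, 1>>>(head, dSum);
    cudaMemcpy(&sum, dSum, sizeof(int), cudaMemcpyDeviceToHost);
  }
  cudaFree(dSum);
  cudaFree(head);
  cudaFree(pool);
  return sum;
}
```

The next pointers are device addresses, which is one reason building on the CPU and copying is not a plain memcpy: a host-built list would need its links expressed as indices into the pool, or patched after the copy.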

Brian