I would like to know how to create a linked list using unified memory on the host and be able to add and delete nodes on the GPU (device). I keep reading that it is possible, but I have not been able to make it work. I create the linked list on the host and I can access it on the device. However, if I create a node on the device and try to add it to the list, the node isn’t allocated properly and the next and prev links are not connected properly. I can’t use cudaMallocManaged on the device side. Thanks
You won’t be able to do this with all the flexibility you might want using only the allocators provided by CUDA. The principal reason is that the host-code allocators (for device memory) and the device-code allocators don’t allocate GPU memory from the same place, and they are not interoperable; this is mentioned in the programming guide.
Subject to those limitations, there are probably several scenarios that could work, but they may not be interesting. For example, it should be possible to create a linked list on the host and add to it on the device. But the resulting list will no longer be reliably traversable in host code.
In a nutshell, device-side allocations are not accessible in host code. They are not managed memory, and as you already indicated, there is no managed allocator on the device.
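A minimal sketch of the failure mode (the names here are illustrative, not taken from your code): the managed head node is visible on both sides, but a node created with in-kernel malloc() lives in the device-only heap, so the host cannot follow the link to it.

```cuda
#include <cstdio>

struct Node { int val; Node *next, *prev; };

__global__ void appendOnDevice(Node *tail)
{
    // This allocation comes from the device heap, NOT from managed memory.
    Node *n = (Node *)malloc(sizeof(Node));
    n->val  = 42;
    n->next = nullptr;
    n->prev = tail;
    tail->next = n;      // a valid link when followed in device code...
}

int main()
{
    Node *head;
    cudaMallocManaged(&head, sizeof(Node));
    head->val = 0; head->next = nullptr; head->prev = nullptr;

    appendOnDevice<<<1, 1>>>(head);
    cudaDeviceSynchronize();

    // ...but head->next now points into the device-side heap, so
    // dereferencing it here (e.g. head->next->val) faults on the host.
    return 0;
}
```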
The only solution I can think of is to write your own allocator and pass it a pointer to a pool of managed memory that it can use. Both host and device can then allocate out of this space, and “free” as well (i.e. return unused list items to the pool).
Writing your own allocator is an involved task. You can probably find online references for people who have done this in CUDA, but I don’t have anything immediate to suggest.
CUDA 11.2 provided some improved allocation methods (stream-ordered allocation via cudaMallocAsync and memory pools), but these don’t provide a full device-side interface, as I am suggesting is needed here.
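For reference, those are host-side APIs only, roughly along these lines; there is still no device-side equivalent:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

int *buf;
cudaMallocAsync(&buf, 1024 * sizeof(int), stream);  // pool-backed, stream-ordered
// ... launch kernels on `stream` that use buf ...
cudaFreeAsync(buf, stream);
cudaStreamSynchronize(stream);
```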
If you can limit the kinds of operations you will perform on the list, then it might be possible to come up with something that works with the basic allocators, but I am doubtful.
Thank you for a quick and detailed response. I am trying to do something that I saw in another post is impossible. I am close to making it work after overcoming several hurdles, but I think I have finally come to the end of the road. I don’t give up easily, and it is hard for me to let things go. But I will have to force myself to let this go.
I have been trying to use a priority queue created on the host and pass it to the device. I restricted inserting and deleting nodes to a single block with a single thread, because I know it would be awful to use a priority queue with multiple blocks and threads. The priority queue works for the most part. I can execute many methods like size(), empty(), and even insert() and delete() successfully. I have narrowed everything down to one problem: the linking of the nodes so that I can traverse the queue using next and/or prev. Your explanation helps me understand why.
At first, I was using cudaMalloc and cudaMemcpy to pass the priority queue. It worked except for the memory allocation issues. I was using CUDA 5.0 at the time. Then I found out about unified memory, so I moved to CUDA 6.5 and used cudaMallocManaged. I was amazed at how well the methods executed and worked on the device side. But again, the only issue is allocating memory to link the nodes together and traverse the list. I don’t know if I can create my own allocator (because of time), so I will just settle for keeping the priority queue on the host and passing multiple individual nodes to the device to process them. I believe that will cut my processing time in half, if not more.
Also, please let me know where I can find the programming guide you mentioned.
Most CUDA documentation can be found here. Look for the programming guide in the column on the left and click on that.
Sorry, I just noticed you replied to my request. I didn’t get a notification. Thanks!