I'm trying to copy a linked list onto the GPU. Let's say I have a simple struct called Cuda. Can I make an array of Cuda pointers and malloc them onto the GPU?
By definition, pointers point to specific memory locations. A pointer in your host code will point to a location in main memory, while a pointer in your device code will point to a location in device memory. Transferring a host pointer to the device, or vice versa, would be meaningless (and would almost certainly do bad things if you tried to use it). You have to transfer the actual objects. I know it's possible to transfer a struct, but I failed when I tried a while back and didn't try again, so I can't help you there.
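A minimal sketch of what "transfer the actual object" looks like, reusing the struct name Cuda from the question (the field layout here is made up): cudaMalloc reserves device memory and cudaMemcpy copies the struct's bytes into it. Note that any pointer members inside the struct would still hold host addresses after the copy.

#include <cuda_runtime.h>

struct Cuda {
    int   id;      /* hypothetical fields; the question doesn't show the layout */
    float value;
};

int main(void)
{
    struct Cuda h = { 42, 3.14f };   /* lives in host memory */
    struct Cuda *d;                  /* will hold a *device* address */

    cudaMalloc((void **)&d, sizeof(struct Cuda));                    /* reserve device memory */
    cudaMemcpy(d, &h, sizeof(struct Cuda), cudaMemcpyHostToDevice);  /* byte-wise copy */

    /* ... pass d to kernels; dereferencing d on the host would be invalid ... */

    cudaMemcpy(&h, d, sizeof(struct Cuda), cudaMemcpyDeviceToHost);  /* copy results back */
    cudaFree(d);
    return 0;
}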
As far as linked lists go, the same applies: the next pointers will be invalidated by the transfer. You'll have to put the nodes into an array first, transfer that to the GPU, and then recreate the linked list there, although at that point you'd probably be better off just keeping it as an array on the GPU.
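A hedged sketch of that flattening step, assuming a simple node layout (not from the original post). Storing each node's successor as an array index rather than a pointer sidesteps the invalid-pointer problem entirely, so nothing needs to be relinked on the device:

#include <cuda_runtime.h>
#include <stdlib.h>

struct HostNode { float data; struct HostNode *next; };

/* Walk the host list once to count it, then pack values and successor
   indices into two flat arrays (next_idx[i] == -1 marks the tail). */
static int flatten(struct HostNode *head, float **data, int **next_idx)
{
    int n = 0;
    for (struct HostNode *p = head; p != NULL; p = p->next) n++;

    *data     = (float *)malloc(n * sizeof(float));
    *next_idx = (int *)malloc(n * sizeof(int));

    int i = 0;
    for (struct HostNode *p = head; p != NULL; p = p->next, i++) {
        (*data)[i]     = p->data;
        (*next_idx)[i] = (p->next != NULL) ? i + 1 : -1;  /* nodes land in list order */
    }
    return n;
}

void upload(struct HostNode *head, float **d_data, int **d_next, int *n_out)
{
    float *h_data; int *h_next;
    int n = flatten(head, &h_data, &h_next);

    cudaMalloc((void **)d_data, n * sizeof(float));
    cudaMalloc((void **)d_next, n * sizeof(int));
    cudaMemcpy(*d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_next, h_next, n * sizeof(int),  cudaMemcpyHostToDevice);

    free(h_data); free(h_next);
    *n_out = n;
}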
Your information was very helpful. So a CPU linked list is essentially a chain of nodes, each pointing to the next node. If I understand you correctly, I should turn my linked list into an array. The only problem with that is that I will have a loop-carried dependence issue. Would you agree that implementing a linked list on the GPU is almost impossible, considering the restrictions that come with the GPU? I feel that dynamic problems don't map well to the GPU.
Linked lists on the GPU are possible; however, they might not be worth the overhead of creation and pointer management (after all, device memory mostly has to be managed from the host side). And there's the runtime overhead of dereferencing all the ->next pointers during traversal.
One of the next CUDA SDK versions will hopefully add new and delete operators on the GPU (Fermi only); then this entire concept might work a little better.
Even if you go to the trouble of making it work, linked lists are fundamentally a terrible data structure for parallel access. (And they are way overused, because a linked list is the first non-trivial data structure taught in CS classes.) Reading the n-th element of a linked list is an O(n) operation, which immediately destroys the performance of almost any parallel algorithm operating on such a list. Reading from an array is O(1).
Yeah, sure, linked lists are O(1) insert and remove, but fewer apps depend on that performance than you might think. If your elements are very large, so that inserting into an array means expensive copies, use an array of pointers to objects on the heap instead, which still gives O(1) random access. (This is how Python implements its "list" data structure.) The same solution works if your objects are irregular in size. Building an array of pointers on a CUDA device is still a bit of a hassle (requiring N+1 calls to cudaMalloc and cudaMemcpy; see the sketch below), but it is far easier to code than allocating a linked list.
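Here is what that N+1 pattern might look like, as a sketch (the Big struct and function name are hypothetical): one cudaMalloc/cudaMemcpy pair per element to place it on the device, plus one more pair to upload the array of device pointers itself.

#include <cuda_runtime.h>
#include <stdlib.h>

struct Big { float payload[1024]; };   /* stand-in for a large or irregular element */

struct Big **build_device_ptr_array(struct Big *h_elems[], int n)
{
    /* Host-side scratch array that accumulates *device* addresses. */
    struct Big **h_dptrs = (struct Big **)malloc(n * sizeof(struct Big *));

    for (int i = 0; i < n; i++) {   /* N allocations and copies */
        cudaMalloc((void **)&h_dptrs[i], sizeof(struct Big));
        cudaMemcpy(h_dptrs[i], h_elems[i], sizeof(struct Big),
                   cudaMemcpyHostToDevice);
    }

    struct Big **d_ptrs;            /* the +1: the pointer array itself */
    cudaMalloc((void **)&d_ptrs, n * sizeof(struct Big *));
    cudaMemcpy(d_ptrs, h_dptrs, n * sizeof(struct Big *),
               cudaMemcpyHostToDevice);

    free(h_dptrs);
    return d_ptrs;   /* kernels can index d_ptrs[i]->payload directly */
}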
Also, linked lists (and arrays of pointers) whose nodes contain only a few simple data types (say, a linked list of integers) can have terrible cache behavior on CPUs and GPUs. The cache line on a Core i7 is 64 bytes, and on Fermi I believe it is 128 bytes. If your malloc implementation returns memory blocks aligned to intervals bigger than your node contents, you will waste some fraction of each cache line on non-data when you read later. I point out the Core i7 case because even on CPUs, large linked lists are not cache-friendly. A real array, however, makes 100% use of each cache line.
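To put rough numbers on that (assuming a 64-byte line and a node holding one int plus a 64-bit next pointer, padded to 16 bytes; both are assumptions):

#include <stdio.h>

struct Node { int data; struct Node *next; };  /* 4 + 8 bytes, padded to 16 on 64-bit */

int main(void)
{
    /* Worst case: each malloc'd node sits on its own 64-byte cache line,
       so a traversal fetches 64 bytes to use 4 bytes of payload. */
    printf("sizeof(struct Node) = %zu bytes\n", sizeof(struct Node));
    printf("payload per 64-byte line: 4 bytes (%.1f%% wasted)\n",
           100.0 * (64 - 4) / 64);
    /* A plain int array packs 16 useful values into that same line. */
    return 0;
}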
Now, I'm not saying that linked lists are completely useless. For thread-private data (i.e., no parallel access), and maybe in cases where you are basically using the linked list as a queue or stack (only ever accessing the first or last element), a case can be made. It's hard to imagine anything that is totally useless, except perhaps BogoSort, but if your solution to a problem is a linked list, it is worth pausing to reconsider.
So, in summary, I will be getting a T-shirt: “Just say ‘no’ to linked lists.” :)