Link List using CUDA

I know the basic concepts on memory management. Thanks for you help in advance:

Any suggestion?

Pretty much any linked list data structure is going to have terrible performance on the device. If you can flatten it into an array before copying to the device, it will be much easier to get good memory bandwidth.

Is it possible for a kernel point to a struct then update each element of my linked list called Zone?

Sure, if you can get it onto the device. Transferring data structures that contain other pointers to the device is very tedious. You will have to cudaMalloc() each node in the list on the device, then ensure that when you cudaMemcpy each node to the device, the address in the next pointer is the device address of the next node, rather than the host address. The bookkeeping will be tricky, though. When you are done, reading the linked list back will equally complicated.

Once on the device, consider how you will access this linked list data structure in parallel. If each thread is operating on a different node, then thread #100 will have to step through 100 pointers to read its node, which is very memory bandwidth inefficient in CUDA.

CUDA programming works much better with flat arrays because each thread knows immediately the address of the element it is operating on, and the memory access pattern naturally looks like large contiguous reads, which work best with the wide memory bus.

Do you have any suggestion as to how I should begin to evole my code to an array? A little kernel coding

Thanks for your patience

How you do this depends on the size and use of the Zone struct.

You can make an array of Zones on the host, copy the linked list contents into that array, cudaMalloc an array on the device of the same size, and then cudaMemcpy() your host array to the device array.

If a Zone struct is not 4, 8, or 16 bytes in size, having each CUDA thread read a different zone will not be as efficient as it could be. (Best memory bandwidth is achieved when you make coalesced reads, and that requires each thread to read elements of size 4, 8, or 16 bytes, with adjacent numbered threads accessing adjacent memory locations.) In that case, you might consider converting your linked list of structs into several arrays, one for each member field in the struct. Allocating each member separately will make it easier to achieve coalescing on the device, with the annoyance that you’ll have to pass around several pointers instead of just one.

A more significant question is whether the overhead of converting a linked list to an array, copying the array to the device, operating on it, then copying the array back to the host still makes the problem worth it. How big is this list? Are there multiple operations you could do to the Zones once they are in device memory before copying them back? Is your goal here to practice some CUDA programming, or to see a meaningful speedup in your current code?

One more thing: if your linked list is very large and you are seeing performance problems already, you should consider whether converting to use an array in the CPU code is easy to do. Using CUDA will be much simpler if the data structure is already in array form.

My goal is to pratice some cuda programming