Hello, I’m working on implementing a parallel linked data structure, and I know that atomic operations such as CAS can be used to update the links concurrently.
My question is: if I have multiple kernels executing concurrently, will atomic operations still be effective? My intuition says they shouldn’t be limited to working within a single kernel, but I need to confirm it.
I’ll also pose a second question in case any of you can point me in the right direction.
While implementing the same structure on the host, I noticed a performance improvement when I switched to a slab allocation approach for the nodes.
Has any similar work been done for a GPU application? Basically, I’m looking for an example that allocates a block of memory and assigns each thread a pointer to a portion of that block. Rather than deallocating memory, I’d rather have the threads recycle the pointers. A sketch of what I have in mind follows below.
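To make this concrete, here is a minimal sketch of the kind of pool I mean (all the names are mine, and this is an illustration rather than a finished allocator): one big block of nodes is allocated up front, threads grab slots with an atomic bump pointer, and instead of freeing, they push indices onto a lock-free free list. I’m aware a simple Treiber-style list like this has the ABA problem and would need fences/volatile reads in a real implementation, which is part of why I’m asking whether anyone has done this properly on the GPU.

```cuda
// Hypothetical node pool: one big device allocation, indices instead of
// raw pointers, recycling through a lock-free free list. Sketch only.
struct Node {
    int key;
    int value;
    int next;                 // index of the next node, -1 == null
};

struct NodePool {
    Node *nodes;              // pre-allocated block of 'capacity' nodes
    int   capacity;
    int  *bump;               // next never-used slot, starts at 0
    int  *freeHead;           // head of the recycled-index list, -1 == empty
};

__device__ int poolAlloc(NodePool p)
{
    // First try to pop a recycled index off the free list.
    // (A real version needs volatile reads / fences and ABA protection.)
    int head = *p.freeHead;
    while (head != -1) {
        int next = p.nodes[head].next;
        int old  = atomicCAS(p.freeHead, head, next);
        if (old == head) return head;   // we won the pop
        head = old;                     // lost the race; retry with fresher head
    }
    // Free list empty: fall back to the bump pointer.
    int idx = atomicAdd(p.bump, 1);
    return (idx < p.capacity) ? idx : -1;   // -1 == pool exhausted
}

__device__ void poolFree(NodePool p, int idx)
{
    // Push the index back onto the free list instead of deallocating.
    int head;
    do {
        head = *p.freeHead;
        p.nodes[idx].next = head;
    } while (atomicCAS(p.freeHead, head, idx) != head);
}
```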
A linked data structure? That doesn’t sound like a good fit for the GPU’s memory architecture, because when your threads do not access consecutive elements in the chain, the memory accesses will not be coalesced.
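To illustrate with a toy example of my own: in the first kernel below, a warp’s 32 loads fall on consecutive addresses and coalesce into a few wide transactions; in the second, each thread chases its own chain, so every hop lands at an effectively arbitrary address.

```cuda
struct ListNode { int val; ListNode *next; };

// Coalesced: thread i touches data[i], so neighboring threads touch
// neighboring addresses.
__global__ void scale_array(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Uncoalesced: each thread walks its own list, so the warp's loads are
// scattered across memory and cannot be combined.
__global__ void scale_lists(ListNode * const *heads, int nLists)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nLists)
        for (ListNode *p = heads[i]; p != NULL; p = p->next)
            p->val *= 2;
}
```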
Atomic operations on compute capability 2.x devices are performed directly in L2, which is shared by all streaming multiprocessors. So I suppose that even if the operations come from different concurrent kernels in the same context, they will still work correctly.
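For example, something like this (my own toy test, not from any official sample) should print the exact total even while the two kernels overlap, since both sets of atomics are resolved on the same L2:

```cuda
#include <cstdio>

__global__ void bump(int *counter, int iters)
{
    for (int i = 0; i < iters; ++i)
        atomicAdd(counter, 1);      // resolved in L2, visible device-wide
}

int main()
{
    int *counter;
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Two kernels in the same context, potentially running concurrently.
    bump<<<32, 128, 0, s0>>>(counter, 1000);
    bump<<<32, 128, 0, s1>>>(counter, 1000);
    cudaDeviceSynchronize();

    int result;
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("got %d, expected %d\n", result, 2 * 32 * 128 * 1000);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(counter);
    return 0;
}
```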
I understand that linked structures are not a great fit for the GPU architecture. What I’m trying to do is implement a parallel structure that stores (key, value) pairs in sorted order: basically the same functionality as STL map and multimap, but with insertions and deletions of multiple keys done in parallel.
I already tried implementing it with a thrust::device_vector as the underlying container plus binary search, but all the memory shifting that has to occur on insertion proved to be very slow.
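Roughly, my first attempt looked like the sketch below (simplified to keys only): the lower_bound is cheap, but the insert has to shift everything after the insertion point.

```cuda
#include <thrust/device_vector.h>
#include <thrust/binary_search.h>

void insert_key(thrust::device_vector<int> &keys, int k)
{
    // O(log n) to find the insertion point...
    thrust::device_vector<int>::iterator pos =
        thrust::lower_bound(keys.begin(), keys.end(), k);
    // ...but O(n) device-side data movement to make room for it.
    keys.insert(pos, k);
}
```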
Now I’m implementing this structure as a skip-list.
Does anyone have suggestions for a good “dictionary”-type structure that stores elements sorted by key? (Not a hash map: even though hash maps provide fast lookup and insertion, the data is not kept sorted.)
A skip-list is good for multi-core CPUs. It doesn’t work well on the GPU if most of the insertions/deletions are done on non-consecutive elements.
Take a list ABCD. An insertion pattern like A(A1) B(B1) C(C1) D(D1), i.e. A1 inserted after A, B1 after B, and so on, would be excellent: as long as A, B, C and D appear consecutively on the same layer, neighboring threads touch neighboring nodes, so the choice of nodes is important. But you want things to stay sorted, so this pattern is not going to appear.
It really depends on the size of your dataset, the frequency of insertions/deletions, and the intensity of the arithmetic you do on the elements.