Shared memory - dynamic allocation

Hi, I’d like to ask for some design advice.

I have a device-side linked list which is created dynamically and stored in global memory, with one list per thread.

This works, but turns out to be slow: by the time the list is needed, it has dropped out of the L1 cache.

I’d like to reimplement it in shared memory but am struggling with how to dynamically allocate the elements of the list. I can create a shared array to hold a pointer to the head of each thread’s linked list within a block. But when it comes to how to then create the head and subsequent nodes in shared memory, I’m at a loss.

I’ve read about using extern __shared__ to size arrays dynamically at launch time, but what I’d effectively like to do is something like

node = new Node();

where operator new is overloaded appropriately so the memory is allocated in shared memory.

Any thoughts?

Many thanks

I assume there is a tight upper bound on the number of list elements? Shared memory is pretty small, and you state you have a list per thread.

One can build a list on top of an array in a straightforward manner, using array indices (probably of type ‘unsigned short int’ in this case) instead of pointers for linkage. Typically one would use a doubly-linked circular list with an anchor node. You would keep a counter for the number of nodes currently in the list to find the next free array element. Is the list only ever added to, or do deletions also occur? If the latter, you would want to do compaction after deleting a node to avoid keeping a free list.

Here is a sketch (written in browser, beware of bugs) of how to build a list of finite maximum size (NBR_NODES) on top of an array:

struct node_t {
    unsigned short prev, next;
    /* payload */
};

struct list_t {
    struct node_t node[NBR_NODES];
    int nodes_used;
};

struct list_t list;

void init (void)  /* create anchor node */
{
    list.node[0].prev = 0;
    list.node[0].next = 0;
    list.nodes_used = 1;
}

void insert (int idx)  /* insert new node after list.node[idx] */
{
    /* allocate node from the next free array slot; the caller must
       ensure list.nodes_used < NBR_NODES */
    int new_node = list.nodes_used++;

    /* link the new node */
    list.node[new_node].prev = idx;
    list.node[new_node].next = list.node[idx].next;
    list.node[list.node[idx].next].prev = new_node;
    list.node[idx].next = new_node;
}

void delete_node (int idx)  /* remove node after list.node[idx] */
{
    int succ = list.node[idx].next;
    int last_alloced = list.nodes_used - 1;

    /* unlink node */
    list.node[idx].next = list.node[succ].next;
    list.node[list.node[succ].next].prev = idx;

    /* compact: physically move last allocated node into location of
       deleted node */
    if (succ != last_alloced) {
        int prev = list.node[last_alloced].prev;
        int next = list.node[last_alloced].next;
        list.node[succ] = list.node[last_alloced]; /* move node */
        list.node[prev].next = succ;
        list.node[next].prev = succ;
    }

    /* free node */
    list.nodes_used--;
}

Thanks for the quick response. That’s made me think about the problem more.

One thing I’m also keen on is to produce code that compiles both as CUDA and as plain C++ on a CPU, which is important for testing purposes in my case. I already have a tested linked list class that I’m using and I’d prefer not to have to write a new one for the GPU (and likewise for any other classes I later decide to allocate in shared memory).

My thoughts now are to pre-allocate an array of shared memory that is big enough for all my allocations. I can then treat this as a shared-memory heap and write my own malloc and free functions that utilise this memory.

This would allow shared memory to be allocated in just the same way as global memory.
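For what it’s worth, here is a minimal host-side sketch of that idea, written as a bump allocator over a fixed pool. The names (pool_malloc, pool_reset) and the pool size are my own placeholders; on the device, the static array would instead be a __shared__ (or extern __shared__) array.

```c
#include <stddef.h>

#define HEAP_BYTES 4096

/* on the device this array would be declared __shared__ */
_Alignas(8) static unsigned char heap[HEAP_BYTES];
static size_t heap_top = 0;   /* offset of next free byte */

/* bump allocator: hand out aligned chunks from the fixed pool */
static void *pool_malloc(size_t bytes)
{
    size_t aligned = (heap_top + 7) & ~(size_t)7;  /* 8-byte align */
    if (aligned + bytes > HEAP_BYTES) return NULL; /* pool exhausted */
    heap_top = aligned + bytes;
    return &heap[aligned];
}

/* a pure bump allocator cannot free individual blocks; reset
   releases everything at once, which is fine when all allocations
   share the lifetime of one kernel launch */
static void pool_reset(void)
{
    heap_top = 0;
}
```

Note that a bump allocator like this has no real free(): if your lists allocate and release nodes at arbitrary times, you would need either a free list or the compaction scheme from the earlier sketch on top of it.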

Does this sound sensible, or am I missing something fundamental that will stop this approach from working?

If you do not want to be limited to a fixed number of list entries per thread as in my example, you could certainly write a simplistic dynamic allocator along the same lines. Just take care to properly synchronize between the threads in that case, as the shared-memory heap will be shared by all threads.
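To illustrate the synchronization point: each thread can reserve its chunk with a single atomic fetch-add on the shared bump pointer. Below is a sketch using C11 atomics as a stand-in; on the GPU the equivalent would be atomicAdd on an int counter in shared memory. The names and pool size are assumptions.

```c
#include <stdatomic.h>
#include <stddef.h>

#define POOL_BYTES 4096

_Alignas(8) static unsigned char pool[POOL_BYTES];  /* __shared__ on the device */
static atomic_size_t pool_top = 0;                  /* shared bump pointer */

/* each thread reserves its chunk with one atomic fetch-add, so no
   two threads can receive overlapping regions of the pool */
static void *pool_alloc(size_t bytes)
{
    size_t aligned = (bytes + 7) & ~(size_t)7;  /* keep pool 8-byte aligned */
    size_t old = atomic_fetch_add(&pool_top, aligned);
    if (old + aligned > POOL_BYTES) return NULL;  /* pool exhausted */
    return &pool[old];
}
```

The fetch-add makes concurrent allocation safe, but note it is still allocate-only; supporting free() across threads is considerably more involved.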

You may also want to ponder whether a list is the most appropriate data structure for your use case. If there is only a little data per thread, a brute-force search in an array could be faster and eliminates the overhead of control information needed for lists or trees. Shared memory is a precious resource.
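For concreteness, the brute-force alternative looks like this (a sketch; the key/value layout and per-thread capacity are assumed). Without prev/next links, every byte of shared memory goes to payload:

```c
#define MAX_ITEMS 16   /* assumed per-thread capacity */

/* per-thread data kept as a plain array; on the device this could be
   a slice of a __shared__ array indexed by threadIdx */
struct item { int key; int value; };

/* linear scan; for a handful of elements this is typically cheaper
   than chasing links */
static int find(const struct item *items, int count, int key)
{
    int i;
    for (i = 0; i < count; i++) {
        if (items[i].key == key)
            return i;
    }
    return -1;  /* not found */
}
```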

As for code that’s portable between host and device: that’s not something I have much experience with. I’ll point out that inter-thread synchronization to control access to a shared heap will likely look quite different on the CPU and the GPU; I’m not sure how you plan to abstract that away.