Shared memory - dynamic allocation

Hi, I’d like to ask for some design advice.

I have a device-side linked list which is created dynamically and stored in global memory, with one list per thread.

This works, but it turns out to be slow: by the time the list is required, it has dropped out of the L1 cache.

I’d like to reimplement it in shared memory, but I’m struggling with how to dynamically allocate the elements of the list. I can create a shared array holding a pointer to the head of each thread’s linked list within a block. But when it comes to how to then create the head and subsequent nodes in shared memory, I’m at a loss.

I’ve read about the use of extern __shared__ to size arrays dynamically, but what I’d effectively like to do is something like

node = new Node();

where operator new is overloaded appropriately so that the memory is allocated in shared memory.
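
Roughly what I’m picturing at the call site (Node and BLOCK_SIZE here are just placeholders for my actual class and block dimensions):

__shared__ Node *heads[BLOCK_SIZE]; // one list head per thread in the block

Node *node = new Node();            // want this allocation to land in shared memory
node->next = heads[threadIdx.x];
heads[threadIdx.x] = node;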

Any thoughts?

Many thanks

I assume there is a tight upper bound on the number of list elements? Shared memory is pretty small, and you state you have a list per thread.

One can build a list on top of an array in a straightforward manner, using array indices (probably of type ‘unsigned short int’ in this case) instead of pointers for linkage. Typically one would use a doubly-linked circular list with an anchor node. You would keep a counter of the number of nodes currently in the list to find the next free array element. Is the list only ever added to, or do deletions also occur? If the latter, you would want to compact after deleting a node, to avoid having to keep a free list.

Here is a sketch (written in browser, beware of bugs) of how to build a list of finite maximum size (NBR_NODES) on top of an array:

#define NBR_NODES 100 /* example value; set to the upper bound on list elements */

struct node_t {
    unsigned short prev, next;
    /* payload */
};

struct list_t {
    struct node_t node[NBR_NODES];
    int nodes_used;
};

struct list_t list;

void init (void)
{
    /* create anchor node */
    list.node[0].prev = 0;
    list.node[0].next = 0;
    list.nodes_used = 1;
}

void insert (int idx) /* insert node after list.node[idx]; assumes nodes_used < NBR_NODES */
{
    int new_node = list.nodes_used;

    /* allocate node */
    list.nodes_used++;

    /* link the new node */
    list.node[new_node].prev = idx;
    list.node[new_node].next = list.node[idx].next;
    list.node[list.node[idx].next].prev = new_node;
    list.node[idx].next = new_node;
}

void delete_after (int idx) /* remove node after list.node[idx]; 'delete' is reserved in C++ */
{
    int succ = list.node[idx].next;
    int last_alloced = list.nodes_used - 1;

    /* unlink node */
    list.node[idx].next = list.node[succ].next;
    list.node[list.node[succ].next].prev = idx;

    /* compact: physically move last allocated node into location of deleted node */
    if (succ != last_alloced) {
        int prev = list.node[last_alloced].prev;
        int next = list.node[last_alloced].next;
        list.node[succ] = list.node[last_alloced]; /* move node */
        list.node[prev].next = succ;
        list.node[next].prev = succ;
    }

    /* free node */
    list.nodes_used--;
}
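
Usage would be along these lines (again untested): initialize once, insert relative to an existing node, and traverse by chasing next indices until you wrap back around to the anchor at index 0:

init ();
insert (0); /* first payload node, linked right after the anchor */
insert (1); /* second node, inserted after the first */

/* traverse: start after the anchor, stop when we return to it */
for (int i = list.node[0].next; i != 0; i = list.node[i].next) {
    /* process payload of list.node[i] */
}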

Thanks for the quick response. That’s made me think about the problem more.

One thing I’m also keen on is to produce code that compiles both for the GPU and as plain C++ on a CPU, which is important for testing purposes in my case. I already have a tested linked list class that I’m using, and I’d prefer not to have to write a new one for the GPU (and the same goes for any other classes I later decide to allocate in shared memory).

My thoughts now are to pre-allocate an array of shared memory that is big enough for all my allocations. I can then treat this as a shared-memory heap and write my own malloc and free functions that utilise this memory.

This would allow shared memory to be allocated in just the same way as global memory.

Does this sound sensible, or am I missing something fundamental that will stop this approach from working?
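
To make that concrete, what I have in mind is a class-level operator new that routes into the shared-memory allocator, so my existing (tested) list class stays untouched. smem_malloc and smem_free are the functions I would still have to write, and for the CPU build I’d compile the qualifiers away with a macro and fall back to plain malloc/free (untested sketch):

#include <cstddef>

__device__ void *smem_malloc(size_t size); // carves from a pre-allocated __shared__ pool
__device__ void  smem_free(void *ptr);     // could be a no-op in a simple allocator

class Node {
public:
    __device__ void *operator new(size_t size)  { return smem_malloc(size); }
    __device__ void  operator delete(void *ptr) { smem_free(ptr); }
    Node *next;
    /* payload */
};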

If you do not want to be limited to a fixed number of list entries per thread as in my example, you could certainly write a simplistic dynamic allocator along the same lines. Just take care to properly synchronize between the threads in that case, as the shared-memory heap will be shared by all threads.
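
Here is a minimal sketch of such an allocator (once more written in browser, beware of bugs): a bump allocator that never reuses memory, with atomicAdd providing the synchronization so concurrent callers within a block receive disjoint offsets:

__device__ void *smem_malloc (unsigned char *heap, unsigned int *heap_used,
                              unsigned int heap_bytes, unsigned int size)
{
    size = (size + 7) & ~7u; /* keep allocations 8-byte aligned */
    /* atomicAdd hands each concurrent caller its own disjoint offset */
    unsigned int offset = atomicAdd (heap_used, size);
    return (offset + size <= heap_bytes) ? (void *)(heap + offset) : 0;
}

__global__ void kernel (void)
{
    __shared__ unsigned char heap[4096]; /* the shared-memory heap */
    __shared__ unsigned int heap_used;

    if (threadIdx.x == 0) heap_used = 0; /* single-thread initialization ... */
    __syncthreads ();                    /* ... completed before anyone allocates */

    void *p = smem_malloc (heap, &heap_used, sizeof (heap), 16);
    /* use p; a return value of 0 means the heap is exhausted */
}

Supporting a real free() of individual nodes would require a free list and more careful synchronization, which is where this stops being simple.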

You may also want to ponder whether a list is the most appropriate data structure for your use case. If there is only a little data per thread, a brute-force search in an array could be faster, and it eliminates the overhead of the control information needed for lists or trees. Shared memory is a precious resource.
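
For comparison, the brute-force variant needs no link fields at all; something like the following (capacity and payload again just for illustration):

#define MAX_ITEMS 16 /* example per-thread capacity */

struct table_t {
    int key[MAX_ITEMS]; /* plus parallel payload arrays as needed */
    int items_used;
};

/* linear scan; for a handful of elements this can beat chasing links */
int find (struct table_t *tbl, int key)
{
    for (int i = 0; i < tbl->items_used; i++) {
        if (tbl->key[i] == key) return i;
    }
    return -1; /* not found */
}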

As for code that’s portable between host and device: that’s not something I have much experience with. I’ll point out that inter-thread synchronization on the CPU and on the GPU to control access to a shared heap will likely look quite different; I’m not sure how you plan to abstract that away.