Shared queue

I need to implement a shared queue on CUDA. It needs to be shared by all threads, so it will probably need to be implemented in global memory. Does anybody know of any good implementations already in use for a shared queue on CUDA?

This kind of data structure is very hard to construct in CUDA and is rather inefficient. Can you redefine your program to remove the need for a shared queue?

I’m new to CUDA, so a “shared queue free” solution is not immediately obvious to me. I have a set of shared data that each thread operates on independently given a single input, i.e. each thread operates on one element of an array given a single read-only input. Then each thread’s result (if it produces one) needs to be added to the queue and used as an input in a subsequent iteration of a while loop, which runs until the queue is empty. I could store the results in a local queue per thread, but coordinating those becomes a mess.

This sounds like an input array, an output array, and an “output generated” flag array. You might need to follow this output with a compaction step to remove the unused slots from the output array.

Hm… that seems like something I should be able to work with; I’ll look into it. Thanks.

I implemented the enqueue; I hope this helps you. If you find a solution for makequeue, please post it here in this forum.

#define MaxSize 10

typedef struct Queue {
    int *QItems;   // backing storage in global memory
    int *index;    // current number of items
} PriQueue;

__device__ PriQueue q;   // global queue, visible to all kernels

__global__ void makequeue() { /* not implemented yet */ }

__global__ void dequeue() { /* not implemented yet */ }

__global__ void enqueue(int newnode) {
    // per-block staging queue (note: __shared__ variables are only
    // legal inside a kernel, so the declarations were moved here)
    __shared__ int localQ[MaxSize];
    __shared__ int localQ_index;
    __shared__ int globalQ_index;

    if (threadIdx.x == 0)
        localQ_index = 0;
    __syncthreads();

    // insert the new node in the local queue
    int index = atomicAdd(&localQ_index, 1);
    localQ[index] = newnode;
    __syncthreads();

    // thread 0 reserves this block's range in the global queue
    if (threadIdx.x == 0)
        globalQ_index = atomicAdd(q.index, localQ_index);
    __syncthreads();

    // copy the local queue into the global queue (other kernels can see it)
    if (threadIdx.x < localQ_index)
        q.QItems[globalQ_index + threadIdx.x] = localQ[threadIdx.x];
}