CUDA traversal of quad tree using stack push and pop

Each thread in my block traverses the quad tree based on different conditions, so I want each thread to maintain its own array of nodes in shared memory. Suppose I divide a dynamic shared memory allocation among the threads as below:

extern __shared__ NODE smem_children[];   // dynamic shared memory, sized at kernel launch

// Every thread runs this line, so each one gets a pointer to its own 64-node slice.
NODE *ptr = &smem_children[threadIdx.x * 64];

Here ptr is the pointer I use to divide the dynamic shared memory into a separate region for each thread. (I am not sure about this, but I believe this is how it could be done.)
How do I store nodes into these regions? I do not understand how to index the separate arrays to store my nodes.
After storing the nodes, once I reach leaf nodes, I would like to pop them and evaluate them.
Can anyone give a code sample for pushing nodes onto, and popping them from, a stack that lives in shared memory? Any help or suggestions would be much appreciated. Thanks in advance.

How many nodes per thread (at most)?
What is the data type, and how much data is there per node?
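
Until those numbers are known, here is a minimal sketch of a per-thread push/pop stack in dynamic shared memory. It is not a definitive implementation. Everything about the tree is an assumption made for illustration: the Node struct, its child and value fields, the flat node array in global memory, the roots array, and the run_demo helper are all hypothetical. The only detail taken from the question is the 64-slot slice per thread. The stack holds int node indices rather than whole nodes, to keep the shared memory footprint small.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical node layout; the question does not say what NODE contains.
struct Node {
    int child[4];  // indices of the four children in the node array, -1 if absent
    int value;     // payload that is "evaluated" when the node is a leaf
};

constexpr int STACK_CAPACITY = 64;  // slots per thread, from the i*64 stride in the question

__device__ bool is_leaf(const Node& n) {
    return n.child[0] < 0 && n.child[1] < 0 && n.child[2] < 0 && n.child[3] < 0;
}

// Each thread walks the subtree rooted at roots[tid] and sums its leaf values.
__global__ void traverse(const Node* nodes, const int* roots, int* out, int n_threads) {
    extern __shared__ int smem[];                      // blockDim.x * STACK_CAPACITY ints
    int* stack = &smem[threadIdx.x * STACK_CAPACITY];  // this thread's private slice
    int top = 0;                                       // number of entries currently stacked

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    stack[top++] = roots[tid];        // push the root
    int sum = 0;
    while (top > 0) {
        int idx = stack[--top];       // pop
        const Node n = nodes[idx];    // one global load per visited node
        if (is_leaf(n)) {
            sum += n.value;           // evaluate the leaf
            continue;
        }
        for (int c = 0; c < 4; ++c) {
            // Push each existing child; a child is silently dropped if the stack is full.
            if (n.child[c] >= 0 && top < STACK_CAPACITY)
                stack[top++] = n.child[c];
        }
    }
    out[tid] = sum;
}

// Hypothetical host driver: builds a 9-node tree and runs three traversals.
std::vector<int> run_demo() {
    std::vector<Node> h_nodes(9);
    for (int i = 0; i < 9; ++i) {
        h_nodes[i].value = i;
        for (int c = 0; c < 4; ++c) h_nodes[i].child[c] = -1;
    }
    for (int c = 0; c < 4; ++c) {
        h_nodes[0].child[c] = 1 + c;  // root 0 has children 1..4
        h_nodes[1].child[c] = 5 + c;  // node 1 has children 5..8
    }
    std::vector<int> h_roots = {0, 1, 2};
    const int n = 3, threads = 32;

    Node* d_nodes; int* d_roots; int* d_out;
    cudaMalloc(&d_nodes, h_nodes.size() * sizeof(Node));
    cudaMalloc(&d_roots, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_nodes, h_nodes.data(), h_nodes.size() * sizeof(Node), cudaMemcpyHostToDevice);
    cudaMemcpy(d_roots, h_roots.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // The third launch parameter sizes the extern __shared__ array.
    size_t smem_bytes = threads * STACK_CAPACITY * sizeof(int);
    traverse<<<1, threads, smem_bytes>>>(d_nodes, d_roots, d_out, n);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_nodes); cudaFree(d_roots); cudaFree(d_out);
    return h_out;
}
```

Calling run_demo() from host code should return the leaf sums 35, 26 and 2 for the three roots. A depth-first walk of a quad tree keeps at most 3*depth + 1 entries stacked, so 64 slots cover trees up to depth 21. One design caveat: with contiguous slices, threads of a warp that are at the same stack depth touch words 64 ints apart, which fall in the same bank on hardware with 32 four-byte shared memory banks. Interleaving the stacks, with slot s of thread t stored at smem[s * blockDim.x + t], avoids those conflicts.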

Oh God, this sounds like it’s going to require a lot of global loads. Abort ship! Abort ship!
