Explanation of CUDA cdpQuadtree - sample code by NVIDIA using dynamic parallelism

Hello, I am trying to understand "cdpQuadtree", a sample by NVIDIA written using dynamic parallelism. I am stuck on certain parts of the kernel "build_quadtree_kernel". Is there a complete explanation of this code anywhere, or at least of the kernel part? The comments in the code aren't sufficient for me. These are the parts I don't understand.

In this part of the kernel code:

extern __shared__ int smem[];
volatile int *s_num_pts[4];
for (int i = 0 ; i < 4 ; ++i)
    s_num_pts[i] = (volatile int *) &smem[i*NUM_WARPS_PER_BLOCK];

I understand that smem is dynamic shared memory and that s_num_pts is an array of pointers that divides this memory into 4 parts, one per child. But why is s_num_pts[0] = &smem[0], s_num_pts[1] = &smem[4], … s_num_pts[3] = &smem[12] (assuming NUM_WARPS_PER_BLOCK = 4)? I don't understand why smem is divided in this fashion.
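To make the indexing concrete, here is a minimal host-side sketch in plain C++. It is not the NVIDIA sample itself: an ordinary int array stands in for the dynamic shared memory, NUM_WARPS_PER_BLOCK = 4 is the assumption from above, and child_row_offset, flat_index and set_row_pointers are hypothetical helper names. It suggests that the four pointers are the rows of a flattened 4 x NUM_WARPS_PER_BLOCK table, with one row per child and one column per warp.

```cpp
// Assumption for illustration: 4 warps per block, as in the question.
constexpr int NUM_WARPS_PER_BLOCK = 4;
constexpr int NUM_CHILDREN = 4;

// Hypothetical helper: offset (in ints) of a child's row inside smem,
// i.e. the index used in &smem[i * NUM_WARPS_PER_BLOCK].
constexpr int child_row_offset(int child) {
    return child * NUM_WARPS_PER_BLOCK;
}

// Hypothetical helper: flat smem index reached by s_num_pts[child][warp_id].
constexpr int flat_index(int child, int warp_id) {
    return child_row_offset(child) + warp_id;
}

// Mirrors the loop from the kernel, with a host array standing in for smem.
inline void set_row_pointers(int *smem, volatile int *s_num_pts[NUM_CHILDREN]) {
    for (int i = 0; i < NUM_CHILDREN; ++i)
        s_num_pts[i] = (volatile int *)&smem[i * NUM_WARPS_PER_BLOCK];
}
```

Under these assumptions the rows start at smem[0], smem[4], smem[8] and smem[12], so s_num_pts[c][w] is the element at flat index c * 4 + w, and each warp gets its own counter for each child.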

I also don't understand this part of the code. Suppose there are 4 warps in a block:

if (lane_id == 0)
{
    s_num_pts[0][warp_id] = 0;
    s_num_pts[1][warp_id] = 0;
    s_num_pts[2][warp_id] = 0;
    s_num_pts[3][warp_id] = 0;
}

Are they then setting s_num_pts[0][0], s_num_pts[0][1], s_num_pts[0][2], s_num_pts[0][3], and the corresponding entries of the other three pointers, to zero? If so, why? And why must lane_id (the index of a thread within its warp) be zero in order to reset the per-child point counts?

And in this part of the code,

if (num_pts > 0 && lane_id == 0)
    s_num_pts[0][warp_id] += num_pts;

Again, I don't understand why the condition requires lane_id == 0, and why the count is added to s_num_pts[0][warp_id].
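One way to experiment with this pattern off the GPU is the host-side sketch below, again in plain C++ rather than the sample's own code. It assumes that num_pts is a warp-wide count that is identical in every lane (in the sample it appears to come from a ballot followed by a population count). emulate_ballot, emulate_popc and warp_accumulate are hypothetical stand-ins, not CUDA intrinsics.

```cpp
#include <bitset>
#include <cstdint>

constexpr int WARP_SIZE = 32;

// Host stand-in for a warp ballot: bit l is set when lane l's predicate is true.
inline std::uint32_t emulate_ballot(const bool (&pred)[WARP_SIZE]) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (pred[lane]) mask |= (std::uint32_t{1} << lane);
    return mask;
}

// Host stand-in for a population count: number of set bits in the mask.
inline int emulate_popc(std::uint32_t mask) {
    return static_cast<int>(std::bitset<32>(mask).count());
}

// Emulates one warp updating its own counter slot (s_num_pts[child][warp_id]).
// Every lane sees the same num_pts, but only lane 0 performs the write.
inline void warp_accumulate(const bool (&pred)[WARP_SIZE], int *counter_slot) {
    const int num_pts = emulate_popc(emulate_ballot(pred));
    for (int lane_id = 0; lane_id < WARP_SIZE; ++lane_id) {
        if (num_pts > 0 && lane_id == 0)
            *counter_slot += num_pts;
    }
}
```

If the assumption holds, dropping the lane_id == 0 test in this emulation would add num_pts once per lane, which hints that the test exists so that the warp's total is added exactly once to the slot owned by that warp.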

I am hoping that understanding these parts will help me understand the rest of the kernel code. If someone knows the explanation, please help. A link to an explanation of the code would also be great.