help with reduction method image processing

I’m trying to write a kernel to take a binary image and extract a list of links between neighboring ON pixels in a large 2D array. The ON pixels are sparse. The total number of possible links is large (9 links per pixel * 8million pixels) and is too large to fit in global memory. My solution was to make an array of arrays, with the first dimension equal to the number of blocks, then allocate the number of links per block dynamically. The number of links per block is calculated in one kernel which executes over the entire image, then the links are extracted in a second kernel. The counting code is fast because all the counting is in shared memory and I use a loop at the end of the kernel to add up the results in shared memory without divergence or bank conflicts, following the reduction example code. However, I do not see how to use a similar technique to extract the link data. The way I do it now is to scan the results array in shared memory through all threadID values, only within the thread that has ID zero, and copy the results into global memory. This obviously causes a performance hit, since one thread has to loop through the results from all the threads. I realize this is a stupid way to get the results into global memory. Does anyone see a way to collapse the results in multiple threads while avoiding conflicts?

Here is the code - link is a struct with two unsigned integer members representing the pixel numbers of the two neighbors. links is an array of arrays which has been allocated to the proper size to hold the results, as counted in a separate kernel. The thread block is one dimensional while the grid is 2D. n_links is an array of dimension gridDim.x*gridDim.y, and initialized to zero before launching the kernel. The actual code looks at all 8-connected neighbors but to simplify the code while still demonstrating the problem I’d like to solve, I’ve only included one neighbor. The code gives correct results but should be faster.

__global__ void extractLinks(unsigned int *img, link **links, int *n_links) {

  extern __shared__ unsigned int sdata[];

  int *NODES = (unsigned int*)sdata;

  int *EDGES = (unsigned int*)&NODES[blockDim.x];

 unsigned int tid = threadIdx.x;

  unsigned int bid = blockIdx.x+blockIdx.y*gridDim.x;

 const int yi = blockIdx.y;

  const int xi = blockIdx.x*blockDim.x+threadIdx.x;

  const int p = yi*W+xi;

 if (img[p]) {

//assign node and edge for pixel p, putting a nonzero value in EDGES[tid] if pixel p+1 is also nonzero

    int q = p+1;

    if (img[q]) {

      NODES[tid] = p;

      EDGES[tid] = q;

    }

  }

//this is the bottleneck I want to avoid

  if (tid == 0) {

    for (unsigned int nd = 0; nd < blockDim.x; nd++) {

      if (EDGES[nd]) {

        links[bid][n_links[bid]].NODE = NODES[nd];

        links[bid][n_links[bid]].EDGE = EDGES[nd];

        n_links[bid]++;

      }

    }

  }

}