Approximate flood fill on NVIDIA 8600 GT

Hi all,
I am trying to implement an approximate flood fill algorithm on an NVIDIA 8600 GT. I have divided the 255 x 384 image into strips of 4 x 384 that overlap each other by 1 x 384. The strip containing the seed point is processed first: I take the topmost line of the strip (row 0) and the bottommost line (row 3) and store them in shared memory. The first warp is supposed to do this for 8 images in parallel (assuming the warp size is 8 for the 8600 GT). The next warp is supposed to process the strip above the seed-point strip, reading the stored top and bottom lines of the previous strip to seed the present strip. Thus there are 8 parallel calls (for 8 images) and 85 serial calls (255 / (4 - 1)).
My question is this: can I use shared memory as a storage buffer for serial processing, to pass data from one warp to the next? I am getting results that are different each time the code is called.

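Yes, shared memory can be used to buffer data that is shared between threads in the same block. To avoid race conditions, you must call __syncthreads() between the code that writes the data to shared memory and the code that reads it. Without that barrier, the reading warp may run before the write has completed, which would explain results that change from run to run.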
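As a rough illustration of that pattern, here is a minimal sketch in which one warp writes a boundary row to shared memory and a second warp reads it after the barrier. The kernel name handoffKernel, the helper runHandoffCheck, the 64-thread block, and the test data are all made up for this example and are not taken from the original poster's code. Only the 384-pixel row width comes from the question. It also assumes a reasonably recent CUDA toolkit (cudaDeviceSynchronize was not available in the toolkits of the 8600 GT era).

```cuda
#include <cuda_runtime.h>

#define ROW_WIDTH  384  // strip width from the question
#define BLOCK_SIZE 64   // assumed launch size: two warps of 32 threads

// Hypothetical kernel: warp 0 publishes one boundary row, warp 1 consumes it.
__global__ void handoffKernel(const unsigned char *in, unsigned char *out)
{
    __shared__ unsigned char boundary[ROW_WIDTH];

    int lane = threadIdx.x % warpSize;  // position inside the warp
    int warp = threadIdx.x / warpSize;  // warp index inside the block

    // Producer: warp 0 copies the row into shared memory.
    if (warp == 0) {
        for (int x = lane; x < ROW_WIDTH; x += warpSize)
            boundary[x] = in[x];
    }

    // Barrier: placed outside both branches so every thread in the
    // block reaches it, not just the producer or consumer warp.
    __syncthreads();

    // Consumer: warp 1 can now safely read what warp 0 wrote.
    if (warp == 1) {
        for (int x = lane; x < ROW_WIDTH; x += warpSize)
            out[x] = boundary[x];
    }
}

// Hypothetical host helper: returns 0 if the row arrives intact,
// 1 on a CUDA error, 2 on a data mismatch.
int runHandoffCheck()
{
    unsigned char hostIn[ROW_WIDTH], hostOut[ROW_WIDTH];
    for (int x = 0; x < ROW_WIDTH; ++x) {
        hostIn[x]  = (unsigned char)(x % 251);  // arbitrary test pattern
        hostOut[x] = 0;
    }

    unsigned char *devIn = NULL, *devOut = NULL;
    if (cudaMalloc(&devIn, ROW_WIDTH) != cudaSuccess) return 1;
    if (cudaMalloc(&devOut, ROW_WIDTH) != cudaSuccess) {
        cudaFree(devIn);
        return 1;
    }
    cudaMemcpy(devIn, hostIn, ROW_WIDTH, cudaMemcpyHostToDevice);
    cudaMemset(devOut, 0, ROW_WIDTH);

    handoffKernel<<<1, BLOCK_SIZE>>>(devIn, devOut);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(hostOut, devOut, ROW_WIDTH, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    if (err != cudaSuccess) return 1;

    for (int x = 0; x < ROW_WIDTH; ++x)
        if (hostOut[x] != hostIn[x]) return 2;
    return 0;
}
```

Calling runHandoffCheck() from main and checking for a zero return is enough to try it. Note that the barrier only covers threads in one block; strips handled by different blocks cannot hand data over through shared memory this way.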

Also, the warp size on all CUDA GPUs so far has been 32 threads, not 8. There are only 8 stream processors on each multiprocessor, but the 32 threads in a warp are issued at once and pipelined into the stream processors as a unit.
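If you would rather not hard-code the warp size, it can be read at run time. The short sketch below uses the runtime API's cudaGetDeviceProperties on the host; the helper name queryWarpSize is made up for this example. Inside a kernel, the built-in variable warpSize gives the same value.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the warp size reported for the given
// device, or -1 if the properties cannot be queried (e.g. no device).
int queryWarpSize(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.warpSize;
}
```

Sizing the per-warp work from this value rather than from the number of stream processors keeps the thread indexing consistent with how the hardware actually groups threads.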