The title pretty much sums it up: how would I go about taking shared-memory-sized pieces out of a large array, computing on each piece, writing it back, and then starting on the next one?
Just launch a 1D grid containing 4096 blocks of 8x8x8 threads. Each block ID can be resolved into the logical starting position of its corresponding 8x8x8 tile in 3D space, so the 8x8x8 shared-memory kernel you already have just runs with an offset applied to its global memory loads.
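A rough sketch of that idea, in case it helps. It assumes the array is a cubic 128x128x128 volume of floats (16 tiles per axis, which is one layout that gives 4096 blocks) stored with x varying fastest; your sizes may differ. The names `tileOrigin`, `processTiles` and `runExample` are made up for illustration, and the "multiply by 2" step is only a placeholder for whatever your real 8x8x8 kernel does.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 8;                  // tile edge: 8x8x8 threads per block
constexpr int TILES_PER_DIM = 16;        // assumption: 16^3 = 4096 blocks
constexpr int N = TILE * TILES_PER_DIM;  // assumption: cubic volume, edge 128

// Resolve a linear block ID into the (x, y, z) origin of its tile.
__host__ __device__ inline int3 tileOrigin(int blockId) {
    const int tx = blockId % TILES_PER_DIM;
    const int ty = (blockId / TILES_PER_DIM) % TILES_PER_DIM;
    const int tz = blockId / (TILES_PER_DIM * TILES_PER_DIM);
    return make_int3(tx * TILE, ty * TILE, tz * TILE);
}

__global__ void processTiles(const float* in, float* out) {
    __shared__ float tile[TILE][TILE][TILE];

    // Global coordinates = tile origin + position of this thread in the tile.
    const int3 o = tileOrigin(blockIdx.x);
    const int x = o.x + threadIdx.x;
    const int y = o.y + threadIdx.y;
    const int z = o.z + threadIdx.z;
    const size_t g = (static_cast<size_t>(z) * N + y) * N + x;

    // Load this block's tile from global memory into shared memory.
    tile[threadIdx.z][threadIdx.y][threadIdx.x] = in[g];
    __syncthreads();

    // Placeholder computation; the real per-tile work would go here.
    out[g] = 2.0f * tile[threadIdx.z][threadIdx.y][threadIdx.x];
}

// Host-side driver: returns the number of mismatches, or -1 on a CUDA error.
int runExample() {
    const size_t count = static_cast<size_t>(N) * N * N;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hIn(count), hOut(count);
    for (size_t i = 0; i < count; ++i) hIn[i] = static_cast<float>(i % 97);

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    // 1D grid of 4096 blocks, each with 8x8x8 = 512 threads.
    const int numBlocks = TILES_PER_DIM * TILES_PER_DIM * TILES_PER_DIM;
    processTiles<<<numBlocks, dim3(TILE, TILE, TILE)>>>(dIn, dOut);

    const cudaError_t err =
        cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < count; ++i)
        if (hOut[i] != 2.0f * hIn[i]) ++mismatches;
    return mismatches;
}
```

Calling `runExample()` from `main` and checking for a return value of 0 is enough to confirm that every element was visited exactly once. Note that you don't loop over the tiles yourself: the hardware schedules the 4096 blocks onto the multiprocessors as resources free up, which is what gives you the "finish one piece, start on the next" behaviour.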