Performance question: running FFTs on shifted input windows

Say the data is 8K points, and I'm running a 1K-point FFT, a reduction, etc. from the start of the data. I want to run this 3 times, with the FFT input shifted by 512 points each time, so the windows start at offsets 0, 512 and 1024 within the 8K buffer. For the best performance and simple memory mapping, should I use a for loop that calls cufftPlan1d and the FFT exec 3 times and then maps each output into a 3 x 1K buffer, or is there an easier or faster way to do this with cufftPlanMany or some other method?
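
For concreteness, here is a rough sketch of the batched variant I have in mind. I have not verified it on hardware. It assumes single-precision complex data, an out-of-place transform, and that cuFFT accepts overlapping input batches (an input distance of 512, smaller than the transform length of 1024). The names `run_shifted_ffts` and `window_offset` are made up for illustration.

```cuda
#include <cassert>
#include <cufft.h>
#include <cuda_runtime.h>

// Start index (in complex elements) of window b when consecutive windows are
// idist elements apart. This is the advanced-layout rule
// input[b * idist + x * istride] evaluated at x = 0.
static inline int window_offset(int b, int idist) { return b * idist; }

// Runs `batch` forward C2C FFTs of length n. Window b is read starting at
// d_in + window_offset(b, hop); results are packed back to back in d_out,
// which must hold batch * n elements. The caller must make sure the last
// window (window_offset(batch - 1, hop) + n) stays inside the input buffer.
// Returns 0 on success, 1 on any cuFFT failure.
int run_shifted_ffts(cufftComplex *d_in, cufftComplex *d_out,
                     int n, int hop, int batch)
{
    // The input windows overlap, so the transform has to be out of place.
    assert(d_in != d_out);

    int dims[1]    = { n };
    // inembed/onembed must be non-NULL; with NULL, cuFFT ignores the
    // stride and distance arguments and assumes contiguous batches.
    int inembed[1] = { n };
    int onembed[1] = { n };

    cufftHandle plan;
    if (cufftPlanMany(&plan, 1, dims,
                      inembed, 1, hop,   // input: stride 1, windows hop apart
                      onembed, 1, n,     // output: stride 1, windows n apart
                      CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        return 1;
    }

    cufftResult r = cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
    return r == CUFFT_SUCCESS ? 0 : 1;
}
```

For my case the call would be `run_shifted_ffts(d_in, d_out, 1024, 512, 3)`, with `d_in` pointing at the 8K buffer and `d_out` at the 3 x 1K output. In real code I would create the plan once and reuse it instead of rebuilding it on every call.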

Also, is there any way to tell cuFFT the zero-padded size without actually zero-padding the data in memory before the FFT?