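It should be possible to read the input and keep a running minimum without doing any sort of parallel reduction sweep until the block has finished reading all of its input data. Take a look at the code I indicated earlier. It appears that you are doing a full shared-memory sweep each time you load a set of 64/128 inputs. That should not be necessary: each thread can keep its own running minimum in a register while it reads, and the block then needs only a single shared-memory sweep at the very end.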
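To make the idea concrete, here is a minimal sketch of what I mean. It is not the code I pointed to earlier and it is not your kernel; the names `minKernel`, `deviceMin`, `kThreads` and `blockMins` are made up for illustration, and I am assuming `float` inputs, a power-of-two block size of 128, and a final pass over the per-block results on the host. Error checking is left out to keep it short.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

// Each thread walks the input with a grid-stride loop and keeps a running
// minimum in a register. Shared memory is written once, after all reads are
// done, and the shared-memory sweep runs exactly once per block.
__global__ void minKernel(const float* in, int n, float* blockMins)
{
    __shared__ float smin[kThreads];

    float localMin = FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        localMin = fminf(localMin, in[i]);   // no synchronization needed here

    smin[threadIdx.x] = localMin;
    __syncthreads();

    // The single tree reduction over shared memory for this block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockMins[blockIdx.x] = smin[0];
}

// Hypothetical host wrapper: launches the kernel and combines the per-block
// minima on the CPU. Returns FLT_MAX for an empty input.
float deviceMin(const std::vector<float>& h, int blocks = 4)
{
    if (h.empty()) return FLT_MAX;
    const int n = static_cast<int>(h.size());

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    minKernel<<<blocks, kThreads>>>(dIn, n, dOut);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float result = FLT_MAX;
    for (float p : partial) result = fminf(result, p);
    return result;
}
```

The point of the layout is that the read loop contains no `__syncthreads()` and no shared-memory traffic at all, so the cost of the sweep is paid once per block rather than once per batch of 64/128 inputs.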