Best way to find many minimums

You should be able to read the input and keep a running minimum without doing any sort of parallel reduction sweep until the block has finished reading all of its input data. Take a look at the code I indicated earlier. It appears that you are doing a full shared-memory sweep each time you load a set of 64/128 inputs. That should not be necessary.
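
To make the idea concrete, here is a minimal sketch. It is not the code referred to above and not your kernel, just an illustration under some assumptions of mine: the data is `float`, each minimum is taken over a contiguous segment of `segLen` values, and one block handles one segment. The names `segmentMin`, `runSegmentMin` and `kThreads` are made up for this example. Each thread keeps its running minimum in a register while it strides through the segment, and the shared-memory sweep happens exactly once per block, after all the loads are finished.

```cuda
#include <cfloat>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

// One block per segment (assumed layout: segments are contiguous and equal length).
__global__ void segmentMin(const float* in, float* out, int segLen)
{
    __shared__ float smin[kThreads];
    const float* seg = in + (size_t)blockIdx.x * segLen;

    // Phase 1: read and minimize in a register only.
    // No shared memory traffic and no __syncthreads() inside this loop.
    float m = FLT_MAX;
    for (int i = threadIdx.x; i < segLen; i += kThreads)
        m = fminf(m, seg[i]);

    // Phase 2: a single shared-memory sweep, done once all reading is finished.
    smin[threadIdx.x] = m;
    __syncthreads();
    for (int s = kThreads / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = smin[0];
}

// Hypothetical host wrapper: returns one minimum per segment.
// Error checking on the CUDA calls is omitted to keep the sketch short.
std::vector<float> runSegmentMin(const std::vector<float>& hIn, int numSegs, int segLen)
{
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, numSegs * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    segmentMin<<<numSegs, kThreads>>>(dIn, dOut, segLen);

    std::vector<float> hOut(numSegs);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, numSegs * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

With this structure the cost of the sweep (log2 of the block size steps, each with a `__syncthreads()`) is paid once per block instead of once per 64/128 inputs loaded. If your segments are laid out differently, only the indexing in phase 1 should need to change.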