Reducing Multiple Arrays

I have many arrays of different sizes that all need to be reduced. The number of arrays can reach the hundreds of thousands. Any ideas on how to do this efficiently? Should I launch one grid of blocks for every array that needs processing, or just one block of threads per array? How do I decide on an optimal number of threads per block and number of blocks? The reduction whitepaper reports that the best results were obtained with 64-256 blocks of 128 threads, with each thread reducing 1024-4096 elements. Would this hold in my case as well?

Another question is how I should organize my many arrays. One option is to pack them back to back within a single huge 1D array and keep track of where each one starts and stops. However, this may prevent coalesced memory accesses, since most arrays would no longer start on an aligned address.
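To make the single-array option concrete, here is a rough sketch of what I have in mind, assuming one thread block per array and a sum reduction over floats. The names (reducePacked, sumPackedArrays, offsets) and the launch parameters (128 threads, at most 65535 blocks) are placeholders of my own, not something taken from the whitepaper, and I have not tuned any of it.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread block per array. offsets has numArrays + 1 entries, and array a
// occupies data[offsets[a]] up to (not including) data[offsets[a + 1]].
// Assumes blockDim.x is a power of two.
__global__ void reducePacked(const float* data, const int* offsets,
                             float* sums, int numArrays)
{
    extern __shared__ float partial[];

    // Stride over arrays so the grid may be smaller than numArrays.
    for (int a = blockIdx.x; a < numArrays; a += gridDim.x) {
        const int begin = offsets[a];
        const int end = offsets[a + 1];

        // Each thread accumulates a strided slice of this array.
        float sum = 0.0f;
        for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
            sum += data[i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction of the per-thread partial sums in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            sums[a] = partial[0];
        __syncthreads();  // partial[] is reused for the next array
    }
}

// Hypothetical host wrapper. Error checking is omitted to keep the sketch short.
std::vector<float> sumPackedArrays(const std::vector<float>& data,
                                   const std::vector<int>& offsets)
{
    const int numArrays = static_cast<int>(offsets.size()) - 1;
    if (numArrays <= 0)
        return std::vector<float>();
    std::vector<float> sums(numArrays, 0.0f);

    float* dData = nullptr;
    int* dOffsets = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dData, data.size() * sizeof(float));
    cudaMalloc(&dOffsets, offsets.size() * sizeof(int));
    cudaMalloc(&dSums, numArrays * sizeof(float));
    cudaMemcpy(dData, data.data(), data.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dOffsets, offsets.data(), offsets.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    const int threads = 128;  // placeholder value, must be a power of two
    const int blocks = numArrays < 65535 ? numArrays : 65535;
    reducePacked<<<blocks, threads, threads * sizeof(float)>>>(
        dData, dOffsets, dSums, numArrays);

    // This copy waits for the kernel, which runs on the same default stream.
    cudaMemcpy(sums.data(), dSums, numArrays * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOffsets);
    cudaFree(dSums);
    return sums;
}
```

With this layout only the first array is guaranteed to start on an aligned address. Every other array begins wherever the previous one ended, which is the coalescing concern I mentioned.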

Another option is to manage an array of pointers (effectively a jagged 2D array), where each element is a variable-sized 1D array that needs to be reduced. Since I would allocate each 1D array dynamically, and cudaMalloc is guaranteed to return a pointer aligned to at least 256 bytes, every array would start on an aligned address, which should help with coalescing. The downside is extra overhead: a separate allocation per array, plus an additional memory access to fetch each array’s pointer before reading its data. Any opinions?
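For comparison, this is roughly how I picture the pointer-table version, again assuming one block per array and a float sum. The names (reduceByPointer, sumSeparateArrays) and launch parameters are made up for illustration, and I am not claiming this is the right way to do it.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread block per array. arrays[a] is a device pointer to array a and
// lengths[a] is its element count. Assumes blockDim.x is a power of two.
__global__ void reduceByPointer(const float* const* arrays, const int* lengths,
                                float* sums, int numArrays)
{
    extern __shared__ float partial[];

    for (int a = blockIdx.x; a < numArrays; a += gridDim.x) {
        const float* src = arrays[a];  // extra global load to fetch the pointer
        const int n = lengths[a];

        float sum = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            sum += src[i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            sums[a] = partial[0];
        __syncthreads();  // partial[] is reused for the next array
    }
}

// Hypothetical host wrapper. Error checking is omitted to keep the sketch short.
std::vector<float> sumSeparateArrays(const std::vector<std::vector<float>>& host)
{
    const int numArrays = static_cast<int>(host.size());
    std::vector<float> sums(numArrays, 0.0f);
    if (numArrays == 0)
        return sums;

    // One cudaMalloc and one copy per array: this is the overhead in question.
    std::vector<float*> ptrs(numArrays, nullptr);
    std::vector<int> lengths(numArrays, 0);
    for (int a = 0; a < numArrays; ++a) {
        lengths[a] = static_cast<int>(host[a].size());
        cudaMalloc(&ptrs[a], host[a].size() * sizeof(float));
        cudaMemcpy(ptrs[a], host[a].data(), host[a].size() * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    // The table of device pointers must itself live in device memory.
    float** dPtrs = nullptr;
    int* dLengths = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dPtrs, numArrays * sizeof(float*));
    cudaMalloc(&dLengths, numArrays * sizeof(int));
    cudaMalloc(&dSums, numArrays * sizeof(float));
    cudaMemcpy(dPtrs, ptrs.data(), numArrays * sizeof(float*),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dLengths, lengths.data(), numArrays * sizeof(int),
               cudaMemcpyHostToDevice);

    const int threads = 128;  // placeholder value, must be a power of two
    const int blocks = numArrays < 65535 ? numArrays : 65535;
    reduceByPointer<<<blocks, threads, threads * sizeof(float)>>>(
        dPtrs, dLengths, dSums, numArrays);

    cudaMemcpy(sums.data(), dSums, numArrays * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int a = 0; a < numArrays; ++a)
        cudaFree(ptrs[a]);
    cudaFree(dPtrs);
    cudaFree(dLengths);
    cudaFree(dSums);
    return sums;
}
```

The kernel is nearly the same as it would be for a packed layout. The difference is on the host side, where hundreds of thousands of arrays would mean hundreds of thousands of cudaMalloc and cudaMemcpy calls.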