Hello,
I am looking for your advice on how to distribute N elements of a contiguous array among K kernels, where N >> K and the entire array is much larger than shared memory. I am considering using just a 1D grid configuration.
Here are two distribution schemes I am considering. Please let me know if there are better distribution schemes available.

1. Each kernel handles the elements i for which i % K equals its own index, where 0 <= i < N. For example,
kernel[0] processes elements 0, K, 2K, …
kernel[1] processes elements 1, K + 1, 2K + 1, …
…
2. Each kernel handles N / K contiguous elements in order. For example,
kernel[0] processes elements 0, 1, …, N/K - 1
kernel[1] processes elements N/K, N/K + 1, …, 2N/K - 1
…
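To make the two schemes concrete, here is a rough sketch of what I have in mind. It assumes each "kernel" above is really one CUDA thread in a 1D grid, so K = gridDim.x * blockDim.x. The names `process`, `interleaved`, `blocked` and `both_schemes_match` are placeholders I made up for illustration, and `process` just stands in for the real per-element work.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the real per-element work (an assumption, not my actual code).
__host__ __device__ inline float process(float x) { return 2.0f * x; }

// Scheme #1 (interleaved): worker k handles elements k, k + K, k + 2K, ...
__global__ void interleaved(const float* in, float* out, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // index of this worker
    int K = gridDim.x * blockDim.x;                 // total number of workers
    for (int i = k; i < n; i += K) out[i] = process(in[i]);
}

// Scheme #2 (blocked): worker k handles one contiguous chunk of ceil(n / K) elements.
__global__ void blocked(const float* in, float* out, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int K = gridDim.x * blockDim.x;
    int chunk = (n + K - 1) / K;          // ceil(n / K), so the tail is covered
    int begin = k * chunk;
    int end = min(begin + chunk, n);      // empty range if begin >= n
    for (int i = begin; i < end; ++i) out[i] = process(in[i]);
}

// Runs both kernels on the same input and checks that each element was
// processed exactly as expected by both schemes.
bool both_schemes_match(int n, int blocks, int threads) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    interleaved<<<blocks, threads>>>(d_in, d_a, n);
    blocked<<<blocks, threads>>>(d_in, d_b, n);

    // cudaMemcpy synchronizes with the preceding kernels on the default stream.
    cudaError_t e1 = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_a[i] != process(h_in[i]) || h_b[i] != h_a[i]) return false;
    }
    return true;
}
```

A call such as `both_schemes_match(1 << 20, 64, 256)` would run both variants with K = 64 * 256 workers. The two kernels produce the same output and differ only in the order in which each worker walks through global memory, which is the part I am asking about.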
Considering data locality and “cache” misses, which one is better?
If the load of processing each element is similar, #1 has the benefit that all kernels chew through one contiguous chunk of the array at a time, and then move on to the next chunk more or less together. If the load of processing an element varies a lot, the kernels drift apart and #1 would suffer from cache misses.
A kernel in #2, on the other hand, is much less likely to depend on the same data chunk as the others. But it probably suffers quite badly from excessive loading from global memory into shared memory.
I personally lean towards #1 based on the above logic. Would you please advise? Would a 2D/3D grid configuration help? Thanks so much.