Summing multiple arrays into one array, and some other questions

Hi, I have been wondering for some time now how best to sum multiple arrays into one array.

Currently I have one huge array (of length n*m) that holds n arrays of length m, something like A1, A2, A3, …, An, where each of them has length m (m%2 is not necessarily 0).
I would like to sum all of these n arrays into a single array of length m.
Each thread works on one array, and some of these arrays are somewhat mixed together in memory (because of coalesced writes).
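
To make the goal concrete, here is a minimal sketch of the sum I am after, done on the device with one thread per output element. It assumes the simplest possible layout, where array k starts at offset k*m, which is not exactly my real (partly mixed) layout. The names sumArraysKernel and sumArrays are made up for illustration, and the host helper uploads the data only so that the sketch is self-contained; in my program the data already lives on the device.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread j adds up element j of every array.
// ASSUMPTION: array k occupies in[k*m] .. in[k*m + m - 1].
__global__ void sumArraysKernel(const float* in, float* out, int n, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= m) return;               // m does not have to be a multiple of the block size
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += in[k * m + j];         // neighbouring threads read neighbouring addresses
    out[j] = acc;
}

// Hypothetical host helper: only m floats come back from the device.
std::vector<float> sumArrays(const std::vector<float>& host, int n, int m)
{
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(float) * n * m);
    cudaMalloc(&dOut, sizeof(float) * m);
    cudaMemcpy(dIn, host.data(), sizeof(float) * n * m, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (m + threads - 1) / threads;   // round up so every j is covered
    sumArraysKernel<<<blocks, threads>>>(dIn, dOut, n, m);

    std::vector<float> result(m);
    cudaMemcpy(result.data(), dOut, sizeof(float) * m, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this layout the reads should be coalesced without any rearranging, but I have not measured it, and error checking is left out to keep the sketch short.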

At the moment I sum the arrays on my CPU, so I need to transfer about 27MB of data
(if I calculated correctly: 60 * 32 * 3640 * 4 [blocks * threads * oneArrayLength * sizeof(float)]; for some reason this is dramatically slowing down my program :mellow: )
from the device to the computer and then reset it again (filling the arrays with zeros). I would rather transfer 3640*4B of data than 55MB.

I saw the reduction example, which gets great speed-ups, but would it be appropriate for my problem?
In my current version, where the arrays are somewhat mixed, the reduction example would not use coalesced reads, because the first elements of the arrays are not stored at consecutive addresses, nor are the others.
I could of course write a program that rearranges my original array into the form A1[1], A2[1], A3[1], …, An[1], A1[2], A2[2], A3[2], …, An[2], …, A1[m], A2[m], A3[m], …, An[m]
and then use reduction on subsets of it, which would sum the values of (A1[1], A2[1], A3[1], …, An[1]) and save the result to the first field of the output array, and so on.
But would this still be efficient? (each sub-array in this example has 3640 elements; in the reduction example there are 1048576 elements)
Note: I would also have to pad my arrays to 2^x length.
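
Here is a rough sketch of the rearranged-layout idea, assuming the data has already been reordered so that the n values belonging to output element j sit next to each other (0-based here, so in[j*n + k] holds element j of array k). One block handles one output element. The names segmentedSumKernel, segmentedSum and kThreads are hypothetical, and this is my own simplified version, not the code from the reduction example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Block size; ASSUMPTION: a power of two, which the tree step below relies on.
constexpr int kThreads = 256;

// Hypothetical kernel: block j sums the n consecutive values of segment j.
__global__ void segmentedSumKernel(const float* in, float* out, int n)
{
    __shared__ float partial[kThreads];
    int j = blockIdx.x;
    const float* seg = in + (size_t)j * n;

    // Strided loop: works for any n, so the data itself is not padded to 2^x here.
    float acc = 0.0f;
    for (int k = threadIdx.x; k < n; k += blockDim.x)
        acc += seg[k];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction over the per-thread partial sums in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[j] = partial[0];
}

// Hypothetical host helper: m segments of n values in, m sums out.
std::vector<float> segmentedSum(const std::vector<float>& host, int n, int m)
{
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(float) * n * m);
    cudaMalloc(&dOut, sizeof(float) * m);
    cudaMemcpy(dIn, host.data(), sizeof(float) * n * m, cudaMemcpyHostToDevice);

    segmentedSumKernel<<<m, kThreads>>>(dIn, dOut, n);   // one block per output element

    std::vector<float> result(m);
    cudaMemcpy(result.data(), dOut, sizeof(float) * m, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

In this sketch only the block size has to be a power of two, not the array length, but I do not know whether the extra rearranging pass would eat up whatever the reduction gains.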

Other questions:
If every thread wanted to access the same location in global memory, that would probably be inefficient, right?
Is it better to use texture fetches or coalesced reads from global memory?
In the following function: __device__ void blabla() { float a; }, would the variable ‘a’ be stored in a register, in shared memory, or somewhere else?

Hopefully someone can answer my questions. Thanks in advance!

PS: sorry for my not-so-good English :">