Reduction-centric design forces: should I consider atomic increment rather than classic reduction?

I have a question about using CUDA where a data reduction is required. Imagine that a matrix of objects, A[m][n], must be produced. A[1,1], A[2,1] … A[m,1] are computed by thread block 1, where the thread block has m threads in it. The second thread block then does A[1,2], A[2,2] … A[m,2], and so on. After this basic work, I need to sum-reduce all the elements of each thread block's column into the first element, leaving the elements
A[1,1], A[1,2] … A[1,n]
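Concretely, the layout I have in mind looks roughly like the sketch below. The kernel name, the compute() placeholder, and the column-major indexing are just illustrative assumptions; the tree reduction assumes the launch uses blockDim.x equal to the next power of two at or above m.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-element work.
__device__ double compute(int i, int j) { return (double)(i + j); }

// One thread block per column j; thread i produces A[i,j], then the
// block sum-reduces its column into the column's first element.
// Launch with blockDim.x = next power of two >= m.
__global__ void compute_and_reduce_column(double *A, int m)
{
    int i = threadIdx.x;   // row within the column
    int j = blockIdx.x;    // column owned by this thread block

    if (i < m)
        A[(size_t)j * m + i] = compute(i, j);
    __syncthreads();

    // Classic tree reduction into A[j*m + 0].
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (i < stride && i + stride < m)
            A[(size_t)j * m + i] += A[(size_t)j * m + i + stride];
        __syncthreads();
    }
}
```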

With this basic picture in mind, imagine that each A[i,j] object, rather than being a single element, is an array of structs of length 1440, where each struct contains three doubles. The reduction for each thread block is done separately for each of the 1440 * 3 doubles. So each thread block, after its reduction, produces a single object: an array of length 1440 of structs, each struct holding 3 doubles.
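As a sketch of that object, with Triple, Cell, and NBINS as made-up names, each cell being reduced looks like this:

```cuda
#define NBINS 1440

// The struct of three doubles described above (field names are made up).
struct Triple { double x, y, z; };

// One A[i,j] object: 1440 structs, i.e. 1440 * 3 doubles per cell.
struct Cell { Triple bins[NBINS]; };

// The per-block reduction sums each of the 1440 * 3 doubles
// independently, component by component.
__host__ __device__ void accumulate(Cell &dst, const Cell &src)
{
    for (int b = 0; b < NBINS; ++b) {
        dst.bins[b].x += src.bins[b].x;
        dst.bins[b].y += src.bins[b].y;
        dst.bins[b].z += src.bins[b].z;
    }
}
```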

Each thread block needs a scratch area so each thread can work independently without colliding. This scratch area, for one thread block, is the length of the thread block, m, times 1440 * 3 * sizeof(double); and since there are n thread blocks, the scratch area for the kernel launch is m * n * 1440 * 3 * sizeof(double).
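With illustrative values for m and n (the real ones are whatever fits in device memory), the sizing works out as in this host-side sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int m = 64, n = 32;                          // example sizes only
    size_t per_thread = 1440 * 3 * sizeof(double);     // 34,560 bytes/thread
    size_t scratch_bytes = (size_t)m * n * per_thread; // ~67.5 MiB here

    double *scratch = nullptr;
    if (cudaMalloc(&scratch, scratch_bytes) != cudaSuccess) {
        fprintf(stderr, "not enough device memory for scratch\n");
        return 1;
    }
    printf("scratch area: %zu bytes\n", scratch_bytes);
    cudaFree(scratch);
    return 0;
}
```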
For each thread block, the reduction phase takes this scratch area and performs the reduction into the first element, and the kernel is done. The resulting scratch area must then be further reduced, after the kernel finishes, into a global accumulator of size n * 1440 * 3 * sizeof(double). Many such kernel launches are performed until the data is exhausted.

The problem is that the memory required for each kernel launch keeps n and m small, so that I can get enough device scratch memory for the launches, and this procedure is not fast. The scratch memory is far too big for shared memory, because m * n * 1440 * 3 * sizeof(double) keeps m and n close to 1.

I am wondering if I could just use atomic add/increment right into the global accumulator. Would that be an insane option? Atomic operations straight into the global accumulator for all thread blocks in a kernel launch would be much slower than the design outlined above, right? I can't see another way.
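For reference, the atomic variant I am picturing would be something like the sketch below: each thread computes its own Cell and does 1440 * 3 atomicAdds straight into the global accumulator, with no scratch area and no separate reduction pass. It reuses the made-up Cell/Triple/NBINS types from the earlier sketch, the compute_cell() body is a placeholder, and atomicAdd on double requires compute capability 6.0 or later.

```cuda
// Hypothetical stand-in for the real per-thread computation.
__device__ void compute_cell(Cell &c)
{
    for (int b = 0; b < NBINS; ++b)
        c.bins[b] = { (double)b, (double)threadIdx.x, (double)blockIdx.x };
}

// Each thread folds its result straight into global_acc[j] with
// atomics; no per-launch scratch area, no post-kernel reduction.
__global__ void accumulate_atomic(Cell *global_acc /* length n */)
{
    int j = blockIdx.x;   // output slot for this thread block
    Cell local;           // ~34 KB per thread, spills to local memory

    compute_cell(local);

    // 1440 * 3 atomicAdds per thread into global memory.
    for (int b = 0; b < NBINS; ++b) {
        atomicAdd(&global_acc[j].bins[b].x, local.bins[b].x);
        atomicAdd(&global_acc[j].bins[b].y, local.bins[b].y);
        atomicAdd(&global_acc[j].bins[b].z, local.bins[b].z);
    }
}
```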