 # Memory usage problem

This is actually a mix of a CUDA problem and a mathematical problem, but I need to explain the underlying mathematical issue for the question to make sense.

I have run into a problem on which I would really appreciate some input. I want to calculate the correlation between two variables X = {x1, x2, x3, …, xN} and Y = {y1, y2, y3, …, yN}. The samples are quite big, so N can be > 300000.
Each block I am using consists of 32 threads, so my goal is to do 1/32nd of the correlation calculation in each thread and afterwards combine the results from the 32 threads into the actual correlation. This correlation calculation is only part of the functionality in the kernel, so I am a bit restricted because I "can't" change the overall design (without a lot of work).

X and Y are actually graphs, and my main problem is that the values are cumulative: each stored "point" is the running total of all the raw values up to it. For example, the stored y3 is really raw y1 + y2 + y3, so each new value gets added onto the previous values.

The easiest way mathematically (from what I know) to calculate the correlation between X and Y is:

corr = (N * SumXY - SumX * SumY) / sqrt((N * SumXX - SumX * SumX) * (N * SumYY - SumY * SumY))

My problem now is that the threads depend on each other: for example, in thread 2 I need to know the final SumY value from thread 1 to be able to continue calculating SumY.

The simple sums should be doable, but the squares seem very hard. And if thread_n+1 needs input from thread_n, the code is basically serial, which isn't very good.

One approach I thought about was to save all y_i in an array in global memory and then use it to combine the values afterwards. This turned out to be a bad idea, because every block would write to that global memory and it would be chaos, so I would need one array per block. But I am using 2048 blocks, and each block would need an array of > 300000 floats. That would need a serious amount of memory and seems grossly inefficient.

Does anyone have a suggestion for how I can solve this problem? Perhaps it is not possible to do in a parallel way (without using an out-of-proportion amount of memory)?