Summing more than 33,553,920 numbers (max number of threads on Tesla S2050)

Hi people, my code sums N numbers in parallel in pairs without repetition, and it works fine. My trouble starts when I have more than 33,553,920 numbers (the maximum number of threads in my launch configuration on the Tesla S2050), because each thread performs only a single sum. This is my code:

__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2){

	//Device variable declarations
	int tid, sh, linIdx, i, z, j;

	//Compute the thread ID
	sh = threadIdx.x + 32 * threadIdx.y;
	tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

	//Compute the right pair of indices (i, j) for this thread
	if (tid < N2){
		linIdx = N2 - tid;
		i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));
		z = (N + N - 1 - i) * i;
		j = tid - z/2 + 1 + i;
		if (i == j){
			i = i - 1;
			j = N - 1;
		}

		//Sum of two four-vectors
		VettRis[tid] = Vett[i] + Vett[j];
	}
}

and I call the global function with:

sumVect<<<dim3(2048,32,1),dim3(32,16,1)>>>(QVect_Dev,QVect_Dev_Ris,N,coefBin);

They told me to send the kernel one piece of the data at a time in a “for” loop, but I don’t know how to manage the data :(. Help me please :( Thanks a lot!
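One way the “piece of data at a time” suggestion could look is a loop of kernel launches, each covering one chunk of the output pairs. This is only a sketch: the kernel name `sumVectChunk`, the `offset` parameter, and the `CHUNK` size are illustrative additions, not part of the original code.

```cuda
// Sketch: several launches, each covering CHUNK output pairs.
// The extra offset parameter shifts tid so the pair-index math
// in the original sumVect stays valid across launches.
#define CHUNK 33553920UL   // pairs handled per launch (assumed value)

__global__ void sumVectChunk(QVECT *Vett, QVECT *VettRis,
                             unsigned long int N, unsigned long int N2,
                             unsigned long int offset)
{
	int sh = threadIdx.x + 32 * threadIdx.y;
	unsigned long int tid = 512UL * blockIdx.x
	                      + 1048576UL * blockIdx.y + sh
	                      + offset;              // global pair index
	if (tid < N2) {
		// ... same index computation and sum as in sumVect,
		//     using this global tid ...
	}
}

// Host side: one launch per chunk of the coefBin output pairs
for (unsigned long int off = 0; off < coefBin; off += CHUNK) {
	sumVectChunk<<<dim3(2048,32,1), dim3(32,16,1)>>>
		(QVect_Dev, QVect_Dev_Ris, N, coefBin, off);
}
cudaDeviceSynchronize();
```

Since each launch writes a disjoint slice of `VettRis`, the launches need no extra coordination beyond the final synchronize.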

We’ve modified the reduction (sum) function highlighted in the CUDA article by Mark Harris to accommodate more than the 100M+ numbers allowed on current compute capability 2.x cards. I’ve discussed it in another thread recently (no pun intended… OK, it was).

It requires two passes, but will sum basically as much data as you can fit in memory.
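For reference, the two-pass idea can be sketched as below. This is a simplified version (summing `float`s rather than the poster’s `QVECT`s), not the actual modified code: pass 1 has each block reduce a grid-stride slice to one partial sum, and pass 2 runs the same kernel once more over the partials.

```cuda
// Simplified two-pass sum reduction, after the scheme in Mark Harris's
// CUDA reduction article. Handles any n that fits in device memory.
__global__ void reduceSum(const float *in, float *out, unsigned long int n)
{
	extern __shared__ float sdata[];
	unsigned int tid = threadIdx.x;
	unsigned long int i = (unsigned long int)blockIdx.x * blockDim.x + tid;
	unsigned long int stride = (unsigned long int)gridDim.x * blockDim.x;

	// Grid-stride accumulation: each thread sums many elements
	float sum = 0.0f;
	for (; i < n; i += stride)
		sum += in[i];
	sdata[tid] = sum;
	__syncthreads();

	// Tree reduction in shared memory (blockDim.x must be a power of 2)
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) sdata[tid] += sdata[tid + s];
		__syncthreads();
	}
	if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Host side (blocks, threads chosen by the caller):
// reduceSum<<<blocks, threads, threads*sizeof(float)>>>(d_in, d_part, n);
// reduceSum<<<1, threads, threads*sizeof(float)>>>(d_part, d_out, blocks);
```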

The total number of threads available is about 10 times larger than your number. The maximum grid size is 65535 x 65535 x 64 blocks, while the number of threads in a block can be up to 1024. If you multiply these out you get well past 300 million. If you cannot modify the size of the block and grid, then you can try to process more than one element per thread with a for loop.
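Applied to the original kernel, the “more than one element per thread” suggestion is the grid-stride loop pattern. A sketch (the index math inside the loop mirrors `sumVect`; the name `sumVectStride` and the hard-coded stride matching the original <<<dim3(2048,32,1),dim3(32,16,1)>>> launch are assumptions):

```cuda
// Sketch: same pair-sum kernel, but each thread loops over several
// output pairs spaced by the total number of launched threads, so
// N2 may exceed the thread count.
__global__ void sumVectStride(QVECT *Vett, QVECT *VettRis,
                              unsigned long int N, unsigned long int N2)
{
	int sh = threadIdx.x + 32 * threadIdx.y;
	unsigned long int start  = 512UL * blockIdx.x
	                         + 1048576UL * blockIdx.y + sh;
	unsigned long int stride = 512UL * 2048UL * 32UL;  // total threads launched

	for (unsigned long int tid = start; tid < N2; tid += stride) {
		long int linIdx = (long int)(N2 - tid);
		long int i = (long int)(N - 0.5 - sqrt(0.25 - 2.0 * (1.0 - linIdx)));
		long int z = ((long int)(N + N) - 1 - i) * i;
		long int j = (long int)tid - z / 2 + 1 + i;
		if (i == j) {
			i = i - 1;
			j = (long int)N - 1;
		}
		VettRis[tid] = Vett[i] + Vett[j];
	}
}
```

The launch configuration can then stay fixed; only the loop bound `N2` grows with the input.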