Summing more than 33,553,920 numbers (max number of threads on Tesla S2050)

Hi people, my code sums N numbers in parallel in pairs without repetition, and it works fine. My trouble starts when I have more than 33,553,920 numbers (the maximum number of threads in my launch configuration on the Tesla S2050), because each thread performs only a single sum. This is my code:

__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2){

	//Device variable declarations
	int tid, sh, linIdx, i, z, j;

	//Compute the thread ID
	sh = threadIdx.x + 32 * threadIdx.y;
	tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

	//Compute the right pair of indices (i, j) for this thread
	if (tid < N2){
		linIdx = N2 - tid;
		i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));
		z = (N + N - 1 - i) * i;
		j = tid - z/2 + 1 + i;
		if (i == j){
			i = i - 1;
			j = N - 1;
		}

		//Sum of two four-vectors
		VettRis[tid] = Vett[i] + Vett[j];
	}
}

and I call the global function with:

sumVect<<<dim3(2048,32,1),dim3(32,16,1)>>>(QVect_Dev,QVect_Dev_Ris,N,coefBin);

They told me to send the kernel one piece of the data at a time in a “for” loop, but I don’t know how to manage the data :(. Help me please :( Thanks a lot!
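One way the “piece of data at a time” suggestion could look is a loop of kernel launches, each covering one chunk of the output pairs. This is only a sketch: the kernel name `sumVectChunk`, the `offset` parameter, and the `CHUNK` size are illustrative additions, not part of the original code.

```cuda
// Sketch: several launches, each covering CHUNK output pairs.
// The extra offset parameter shifts tid so the pair-index math
// in the original sumVect stays valid across launches.
#define CHUNK 33553920UL   // pairs handled per launch (assumed value)

__global__ void sumVectChunk(QVECT *Vett, QVECT *VettRis,
                             unsigned long int N, unsigned long int N2,
                             unsigned long int offset)
{
	int sh = threadIdx.x + 32 * threadIdx.y;
	unsigned long int tid = 512UL * blockIdx.x
	                      + 1048576UL * blockIdx.y + sh
	                      + offset;              // global pair index
	if (tid < N2) {
		// ... same index computation and sum as in sumVect,
		//     using this global tid ...
	}
}

// Host side: one launch per chunk of the coefBin output pairs
for (unsigned long int off = 0; off < coefBin; off += CHUNK) {
	sumVectChunk<<<dim3(2048,32,1), dim3(32,16,1)>>>
		(QVect_Dev, QVect_Dev_Ris, N, coefBin, off);
}
cudaDeviceSynchronize();
```

Since each launch writes a disjoint slice of `VettRis`, the launches need no extra coordination beyond the final synchronize.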

We’ve modified the reduction (sum) function highlighted in the CUDA article by Mark Harris to accommodate more than the 100M+ numbers allowed on current compute capability 2.x cards. I’ve discussed it in another thread recently (no pun intended… OK, it was).

It requires two passes, but will sum basically as much data as you can fit in memory.
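For reference, the two-pass idea can be sketched as below. This is a simplified version (summing `float`s rather than the poster’s `QVECT`s), not the actual modified code: pass 1 has each block reduce a grid-stride slice to one partial sum, and pass 2 runs the same kernel once more over the partials.

```cuda
// Simplified two-pass sum reduction, after the scheme in Mark Harris's
// CUDA reduction article. Handles any n that fits in device memory.
__global__ void reduceSum(const float *in, float *out, unsigned long int n)
{
	extern __shared__ float sdata[];
	unsigned int tid = threadIdx.x;
	unsigned long int i = (unsigned long int)blockIdx.x * blockDim.x + tid;
	unsigned long int stride = (unsigned long int)gridDim.x * blockDim.x;

	// Grid-stride accumulation: each thread sums many elements
	float sum = 0.0f;
	for (; i < n; i += stride)
		sum += in[i];
	sdata[tid] = sum;
	__syncthreads();

	// Tree reduction in shared memory (blockDim.x must be a power of 2)
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) sdata[tid] += sdata[tid + s];
		__syncthreads();
	}
	if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Host side (blocks, threads chosen by the caller):
// reduceSum<<<blocks, threads, threads*sizeof(float)>>>(d_in, d_part, n);
// reduceSum<<<1, threads, threads*sizeof(float)>>>(d_part, d_out, blocks);
```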

The total number of threads available is about 10 times larger than your number. The maximum grid size is 65535 x 65535 x 64 blocks, while the number of threads in a block can be up to 1024. If you multiply these out you get well past 300 million. If you cannot modify the size of the block and grid, then you can try to process more than one element per thread with a for loop.
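Applied to the original kernel, the “more than one element per thread” suggestion is the grid-stride loop pattern. A sketch (the index math inside the loop mirrors `sumVect`; the name `sumVectStride` and the hard-coded stride matching the original <<<dim3(2048,32,1),dim3(32,16,1)>>> launch are assumptions):

```cuda
// Sketch: same pair-sum kernel, but each thread loops over several
// output pairs spaced by the total number of launched threads, so
// N2 may exceed the thread count.
__global__ void sumVectStride(QVECT *Vett, QVECT *VettRis,
                              unsigned long int N, unsigned long int N2)
{
	int sh = threadIdx.x + 32 * threadIdx.y;
	unsigned long int start  = 512UL * blockIdx.x
	                         + 1048576UL * blockIdx.y + sh;
	unsigned long int stride = 512UL * 2048UL * 32UL;  // total threads launched

	for (unsigned long int tid = start; tid < N2; tid += stride) {
		long int linIdx = (long int)(N2 - tid);
		long int i = (long int)(N - 0.5 - sqrt(0.25 - 2.0 * (1.0 - linIdx)));
		long int z = ((long int)(N + N) - 1 - i) * i;
		long int j = (long int)tid - z / 2 + 1 + i;
		if (i == j) {
			i = i - 1;
			j = (long int)N - 1;
		}
		VettRis[tid] = Vett[i] + Vett[j];
	}
}
```

The launch configuration can then stay fixed; only the loop bound `N2` grows with the input.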