About reduction performance vs. occupancy

I have a big problem. I'm coding an iterative algorithm that needs to check at each iteration whether all threads have converged.
For this purpose I have used several methods and profiled the performance of my overall algorithm with each one.
1) To begin, I used a naive approach: a single CUDA global-memory convergence flag of 1*sizeof(char) that every thread of my kernel writes to concurrently if it has not converged in that iteration. Performance was quite good, but occupancy was not high enough…
2) I read about the powerful reduction algorithms in the CUDA SDK (see http://developer.download.nvidia.com/compu…c/reduction.pdf), so I tried them using a per-thread convergence flag… Good occupancy (depending on block size), but worse performance than the first method for a low number of elements (<65536 with my benchmark, if I remember correctly)… :(
3) I started asking myself questions about my block size and discovered Thrust, which seems (I only skimmed the code) to adapt the reduction algorithm to the hardware according to the GPU-specific max threads per SM and max blocks per SM. But with the thrust::reduce method, computation time is double that of method 2), even though occupancy is 1.
It drove me completely mad… :argh:

Do you have any idea what is happening?

Does the quest for occupancy fatally penalize performance? ;'(

Good afternoon,

I wish I could help, but unfortunately I don't have the answer. I was wondering, though, if you could provide the code for your first attempt, the one that is good for fewer than 65535 elements, since the reduction example given in the CUDA SDK is only good with many elements, and I only work with a maximum of 7500 elements… so I would be very grateful to be able to try your version!!!

Thanks in advance! If by chance you would prefer to send via email, my email is dlisicre@gmail.com

Kind regards,

David Lisin

I use reduction just to check whether all my thread convergence flags are 1 (converged), so I don't really need the sum of the vector: my naive version 1) is not a real reduction. The algorithm looks like this:

// init
h_converged = 0;

// iterative loop
while (h_converged != 1) {

    cudaMemset(d_converged, 1, 1); // init converged flag to true

    kernelInvocation(nbElements, d_converged) {

        ... if current thread not converged {

            *d_converged = 0; // concurrent write access to global memory, but this is my best method????!!!

        }
    }

    cudaMemcpy(&h_converged, d_converged, sizeof(char), cudaMemcpyDeviceToHost);
}
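For reference, a minimal compilable sketch of that single-flag version could look like the following; the kernel name, the `residual` array, and the tolerance `tol` are hypothetical placeholders for whatever the real per-thread convergence test is:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-thread test: "not converged" means the residual
// for this element is still above the tolerance.
__global__ void checkConvergence(int n, const float *residual, float tol,
                                 unsigned char *d_converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && residual[i] > tol)
        *d_converged = 0; // concurrent writes are benign: every writer stores 0
}

// Host-side driver (error checking omitted for brevity).
void iterateUntilConverged(int n, float *d_residual, float tol,
                           unsigned char *d_converged)
{
    unsigned char h_converged = 0;
    int block = 256, grid = (n + block - 1) / block;
    while (!h_converged) {
        cudaMemset(d_converged, 1, 1);          // assume converged
        // ... one iteration of the algorithm updating d_residual ...
        checkConvergence<<<grid, block>>>(n, d_residual, tol, d_converged);
        cudaMemcpy(&h_converged, d_converged, 1, cudaMemcpyDeviceToHost);
    }
}
```

The concurrent stores are safe precisely because every non-converged thread writes the same value (0) to the same byte, so the race has no observable effect.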

My second version looks like this:

h_converged = 0;

// iterative loop
while (h_converged != nbElements) {

    kernelInvocation2(nbElements, d_converged) {

        ... if current thread i not converged {

            d_converged[i] = 0;

        } else {

            d_converged[i] = 1;
        }
    }

    h_converged = kernelInvocationReduction(nbElements, d_converged); // sum reduction
}
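Side note: since all that is needed is the boolean "are all flags 1?", a full sum is not strictly required; a min (or logical AND) reduction answers the question directly. With Thrust that could be sketched as follows, assuming the flags live in a `thrust::device_vector` (the function name is hypothetical):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// The minimum over the flags is 1 iff every flag is 1.
bool allConverged(const thrust::device_vector<unsigned char> &d_flags)
{
    return thrust::reduce(d_flags.begin(), d_flags.end(),
                          (unsigned char)1,
                          thrust::minimum<unsigned char>()) == 1;
}
```

This does not change the asymptotic cost of the reduction, but it avoids comparing the sum against `nbElements` on the host.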

I hope you’ll understand this pseudocode…

I would like to try other things to optimize my implementation, but I don't understand why the reduction is less effective than my concurrent global-memory write accesses…

NB: The algorithm I'm coding is the Fast Iterative Method; in my working case I have at most nbthreads^(1/2)*2 concurrent write accesses occurring per iteration.

Hi,

maybe this is not the issue here, but I suppose that if you only need to check that all threads have a flag set to 1,
you can try using warp vote functions like `__all` inside your kernel (unless you're already using them)
to check your flag across all threads of a warp, and then reduce the warp results once again to get the final flag…
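A rough sketch of that idea, assuming CUDA 9 or later (where the vote intrinsic is `__all_sync`; on older toolkits it was simply `__all(flag)`), with `d_flags` being the per-thread convergence flags and `d_converged` the single global result byte from the first method:

```cuda
__global__ void convergenceVote(int n, const unsigned char *d_flags,
                                unsigned char *d_converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int flag = (i < n) ? d_flags[i] : 1;   // out-of-range lanes count as converged

    // __all_sync is nonzero iff flag is true for every participating
    // thread of the warp; the result is uniform across the warp,
    // so a single lane can report it.
    if (!__all_sync(0xFFFFFFFFu, flag) && (threadIdx.x % 32) == 0)
        *d_converged = 0;                  // one global write per warp instead of 32
}
```

`d_converged` would still be memset to 1 before the launch, as in the single-flag version; the vote just cuts the number of colliding global writes by a factor of the warp size.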