I have a big problem: I'm coding an iterative algorithm that needs to check, at each iteration, whether all threads have converged.
For this purpose I have tried several methods and profiled the overall performance of my algorithm with each one:
1) To begin with, I used a naive approach: a single CUDA global-memory convergence flag of 1*sizeof(char) that every thread of my kernel writes to concurrently if it has not converged at a given iteration (first sketch after this list). Performance was quite good, but not enough occupancy…
2) I read about the powerful reduction algorithms in the CUDA SDK (see http://developer.download.nvidia.com/compu…c/reduction.pdf), so I tried those with a per-thread convergence flag (second sketch below)… Good occupancy (depending on block size), but lower performance than the first method for small numbers of elements (<65536 in my benchmark, if I remember correctly)… :(
3) I started asking myself questions about my block size and discovered Thrust, which seems (I only skimmed the code) to adapt the reduction algorithm to the hardware, using the GPU-specific maximum number of threads per SM and maximum number of blocks per SM. But with the thrust::reduce method (third sketch below), the computation time is double that of method 2), even though occupancy is 1.
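To make things concrete, here is a stripped-down sketch of method 1). The update rule (halving each element) and the tolerance are placeholders, not my real computation:

```
#include <cuda_runtime.h>
#include <vector>

// Each thread updates one element; any thread that has not converged
// sets the shared flag. All writers store the same value (1), so the
// concurrent writes are a benign race.
__global__ void iterate_with_flag(float* data, int n, char* not_converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float new_val = 0.5f * data[i];           // placeholder update rule
    if (fabsf(new_val - data[i]) > 1e-6f)
        *not_converged = 1;
    data[i] = new_val;
}

int main()
{
    const int n = 1 << 16;
    std::vector<float> h_data(n, 1.0f);       // placeholder initial state

    float* d_data;  char* d_flag;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_flag, sizeof(char));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    char h_flag = 1;
    while (h_flag) {                          // one host round-trip per iteration
        cudaMemset(d_flag, 0, sizeof(char));
        iterate_with_flag<<<(n + 255) / 256, 256>>>(d_data, n, d_flag);
        cudaMemcpy(&h_flag, d_flag, sizeof(char), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);  cudaFree(d_flag);
    return 0;
}
```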
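Method 2) looks roughly like this: a simplified shared-memory OR-reduction over the per-thread flags, without the unrolling and multiple-elements-per-thread tricks of the SDK kernels (it assumes a power-of-two block size):

```
#include <cuda_runtime.h>

// OR-reduce the per-thread "not converged" flags within each block.
// block_out receives one partial flag per block; a second launch (or a
// host loop over the few partials) finishes the reduction.
__global__ void reduce_flags(const char* flags, char* block_out, int n)
{
    extern __shared__ char sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? flags[i] : 0;
    __syncthreads();

    // tree reduction in shared memory (requires power-of-two blockDim.x)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] |= sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        block_out[blockIdx.x] = sdata[0];
}
// launch: reduce_flags<<<numBlocks, blockSize, blockSize * sizeof(char)>>>(...)
```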
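And for 3), the convergence check reduces to a single thrust::reduce call over the flag vector, something like this (the kernel that fills the flags is elided):

```
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main()
{
    const int n = 1 << 16;

    // One char per thread: 1 if that thread has not converged this iteration.
    thrust::device_vector<char> d_flags(n, 0);
    // ... kernel writes the per-thread flags here ...

    // maximum<char> acts as a logical OR: the result is 1 iff at least
    // one thread has not converged.
    char any_not_converged =
        thrust::reduce(d_flags.begin(), d_flags.end(),
                       (char)0, thrust::maximum<char>());

    return any_not_converged;  // placeholder use of the result
}
```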
It's driving me completely mad…
Do you have any idea what is happening?
Does the quest for occupancy fatally penalize performance? ;'(