I would like to know at what data size parallelization becomes useful. I have a program with just a few loops (32, to be exact), and each one only performs comparisons. My question is: how much data do I need before I see a real improvement?
In your very specific case, it's pointless to use the GPU for just 32 comparisons. Modern GPUs have hundreds of arithmetic units, and with pipelining, thousands of operations can complete in a few cycles, so the device is massively underutilized at that scale.
In your very simple case, the most important factor is the amount of data you need to transfer between the CPU and GPU: if you send O(n) data only to read each element once, the transfer cost alone will make your small function slower than just running it on the CPU.
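To make that concrete, here is a rough back-of-the-envelope model of the break-even point. All the hardware numbers below (PCIe bandwidth, launch latency, per-core throughput) are illustrative assumptions, not measurements for any particular device; the point is the shape of the trade-off, not the exact figures.

```python
# Toy cost model: when does offloading n comparisons to a GPU pay off?
# Every constant below is an assumed, order-of-magnitude figure.

PCIE_BANDWIDTH = 16e9   # bytes/s of host-to-device transfer (assumed)
KERNEL_LAUNCH = 10e-6   # fixed seconds of launch/driver overhead (assumed)
CPU_RATE = 1e9          # comparisons/s on one CPU core (assumed)
GPU_RATE = 100e9        # comparisons/s across the whole GPU (assumed)
BYTES_PER_ITEM = 4      # e.g. 32-bit integers

def cpu_time(n):
    # CPU just runs the comparisons in place, no transfer needed
    return n / CPU_RATE

def gpu_time(n):
    # GPU pays transfer + launch overhead before it can compute
    # (copying results back is ignored to keep the model simple)
    return n * BYTES_PER_ITEM / PCIE_BANDWIDTH + KERNEL_LAUNCH + n / GPU_RATE

def break_even():
    # smallest power-of-two n where the GPU path stops losing
    n = 1
    while gpu_time(n) > cpu_time(n):
        n *= 2
    return n

if __name__ == "__main__":
    for n in (32, 1_000, 100_000, 10_000_000):
        print(f"n={n:>10}: cpu={cpu_time(n):.2e}s  gpu={gpu_time(n):.2e}s")
    print("break-even around n =", break_even())
```

With these assumed numbers the crossover lands somewhere in the tens of thousands of elements, and at n = 32 the CPU wins by orders of magnitude. Profiling on your actual hardware is the only way to get a real threshold, but the model shows why fixed transfer and launch costs dominate tiny workloads.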
Thanks a lot, my friend.