oh, you are right. For some reason I was looking at the reduce_dynamic kernel %-)
I am surprised then that you do a fixed amount of work in the reduce kernel and launch a problem-dependent number of blocks. I’d do it the other way around: launch a fixed number of blocks that run for a problem-dependent number of iterations. This is the common way to split the work on the CPU - you simply divide the total work between, say, 4 or 8 threads when running on a quad-core. The advantage is that most of the summation is done sequentially. I guess the problem with this approach is that you’d have to tune the optimal number of blocks for every GPU individually. Also, I doubt it would run faster than your current code :)
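For illustration, a minimal sketch of the fixed-blocks idea as a grid-stride reduction kernel (the kernel name and block setup here are my own, not from the code under discussion): each thread strides over the whole array, so most of the summation happens sequentially in a register, and only one small tree reduction per block remains at the end.

```cuda
// Hypothetical grid-stride reduction: grid size is fixed,
// the number of loop iterations depends on the problem size n.
__global__ void reduce_gridstride(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    float sum = 0.0f;

    // Each thread sums a strided slice of the input sequentially.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

The number of blocks would indeed have to be tuned per GPU, as noted above; a common heuristic is a small multiple of the multiprocessor count.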
Yes, this is the “trick” I use in this implementation to be able to unroll those for loops (i.e., I want to know the number of iterations at compile time). The final reduce_dynamic just takes care of the scraps, which I called the “tail” in the code.
The “tail” is very small relative to the whole problem, which means the runtime of the inefficient reduce_dynamic kernel becomes insignificant (there might be room for tweaking here, though).
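As a rough host-side sketch of that split (the constant, the kernel name reduce_unrolled, and the launch parameters are assumptions for illustration; only reduce_dynamic comes from the code being discussed): the bulk of the input is covered by blocks that each consume a compile-time-constant number of elements, and the leftover tail goes to the dynamic kernel.

```cuda
// Hypothetical split: each block of the unrolled kernel consumes a
// fixed ELEMS_PER_BLOCK elements (known at compile time, so the
// inner loop can be fully unrolled).
#define ELEMS_PER_BLOCK 4096

void reduce(const float *d_in, float *d_partial, int n)
{
    int blocks = n / ELEMS_PER_BLOCK;   // full blocks, unrolled kernel
    int tail   = n % ELEMS_PER_BLOCK;   // the "scraps"

    if (blocks > 0)
        reduce_unrolled<<<blocks, 256>>>(d_in, d_partial);
    if (tail > 0)                       // inefficient, but tiny
        reduce_dynamic<<<1, 256>>>(d_in + blocks * ELEMS_PER_BLOCK,
                                   d_partial + blocks, tail);
}
```

Since tail < ELEMS_PER_BLOCK, the dynamic kernel's share of the total runtime shrinks as n grows, which matches the point above.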
I’m so dense, who calls the remainder a “tail”??? idjit! :)
Anyway, if you treat the remainder explicitly you can, for example, use predefined block sizes (how many elements each block processes), which may be computed much faster. If the remainder is small enough, its compute time will be relatively small even if it is handled less efficiently.
If this works it should be slightly more efficient, and I guess one could fix #blocks to some optimal number as you suggested. I must admit I wasn’t aware that one could unroll that :)
Will give it a spin later!
The only objection that comes to mind: I guess if one is unlucky the remainder will be something like 255 elements * #blocks, which is larger than the original remainder?