I am trying to optimize my code as much as possible. At some point in my code I have to perform several prefix scans. The arrays are of size up to ~4000 each, and within a given iteration they all have the same size. In short, the code looks something like this:
while (some_condition) {
    [...]
    prefixSum(array1, size);
    prefixSum(array2, size);
    prefixSum(array3, size);
    prefixSum(array4, size);
    prefixSum(array5, size);
    prefixSum(array6, size);
    [...]
}
prefixSum is a host function which launches a single kernel if the array is smaller than 512 elements, or several kernels for bigger arrays.
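To give an idea of what I mean, here is a simplified sketch of the small-array path (not my exact code; the scan here is a Hillis-Steele scan in shared memory, and the multi-kernel path is omitted):

__device__ void scanShared(float *data, int n)
{
    // Inclusive Hillis-Steele scan of one array in shared memory.
    __shared__ float temp[512];
    int tid = threadIdx.x;

    if (tid < n) temp[tid] = data[tid];
    __syncthreads();

    // Each pass adds the element 'offset' positions back.
    for (int offset = 1; offset < n; offset *= 2) {
        float val = 0.0f;
        if (tid >= offset && tid < n) val = temp[tid - offset];
        __syncthreads();
        if (tid < n) temp[tid] += val;
        __syncthreads();
    }

    if (tid < n) data[tid] = temp[tid];
}

__global__ void scanSmall(float *data, int n)
{
    scanShared(data, n);
}

void prefixSum(float *array, int size)
{
    if (size < 512) {
        scanSmall<<<1, 512>>>(array, size);
    } else {
        // multi-kernel path: scan each block, scan the block sums,
        // then add the block sums back (omitted here)
    }
}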
According to the profiler, each prefix-sum kernel takes about 4 microseconds to execute, but over 16 microseconds are wasted on launching it. Then I thought: how about writing one kernel which performs the same operation on two arrays at once:
while (some_condition) {
    [...]
    prefixSum2(array1, array2, size);
    prefixSum2(array3, array4, size);
    prefixSum2(array5, array6, size);
    [...]
}
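Reusing the scanShared helper from the sketch above, the 2-way version looks roughly like this (again simplified, one block per array):

__global__ void scanSmall2(float *a, float *b, int n)
{
    // One block per array: block 0 scans a, block 1 scans b.
    scanShared(blockIdx.x == 0 ? a : b, n);
}

void prefixSum2(float *array1, float *array2, int size)
{
    if (size < 512)
        scanSmall2<<<2, 512>>>(array1, array2, size);
    // bigger arrays handled analogously (omitted)
}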
This way I reduce the number of kernel launches, and each launch has more work to do before it ends. As expected, this saved some time, so I implemented a 6-way prefix sum to reduce the overhead even more:
while (some_condition) {
    [...]
    prefixSum6(array1, array2, array3, array4, array5, array6, size);
    [...]
}
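The 6-way version follows the same pattern; roughly (the shape is illustrative, my actual implementation differs in details):

__global__ void scanSmall6(float *a1, float *a2, float *a3,
                           float *a4, float *a5, float *a6, int n)
{
    // One block per array; blockIdx.x picks which array to scan.
    float *arrays[6] = { a1, a2, a3, a4, a5, a6 };
    scanShared(arrays[blockIdx.x], n);
}

void prefixSum6(float *array1, float *array2, float *array3,
                float *array4, float *array5, float *array6, int size)
{
    if (size < 512)
        scanSmall6<<<6, 512>>>(array1, array2, array3,
                               array4, array5, array6, size);
}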
This time, however, the execution time was much longer than even in the first version. I re-ran the program several times to get consistent timings, and even restarted my computer.
So my question is: why does this happen, and what can I do about it?
One could ask why I care about those few microseconds. Note that the prefixSum calls sit inside a loop that is repeated many times, so those small overheads add up considerably in my code. That's why I do care.