Kernel call costs?

I am trying to optimize my code as much as possible. At some point my code has to perform several prefix scans. The arrays are of size up to ~4000 elements each, and within a given iteration they all have the same size. In short, the code looks something like this:

while (some_condition) {

  [...]

  prefixSum(array1,size);

  prefixSum(array2,size);

  prefixSum(array3,size);

  prefixSum(array4,size);

  prefixSum(array5,size);

  prefixSum(array6,size);

  [...]

}

prefixSum is a host function which launches a single kernel if the array is shorter than 512 elements, or several kernels for bigger arrays.

According to the profiler, each prefix-sum kernel takes about 4 microseconds to compute, but over 16 microseconds are wasted on launching it. That made me think: how about writing one kernel which performs the same operation on two arrays at once:

while (some_condition) {

  [...]

  prefixSum2(array1,array2,size);

  prefixSum2(array3,array4,size);

  prefixSum2(array5,array6,size);

  [...]

}
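Under the hood, prefixSum2 is roughly the following (a simplified sketch, not my literal code: the second grid dimension picks the array, and the per-chunk scan body shown here is a plain Hillis-Steele inclusive scan; the major/redistribute passes batch the same way):

__global__ void prefixSum2Kernel(int *a, int *b, int n)
{
  __shared__ int buf[512];
  int *arr = (blockIdx.y == 0) ? a : b;           // blockIdx.y selects the array
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  buf[threadIdx.x] = (i < n) ? arr[i] : 0;        // stage one 512-element chunk in shared memory
  __syncthreads();

  for (int offset = 1; offset < blockDim.x; offset <<= 1) {
    int v = (threadIdx.x >= offset) ? buf[threadIdx.x - offset] : 0;
    __syncthreads();                              // all reads done before anyone writes
    buf[threadIdx.x] += v;
    __syncthreads();                              // all writes done before the next round
  }

  if (i < n) arr[i] = buf[threadIdx.x];           // write the scanned chunk back
}

// launched as:
//   dim3 grid((size + 511) / 512, 2);            // x: chunks, y: which array
//   prefixSum2Kernel<<<grid, 512>>>(d_array1, d_array2, size);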

This way I reduce the number of kernel calls, and each kernel has more work to do before it ends. As expected, this saved some time, so I implemented a 6-way prefix sum to cut the time even further:

while (some_condition) {

  [...]

  prefixSum6(array1,array2,array3,array4,array5,array6,size);

  [...]

}

This time, however, the execution time was much, much longer than even in the first example. I re-ran the program several times to get consistent time readings, and even restarted my computer.

So my question is: why? How? And what can I do about it?

One could ask why I care about those few microseconds. Note that my prefixSum call sits in a loop which is repeated many times, and those small overheads add up considerably in my code. That's why I do care.

The thing is:

If your kernel execution time + kernel launch time still gets you a speedup (compared to the CPU), I think you should live with it.

Consider using a single cudaThreadSynchronize() for many kernel calls (the kernels will simply queue up on the device), unless your application logic forbids it.
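Something like this — launches are asynchronous, so the calls return immediately and the kernels queue on the device; the host blocks only once at the end (cudaThreadSynchronize() is the old runtime-API name; newer toolkits call it cudaDeviceSynchronize()):

  prefixSum(array1, size);   // each launch returns immediately
  prefixSum(array2, size);
  prefixSum(array3, size);
  prefixSum(array4, size);
  prefixSum(array5, size);
  prefixSum(array6, size);

  cudaThreadSynchronize();   // one host-side wait for the whole batch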

However, I don't understand why your 6-way prefix sum takes a lot more time… Maybe the logic is just not GPU-friendly…

Good Luck.

There are no synchronization instructions in the prefixSum function.

Well then, you need to make sure you do more computation per kernel launch…

And that is exactly what you have tried to do. The only way out is to find out why the prefixSum6 version runs slower.

Does your data get corrupted at any point? I noticed that happening to me after about 4000 kernel calls.

I can't help you out without a better idea of what your code is actually doing. However, I'm going to guess that the work per CTA (CTA = block) in your implementation varies significantly, so you might be hitting load-balancing penalties (the work distributor is optimized for homogeneous workloads per CTA).

I think every block of my kernel has the same workload (with an exception for the last one), as every block does exactly the same thing.

The host function prefixSum looks approximately as follows (a rough code sketch follows the lists):

(cudaPrefix)

  • call the cudaPrefix kernel for every 512-element chunk

  • if the array is longer, run the cudaPrefix kernel over every 512th element, within 512x512-element chunks (OK, that's not very coalesced). Let's call those 512th elements the "major" elements.

  • if the array is even longer, run the cudaPrefix kernel over every (512x512)th element, within 512x512x512-element chunks… Let's call those (512x512)th elements the "supermajor" elements.

and later

(redistribute)

  • add each supermajor element back to the corresponding major elements (if applicable)

  • add each major element back to the relevant elements
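In (very) simplified code, the two-level case — enough for my ~4000-element arrays — looks roughly like this; the signatures and the stride parameter are illustrative, and the redistribute kernel is sketched further down:

__global__ void cudaPrefix(int *data, int n, int stride);   // scans 512-element chunks of every stride'th element
__global__ void redistribute(int *data, int n);             // sketched below

void prefixSum(int *d_array, int size)
{
  int chunks = (size + 511) / 512;
  cudaPrefix<<<chunks, 512>>>(d_array, size, 1);        // scan each 512-element chunk
  if (chunks > 1) {
    cudaPrefix<<<1, 512>>>(d_array, size, 512);         // scan the "major" elements (stride 512)
    redistribute<<<chunks - 1, 512>>>(d_array, size);   // add majors back to the chunks that follow
  }
}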

cudaPrefix kernel:

  • load a 512-int chunk into shared memory

  • perform a prefix scan on that chunk

  • write the result back to global memory

redistribute kernel (sketched in code below):

  • one thread loads the (super)major element into a shared variable

  • add that element to all 511 following elements (one per thread)
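A minimal sketch of the major-level redistribute pass, assuming inclusive per-chunk scans with each major sitting at the last index of its 512-element chunk (the supermajor pass would be analogous with a larger stride):

__global__ void redistribute(int *data, int n)
{
  __shared__ int major;
  int pos = (blockIdx.x + 1) * 512 - 1;      // this block's "major" element
  if (threadIdx.x == 0)
    major = data[pos];                       // one thread stages it in shared memory
  __syncthreads();

  int target = pos + 1 + threadIdx.x;        // the 511 elements that follow the major
  if (threadIdx.x < 511 && target < n)
    data[target] += major;                   // majors themselves are skipped: the major
                                             // scan pass has already updated them
}

// launched as: redistribute<<<chunks - 1, 512>>>(d_array, size);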

Today I tested the algorithm on the university's GTX 280, and there prefixSum6 actually did help! It is only my GTX 260 that seems to have problems.