Short kernel calls better?

Hi, All –

I have a reasonably straightforward, highly optimized search-type algorithm implemented in CUDA 1.1. In an attempt to increase performance, I ran a very long set of tests to determine the optimal run parameters (threads/block, shared memory size, etc). One of those parameters involved the number of target strings for every thread to search for. The pseudo code looks like this:

Begin Kernel:

    Load global memory into shared memory (coalesced read)


    For several iterations  <--- this was the parameter I was testing

        Read the short target sequence from global memory (I believe this is a coalesced read)

        Do the comparison against shared memory, write output back to global memory


End Kernel.
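For concreteness, the structure described above might look roughly like the following CUDA sketch. All identifiers (`text`, `targets`, `TILE`, etc.) are illustrative guesses, since the actual code isn't shown:

```cuda
#define TILE 256  // threads per block; also the shared-memory tile size (assumed)

// Hypothetical sketch of the kernel structure described above,
// not the poster's actual code.
__global__ void searchKernel(const char *text, const char *targets,
                             int targetLen, int numTargets, int *hits)
{
    __shared__ char tile[TILE];

    // Coalesced load of this block's slice of the text into shared memory.
    int gpos = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = text[gpos];
    __syncthreads();

    // "For several iterations": one short target sequence per loop.
    for (int t = 0; t < numTargets; ++t) {
        // Each thread tests whether this target starts at its tile position.
        bool match = (threadIdx.x + targetLen <= TILE);
        for (int i = 0; match && i < targetLen; ++i)
            match = (tile[threadIdx.x + i] == targets[t * targetLen + i]);

        // Write the comparison result back to global memory.
        if (match)
            hits[t * gridDim.x * blockDim.x + gpos] = 1;
    }
}
```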

It turned out that keeping each kernel execution quite short (about 0.5 s per execution, achieved by looping roughly 190 times inside the kernel) and calling it more times produced significantly better results (a 90% improvement over a 5-second kernel called fewer times).

This result seems completely counter-intuitive. There is a fixed overhead for every kernel launch, and every thread must start out by copying data into shared memory, which takes time (only about 3 ms, but it adds up when the kernel is called 50-100 times).
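One way to quantify that fixed per-launch overhead is to time a run of empty kernel launches with CUDA events. This is just a generic measurement sketch, not part of the original code (it uses `cudaThreadSynchronize`, which was the CUDA 1.1-era synchronization call):

```cuda
// Generic sketch: measure average per-launch overhead with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();     // warm-up launch, excluded from timing
    cudaThreadSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.3f us\n", 1000.0f * ms / launches);
    return 0;
}
```

If the measured per-launch cost times the number of extra launches is small compared to the 90% speedup observed, launch overhead clearly isn't the dominant effect.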

I am running Windows XP Pro SP3 with a GeForce 8600GTS (going to try it on an 8800 soon).

My thanks for any insights,

Ben Weiss

Oregon State University Graphics Group

Your pseudocode doesn’t say so, but are your workloads varying per thread, or do they all always follow the exact same pipeline?

One theory, if you’re assigning each thread a variable amount of work, is that a long kernel is slower because some threads finish their work and sit idle while the remaining threads are still working on their jobs. Many short kernel calls can help reduce this, because the loads get re-balanced on each relaunch.

This isn’t going to be a factor if every thread has exactly the same workload though.

The workloads are very close to equal across the threads in a block (to the best of my ability to make them).

Interesting side note: I should have included this in the original post, but forgot.
If I sweep the “loops/thread” parameter, I get the following result (0% is defined as the optimal, 192 loops/thread).

As you can see, the advantage increases steadily until we hit a certain threshold, after which performance drops and becomes unpredictable.

Any insights?