Hi, All –
I have a reasonably straightforward, highly optimized search-type algorithm implemented in CUDA 1.1. In an attempt to increase performance, I ran a very long set of tests to determine the optimal run parameters (threads per block, shared memory size, etc.). One of those parameters was the number of target strings each thread searches for per kernel launch. The pseudocode looks like this:
Begin Kernel:
    Load global memory into shared memory (coalesced read)
    For several iterations   <--- this was the parameter I was testing
        Read the short target sequence from global memory (I believe this is a coalesced read)
        Do the comparison against shared memory; write output back to global memory
    Next
End Kernel
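For concreteness, here is a minimal CUDA sketch of the structure above. All names (`searchKernel`, `haystack`, `targets`, `results`, `numTargets`) are hypothetical, and the comparison is reduced to a per-element match; the real kernel is more heavily optimized:

```cuda
// Hypothetical sketch of the kernel structure described above.
// haystack: data being searched, loaded once into shared memory.
// targets:  short target sequences read inside the loop.
// numTargets is the "several iterations" parameter under test.
__global__ void searchKernel(const int *haystack, const int *targets,
                             int numTargets, int *results)
{
    extern __shared__ int sdata[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read of global memory into shared memory.
    sdata[threadIdx.x] = haystack[tid];
    __syncthreads();

    // Loop over target sequences -- the tested parameter.
    for (int t = 0; t < numTargets; ++t) {
        // Read the target from global memory (coalesced across the warp).
        int target = targets[t * blockDim.x + threadIdx.x];

        // Compare against shared memory; write output back to global memory.
        results[t * gridDim.x * blockDim.x + tid] =
            (sdata[threadIdx.x] == target);
    }
}
```

Raising `numTargets` amortizes the shared-memory load over more comparisons per launch, which is why a longer per-kernel loop would be expected to win, not lose.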
It turned out that keeping each kernel execution quite short (about 0.5 s per execution, obtained by looping roughly 190 times inside the kernel) and calling it more times produced significantly better results: a 90% improvement over a 5-second kernel called fewer times.
This result seems completely counter-intuitive. There is a fixed overhead for every kernel launch, and every thread must start out by copying data into shared memory, which takes time (only about 3 ms, but it adds up when the kernel is called 50-100 times).
I am running Windows XP Pro SP3 with a GeForce 8600GTS (going to try it on an 8800 soon).
My thanks for any insights,
Ben Weiss
Oregon State University Graphics Group