I have a kernel that I need to launch an extremely high number of times (basically it uses the thread ID as its "input" and manipulates it, and I need to cover every input up to around 10^14).
What would be the fastest way of having a kernel count up to a number that high? Should I launch the kernel more times from the CPU, or run, say, 10 loop iterations within the kernel? (Although from what I've understood, to maximize parallelism you want to unroll loops into threads.) The loop-in-kernel variant I have in mind is sketched below.
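Here's a minimal sketch of that second option, using a grid-stride loop so a single launch covers far more inputs than it has threads; process() is a hypothetical stand-in for my real per-input work:

```cuda
// Each thread walks the 64-bit range [start, end) in steps of the total
// thread count, so one launch can cover an arbitrarily large sub-range.
__device__ void process(unsigned long long input)
{
    // real per-input work would go here
}

__global__ void coverRange(unsigned long long start, unsigned long long end)
{
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long i = start
                         + (unsigned long long)blockIdx.x * blockDim.x
                         + threadIdx.x;
    for (; i < end; i += stride)
        process(i);
}
```

The host would then slice the full 10^14 range into batches, e.g. coverRange<<<8192, 128>>>(lo, hi), advancing lo between launches.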
It all depends on how much work you do in a kernel. Launching a kernel does cost something. Microbenchmarks people have done (you can find them on the forums) put that cost at roughly 10 microseconds. So if each of your launches runs for milliseconds, then don't worry about the overhead. But if each launch only runs for 1 microsecond, you can gain a lot from batching.
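If you'd rather measure that overhead on your own hardware than trust forum numbers, here's a minimal sketch using CUDA events (the figure varies with driver and GPU):

```cuda
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Estimate launch overhead by averaging n back-to-back empty launches.
float launchOverheadMs(int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < n; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / n;  // milliseconds per launch
}
```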
Keep in mind that if your GPU is also driving your display, the watchdog timer will kill your kernel if it runs longer than 1 or 2 seconds. There's a way to disable this, but the limit can be a good thing, because your display is frozen while a kernel is running.
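You can check whether the watchdog applies to a given device by querying its properties; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("watchdog enabled: %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```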
I'm running a large job as well, so I've spent some time experimenting with how long each kernel launch should run. 1–10 milliseconds seems to work well: it's long enough that the kernel startup time is negligible, and short enough that the display remains responsive. With my program, that works out to each thread performing one iteration of the main loop, with 128 threads/block and 8192 thread blocks; the host loop is sketched below.
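Roughly, the host side looks like this (a sketch; mainLoopStep() is a hypothetical stand-in for my real kernel, and the total is just an example):

```cuda
#include <cuda_runtime.h>

__global__ void mainLoopStep(unsigned long long base)
{
    // each thread performs one iteration of the main loop at index
    // base + blockIdx.x * blockDim.x + threadIdx.x
}

int main()
{
    const unsigned long long total = 100000000000ULL;  // example workload size
    const int threadsPerBlock = 128;
    const int blocksPerGrid   = 8192;
    const unsigned long long batch =
        (unsigned long long)threadsPerBlock * blocksPerGrid;  // ~1M iterations per launch

    for (unsigned long long base = 0; base < total; base += batch)
        mainLoopStep<<<blocksPerGrid, threadsPerBlock>>>(base);
    cudaDeviceSynchronize();  // wait for the final batch
    return 0;
}
```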
I was actually able to find a different approach to solving my problem, and now I only have to go up to the maximum signed int rather than insane values. The maximum theoretical time it would take for my program to do its job is now about 2 hours. If I could figure out a way to avoid addition (it seems counterintuitive that addition would be the bottleneck), I could reduce my kernel duration from 160 ms to 60 ms. I'm adding values in 0…5 to values in 0…5, so some of those additions are effectively multiplications by two; I wonder whether the hardware checks for that (i.e., whether an addition is actually a multiplication by two)?
I would profile your app to identify whether your limiting factor is actually the addition or global memory accesses. I also have an application which involves lots of small iterative kernels, and the vast majority of my optimisations have come from squeezing my memory accesses. This can involve changing how you load/store data, how you arrange your instructions (instruction-level parallelism), and even some funky things to do with a single thread handling multiple tasks; a simple illustration of the access-pattern point follows.
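As an illustration of why access patterns matter, here's a sketch contrasting a coalesced load (neighbouring threads touch neighbouring words) with a strided one; the kernels are hypothetical, doubling each element just to have some work:

```cuda
__global__ void coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // consecutive threads read consecutive addresses
}

__global__ void strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i] * 2.0f;  // accesses within a warp are stride apart
}
```

On most hardware the strided version runs markedly slower for the same element count, purely because of how the loads and stores hit memory.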
I would recommend focussing on that, and perhaps reading: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf as it has some very non-intuitive optimisations.
As an aside, I highly doubt the hardware does any checking for whether an addition is really a multiplication by two, as a mul is very unlikely to be any faster than an add.