I have a kernel that I need to launch an extremely high number of times (basically it uses the thread ID as its "input" and manipulates it, and I need to cover every input up to around 10^14).
What would be the fastest way of having a kernel count up to a number that high? Should I launch the kernel more times from the CPU, or run, say, 10 loop iterations within the kernel? (Although from what I've understood, to maximize parallelism you want to unroll loops into threads.) The loop-in-kernel variant I have in mind is sketched below.
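Here's a minimal sketch of that second option, using a grid-stride loop so a single launch covers far more inputs than it has threads; process() is a hypothetical stand-in for my real per-input work:

```cuda
// Each thread walks the 64-bit range [start, end) in steps of the total
// thread count, so one launch can cover an arbitrarily large sub-range.
__device__ void process(unsigned long long input)
{
    // real per-input work would go here
}

__global__ void coverRange(unsigned long long start, unsigned long long end)
{
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long i = start
                         + (unsigned long long)blockIdx.x * blockDim.x
                         + threadIdx.x;
    for (; i < end; i += stride)
        process(i);
}
```

The host would then slice the full 10^14 range into batches, e.g. coverRange<<<8192, 128>>>(lo, hi), advancing lo between launches.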
It all depends on how much work you do in a kernel. Launching a kernel does cost something. Microbenchmarks people have done (you can find them on the forums) put that cost at roughly 10 microseconds. So if each of your launches runs for milliseconds, then don't worry about the overhead. But if each launch only runs for 1 microsecond, you can gain a lot from batching.
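If you'd rather measure that overhead on your own hardware than trust forum numbers, here's a minimal sketch using CUDA events (the figure varies with driver and GPU):

```cuda
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Estimate launch overhead by averaging n back-to-back empty launches.
float launchOverheadMs(int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < n; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / n;  // milliseconds per launch
}
```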
Keep in mind that if your GPU is also driving your display, the watchdog timer will kill your kernel if it runs longer than 1 or 2 seconds. There's a way to disable this, but the limit can be a good thing, because your display is frozen while a kernel is running.
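You can check whether the watchdog applies to a given device by querying its properties; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("watchdog enabled: %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```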
I'm running a large job as well, so I've spent some time experimenting with how long each kernel launch should run. 1–10 milliseconds seems to work well: it's long enough that the kernel startup time is negligible, and short enough that the display remains responsive. With my program, that works out to each thread performing one iteration of the main loop, with 128 threads/block and 8192 thread blocks; the host loop is sketched below.
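Roughly, the host side looks like this (a sketch; mainLoopStep() is a hypothetical stand-in for my real kernel, and the total is just an example):

```cuda
#include <cuda_runtime.h>

__global__ void mainLoopStep(unsigned long long base)
{
    // each thread performs one iteration of the main loop at index
    // base + blockIdx.x * blockDim.x + threadIdx.x
}

int main()
{
    const unsigned long long total = 100000000000ULL;  // example workload size
    const int threadsPerBlock = 128;
    const int blocksPerGrid   = 8192;
    const unsigned long long batch =
        (unsigned long long)threadsPerBlock * blocksPerGrid;  // ~1M iterations per launch

    for (unsigned long long base = 0; base < total; base += batch)
        mainLoopStep<<<blocksPerGrid, threadsPerBlock>>>(base);
    cudaDeviceSynchronize();  // wait for the final batch
    return 0;
}
```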
I was actually able to find a different approach to solving my problem, and now I only have to go up to the maximum signed int rather than insane values. The maximum theoretical time it would take for my program to do its job is now about 2 hours. If I could figure out a way to avoid addition (it seems counterintuitive that addition would be the bottleneck), I could reduce my kernel duration from 160 ms to 60 ms. I'm adding values in 0…5 to values in 0…5, so some of those additions are effectively multiplications by two; I wonder whether the hardware checks for that (i.e., whether an addition is actually a multiplication by two)?
I would profile your app to identify whether your limiting factor is actually the addition or global memory accesses. I also have an application which involves lots of small iterative kernels, and the vast majority of my optimisations have come from squeezing my memory accesses. This can involve changing how you load/store data, how you arrange your instructions (instruction-level parallelism), and even some funky things to do with a single thread handling multiple tasks; a simple illustration of the access-pattern point follows.
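As an illustration of why access patterns matter, here's a sketch contrasting a coalesced load (neighbouring threads touch neighbouring words) with a strided one; the kernels are hypothetical, doubling each element just to have some work:

```cuda
__global__ void coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // consecutive threads read consecutive addresses
}

__global__ void strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i] * 2.0f;  // accesses within a warp are stride apart
}
```

On most hardware the strided version runs markedly slower for the same element count, purely because of how the loads and stores hit memory.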
I would recommend focussing on that, and perhaps reading: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf as it has some very non-intuitive optimisations.
As an aside, I highly doubt the hardware does any checking for whether an addition is really a multiplication by two, as a mul is very unlikely to be any faster than an add.