Kernel function in a loop. Is it fine?


I have CPU code like the one below…

for( int i = 0 ; i < 1024 ; ++i )
    for( int j = 0; j < 1024; ++j )

I want to run the above code on the GPU, so I thought I would write a kernel function for the inner loop, like the following…

for( int i = 0; i < 1024 ; ++i )
    Kernal<<<16,16>>>( … );

Is it fine?

Is there any preformance effect?

What are the other side effects?

Is this a homework question?

Of course you can do it like that.
But you will spend a lot of time launching the kernel over and over.
In my experiment, launching an empty kernel took close to 30 microseconds.
Why don’t you put the loop inside the kernel? I think that is more efficient.
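
To make the suggestion concrete, here is a minimal sketch of moving the loop over i into the kernel, so there is one launch instead of 1024. It assumes every (i, j) element can be computed independently. The names `loop_in_kernel`, `run_loop_in_kernel` and `compute`, the output buffer, and the placeholder calculation `i + j` are all hypothetical, since the real loop body was not posted.

```cuda
#include <cuda_runtime.h>

constexpr int N = 1024;

// Hypothetical placeholder for the per-element work; the real loop body was not posted.
__device__ float compute(int i, int j) { return static_cast<float>(i + j); }

// One thread per j; the former host-side loop over i now runs inside the kernel.
__global__ void loop_in_kernel(float *out)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= N) return;
    for (int i = 0; i < N; ++i)
        out[i * N + j] = compute(i, j);  // neighbouring threads write neighbouring addresses
}

// Host helper: a single launch covers the whole N x N range. Returns true on success.
bool run_loop_in_kernel(float *host_out)
{
    float *dev_out = nullptr;
    size_t bytes = sizeof(float) * N * N;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) return false;

    loop_in_kernel<<<(N + 255) / 256, 256>>>(dev_out);
    cudaError_t err = cudaGetLastError();  // catches launch configuration errors
    if (err == cudaSuccess)                // the blocking copy also surfaces execution errors
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err == cudaSuccess;
}
```

Calling `run_loop_in_kernel` with a host array of 1024 * 1024 floats fills it in one launch. Whether this actually beats 1024 small launches depends on how heavy the real calculation is, so it is worth timing both.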

Hi Manjunath,

Thanks for joining us! I’m certain the programming guide will answer all your questions much better than I can, but I will try to provide you with the answers you need.


Do you mean performance?

Nausea, dizziness, loss of time, ingestion of high quantities of caffeine, sleepless nights, and potential loss of patience.


Are you sure that having a big loop inside a kernel speeds things up? I have had no good experience with even a simple ++i loop within a kernel; any CPU can execute that faster. And of course a kernel call takes some time, you are right. Any hints or further discussion on this would be appreciated.
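
One way to ground this is to measure the launch overhead on your own hardware rather than guess. The sketch below times a batch of empty kernel launches with CUDA events; `empty_kernel` and `average_launch_us` are hypothetical names made up for illustration. Note that it measures back-to-back asynchronous launches, so the figure may come out lower than the roughly 30 microseconds quoted earlier if that number included a synchronization after every launch.

```cuda
#include <cuda_runtime.h>

// Empty kernel: any measured time is launch overhead, not computation.
__global__ void empty_kernel() {}

// Returns the average time per launch in microseconds, or a negative value on error.
float average_launch_us(int launches)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    empty_kernel<<<1, 1>>>();   // warm-up launch, excluded from the timing
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<16, 16>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait until every launch has finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms * 1000.0f / launches : -1.0f;
}
```

Multiplying the result of `average_launch_us(1024)` by 1024 gives a rough floor on the cost of the kernel-in-a-loop version before any real work is done.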



Hi again,

So, did you find a good solution for the loop problem?



It all depends on what “// A LOT LOT LOT LOT LOT OF CALCULATIONS HERE…” means.

If you put too much work into a single kernel launch, assuming you’re using GeForce or Quadro chips, the kernel will time out on you, and headaches will haunt you for days. However, from the way you’re setting things up, it seems that you have a 3D computational arrangement (the i and j loops inside the kernel, and the loop outside the kernel can be treated as k). As long as “// A LOT LOT LOT LOT LOT OF CALCULATIONS HERE…”[i1][j1][k1] doesn’t in any way depend on “// A LOT LOT LOT LOT LOT OF CALCULATIONS HERE…”[i2][j2][k2], you can try setting up three-dimensional blocks and eliminate both loops altogether.
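
A minimal sketch of that three-dimensional arrangement, assuming the elements really are independent and that your GPU supports 3D grids (compute capability 2.0 or later). The names `calc`, `calc3d` and `run_calc3d`, the output buffer, and the placeholder calculation `i + j + k` are hypothetical stand-ins, because the real calculation is unknown.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the independent per-element calculation.
__device__ float calc(int i, int j, int k) { return static_cast<float>(i + j + k); }

// Each thread handles exactly one (i, j, k) element, so no loops remain.
__global__ void calc3d(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < n && j < n && k < n)  // guard threads that fall outside the n^3 range
        out[(static_cast<size_t>(k) * n + j) * n + i] = calc(i, j, k);
}

// Host helper: launches a 3D grid of 3D blocks over an n x n x n range.
bool run_calc3d(float *host_out, int n)
{
    size_t bytes = sizeof(float) * n * n * n;
    float *dev_out = nullptr;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) return false;

    dim3 block(8, 8, 8);          // 512 threads per block
    dim3 grid((n + 7) / 8, (n + 7) / 8, (n + 7) / 8);
    calc3d<<<grid, block>>>(dev_out, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)       // the blocking copy also surfaces execution errors
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err == cudaSuccess;
}
```

Try it with a small n first: at n = 1024 a float output for every element would need 4 GB, so a real version would probably reduce or accumulate results instead of storing all of them.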

Not knowing what “// A LOT LOT LOT LOT LOT OF CALCULATIONS HERE…” means, I can’t give a definite answer on what’s faster, what’s slower, or what’s completely idiotic. I do think, however, that the two loops in the kernel are doing the same “// A LOT LOT LOT LOT LOT OF CALCULATIONS HERE…” a million+ times, and that may not be what you want.