float* pHugeSourceArray;
float* pHugeDestinationArray;
for (int i = 0; i < nNumHugeArrayElements; i++)
{
    pHugeDestinationArray[i] = DoSomethingGreat(pHugeSourceArray[i]);
}
As DoSomethingGreat() is completely independent of any previous calculations, this task is perfectly parallel and maps well to CUDA (each thread makes its single call to DoSomethingGreat()).
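For reference, a minimal sketch of what that kernel might look like (DoSomethingGreat and the array names are taken from the snippet above; the launch configuration is just an assumption):

```cuda
// One thread per element; each thread makes its single call.
__global__ void MapKernel(const float* pSrc, float* pDst, int nNumElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nNumElements)
        pDst[i] = DoSomethingGreat(pSrc[i]);
}

// Host side (error checking omitted):
// int threads = 256;
// int blocks  = (nNumHugeArrayElements + threads - 1) / threads;
// MapKernel<<<blocks, threads>>>(d_src, d_dst, nNumHugeArrayElements);
```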
Sometimes (quite often) the result of DoSomethingGreat() is NaN or INF (in other words, isfinite(DoSomethingGreat()) would return 0). In that case all the calculations for the nNumHugeArrayElements elements are useless, and it is necessary to stop the kernel and inform the calling side that a non-finite number has been encountered.
How can this be done efficiently? The obvious approaches seem to drop performance significantly…
I would classify this among the obvious approaches, but: if you set a check condition and test it every cycle, it will need synchronization, and so it will be slow. However, if you set it up to check every, let's say, 1000 cycles, you are OK. So every 1000 cycles you would:
check the NaN condition
if OK, synchronize the save point for a possible restart
if not OK, synchronize the exit, discarding the latest 1000 cycles.
1000 is just a guess - a good number would let the kernel check "every few seconds".
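A sketch of that periodic-check idea, assuming a kernel that loops internally (the flag name d_abortFlag and the constant CHECK_INTERVAL are made up for illustration):

```cuda
#define CHECK_INTERVAL 1000

__global__ void LoopingKernel(const float* pSrc, float* pDst,
                              int nNumElements, volatile int* d_abortFlag)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid, n = 0; i < nNumElements; i += stride, ++n)
    {
        float r = DoSomethingGreat(pSrc[i]);
        if (!isfinite(r))
            *d_abortFlag = 1;          // plain write; every failing thread writes the same value
        pDst[i] = r;

        // Only pay for the global read once every CHECK_INTERVAL cycles.
        if ((n % CHECK_INTERVAL) == 0 && *d_abortFlag)
            return;                    // the latest batch of results is discarded
    }
}
```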
A better approach would be to identify what leads to the NaN and correct it.
Could you please provide some pseudo-code for it? I understand what you are saying, and what I'm trying to do is similar, but it is not effective in my implementation. The kernel looks like the one below:
Without the “for” it is another scenario. Now, how many threads are you dispatching? How long does one instance of DoSomething() take? How many thread processors can your kernel use in parallel, given your problem and your target GPU?
As I understand it, you want to abort kernel execution from an arbitrary thread. That seems to me fairly simple: on failure, the failing thread writes to a flag in global memory. At the start of each thread's execution, check that global memory location to decide whether the thread should execute DoSomething() or return immediately. If you wanted, you could even check in the middle of DoSomething().
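A sketch of that scheme, assuming the single-call-per-thread kernel from the question (d_abortFlag is a hypothetical int in global memory, zeroed by the host before launch):

```cuda
__global__ void AbortableKernel(const float* pSrc, float* pDst,
                                int nNumElements, volatile int* d_abortFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nNumElements || *d_abortFlag)   // bail out early if any thread already failed
        return;

    float r = DoSomethingGreat(pSrc[i]);
    if (!isfinite(r))
    {
        *d_abortFlag = 1;   // plain write is enough; all failing threads write the same 1
        return;
    }
    pDst[i] = r;
}

// Host side after the launch:
// cudaMemcpy(&hFlag, d_abortFlag, sizeof(int), cudaMemcpyDeviceToHost);
// if (hFlag) { /* a non-finite value was met; the results are invalid */ }
```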
Given that synchronisation isn't really needed, I wouldn't think you'd need atomics. Then again, I've never needed to do anything like this, so you may have to.
Yes, this is pretty simple - but global memory reads are not fast, especially when all threads try to read from the same address (unless I'm missing something important, all threads will read this value sequentially due to non-coalesced access).
I believe it's uncoalesced on compute capability < 1.2 and coalesced on CC >= 1.2. On CC < 1.2 you could always have a single thread per warp/block (one of the two will be faster) read it into shared memory and then work from there. I don't think there is any other way to do it in a global manner. You could do it warp- or block-wise using shared memory if that was of any use.
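The per-block variant might look like this (one thread reads the flag into shared memory and the rest of the block reads the shared copy; again, d_abortFlag is the hypothetical global flag):

```cuda
__global__ void AbortableKernelShared(const float* pSrc, float* pDst,
                                      int nNumElements, volatile int* d_abortFlag)
{
    __shared__ int sAbort;

    if (threadIdx.x == 0)           // single global read per block
        sAbort = *d_abortFlag;
    __syncthreads();                // make sAbort visible to the whole block
    if (sAbort)
        return;                     // the whole block bails out together

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nNumElements)
        return;

    float r = DoSomethingGreat(pSrc[i]);
    if (!isfinite(r))
        *d_abortFlag = 1;
    else
        pDst[i] = r;
}
```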
EDIT: You're on CC 1.3 then - the Best Practices Guide section 3.2.1 (or Programming Guide 5.1.2.1) says this will be a single 32-byte transaction for every 16 threads. It might still be quicker to use the shared memory approach described above. I dunno.