float* pHugeSourceArray;
float* pHugeDestinationArray;
for (int i = 0; i < nNumHugeArrayElements; i++)
{
    pHugeDestinationArray[i] = DoSomethingGreat(pHugeSourceArray[i]);
}
As DoSomethingGreat() is completely independent of any previous calculations, this task is perfectly parallel and maps well to CUDA (each thread makes its single call to DoSomethingGreat()).
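For reference, a minimal sketch of what that kernel might look like (DoSomethingGreat and the array names are taken from the snippet above; the launch configuration is just an assumption):

```cuda
// One thread per element; each thread makes its single call.
__global__ void MapKernel(const float* pSrc, float* pDst, int nNumElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nNumElements)
        pDst[i] = DoSomethingGreat(pSrc[i]);
}

// Host side (error checking omitted):
// int threads = 256;
// int blocks  = (nNumHugeArrayElements + threads - 1) / threads;
// MapKernel<<<blocks, threads>>>(d_src, d_dst, nNumHugeArrayElements);
```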
Sometimes (quite often) the result of DoSomethingGreat() is NaN or INF (in other words, isfinite(DoSomethingGreat()) would return 0). In that case all the calculations for the nNumHugeArrayElements elements are useless, and it is necessary to stop the kernel and inform the calling side that a non-finite number has been encountered.
How can this be done efficiently? The obvious approaches seem to drop performance significantly…
I would classify this among the obvious approaches, but: if you set a check condition and test it every cycle, it will need synchronization, and so it will be slow. However, if you set it up to check every, let's say, 1000 cycles, you are OK. So every 1000 cycles you would:
check the NaN condition
if OK, synchronize the save point for a possible restart
if not OK, synchronize the exit, discarding the latest 1000 cycles.
1000 is just a guess - a good number would let the kernel check "every few seconds".
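A sketch of that periodic-check idea, assuming a kernel that loops internally (the flag name d_abortFlag and the constant CHECK_INTERVAL are made up for illustration):

```cuda
#define CHECK_INTERVAL 1000

__global__ void LoopingKernel(const float* pSrc, float* pDst,
                              int nNumElements, volatile int* d_abortFlag)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid, n = 0; i < nNumElements; i += stride, ++n)
    {
        float r = DoSomethingGreat(pSrc[i]);
        if (!isfinite(r))
            *d_abortFlag = 1;          // plain write; every failing thread writes the same value
        pDst[i] = r;

        // Only pay for the global read once every CHECK_INTERVAL cycles.
        if ((n % CHECK_INTERVAL) == 0 && *d_abortFlag)
            return;                    // the latest batch of results is discarded
    }
}
```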
A better approach would be to identify what leads to the NaN and correct it.
Could you please provide some pseudo-code for it? I understand what you are saying, and what I'm trying to do is similar, but it is not effective in my implementation. The kernel looks like the one below:
Without the “for” it is another scenario. Now, how many threads are you dispatching? How long does one instance of DoSomething() take? How many thread processors can your kernel use in parallel, given your problem and your target GPU?
As I understand it, you want to abort kernel execution from an arbitrary thread. That seems to me fairly simple: on failure, the failing thread writes to a flag in global memory. At the start of each thread's execution, check that global memory location to decide whether the thread should execute DoSomething() or return immediately. If you wanted, you could even check in the middle of DoSomething().
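A sketch of that scheme, assuming the single-call-per-thread kernel from the question (d_abortFlag is a hypothetical int in global memory, zeroed by the host before launch):

```cuda
__global__ void AbortableKernel(const float* pSrc, float* pDst,
                                int nNumElements, volatile int* d_abortFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nNumElements || *d_abortFlag)   // bail out early if any thread already failed
        return;

    float r = DoSomethingGreat(pSrc[i]);
    if (!isfinite(r))
    {
        *d_abortFlag = 1;   // plain write is enough; all failing threads write the same 1
        return;
    }
    pDst[i] = r;
}

// Host side after the launch:
// cudaMemcpy(&hFlag, d_abortFlag, sizeof(int), cudaMemcpyDeviceToHost);
// if (hFlag) { /* a non-finite value was met; the results are invalid */ }
```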
Given that synchronisation isn't really needed, I wouldn't think you'd need atomics. Then again, I've never needed to do anything like this, so you may have to.
Yes, this is pretty simple - but global memory reads are not fast, especially when all threads try to read from the same address (unless I'm missing something important, all threads will read this value sequentially due to non-coalesced access).
I believe it's uncoalesced on compute capability < 1.2 and coalesced on CC >= 1.2. On CC < 1.2 you could always have a single thread per warp/block (one of the two will be faster) read it into shared memory and then work from there. I don't think there is any other way to do it in a global manner. You could do it warp- or block-wise using shared memory if that was of any use.
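The per-block variant might look like this (one thread reads the flag into shared memory and the rest of the block reads the shared copy; again, d_abortFlag is the hypothetical global flag):

```cuda
__global__ void AbortableKernelShared(const float* pSrc, float* pDst,
                                      int nNumElements, volatile int* d_abortFlag)
{
    __shared__ int sAbort;

    if (threadIdx.x == 0)           // single global read per block
        sAbort = *d_abortFlag;
    __syncthreads();                // make sAbort visible to the whole block
    if (sAbort)
        return;                     // the whole block bails out together

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nNumElements)
        return;

    float r = DoSomethingGreat(pSrc[i]);
    if (!isfinite(r))
        *d_abortFlag = 1;
    else
        pDst[i] = r;
}
```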
EDIT: You're on CC 1.3 then - the Best Practices Guide section 3.2.1 (or Programming Guide 5.1.2.1) says this will be a single 32-byte transaction for every 16 threads. It might still be quicker to use the shared memory approach described above. I dunno.