Abort kernel

Any idea how I can abort the currently running kernel and force it to return the results to the CPU?

Let me explain a bit more… I’m making a raycaster which has the following function:

__device__ bool HitTest ( const sTRIANGLE* tris, float3 rayOrig, float3 rayDir );

It just tests whether the ray hits any triangle in a set loaded with cuMalloc() into a 1D linear buffer.

The .cu kernel just iterates over the triangle set until a hit is found.

If the ray hits any triangle, I need to abort all the other tests and stop the kernel (as an optimization).

I’m using 128 threads per block, with each block covering 1024 triangles. The triangle set is a 6M-polygon mesh (72 MB).
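
For context, the upload is roughly this (just a sketch using the runtime API; hostTris and nTris are placeholder names for my host-side data):

sTRIANGLE* devTris = 0;
cudaMalloc( (void**)&devTris, nTris * sizeof(sTRIANGLE) );                          // 1D linear buffer on the device
cudaMemcpy( devTris, hostTris, nTris * sizeof(sTRIANGLE), cudaMemcpyHostToDevice ); // copy the mesh up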

I could do:

__device__ bool hitFound = false;

__device__ bool HitTest ( const sTRIANGLE* tris, float3 rayOrig, float3 rayDir )
{
    if ( hitFound )
    {
        return true;
    }
    else
    {
        // perform the ray-triangle test
        // blah blah.. some boring maths...
        if ( testGaveHit )
        {
            hitFound = true;
            return true;
        }
    }
    return false;
}

But I bet that won’t be very efficient… Is there any built-in function to force the kernel to stop executing?

Any ideas on how to implement this better, please? Or do I just need to divide the work into smaller launches and test the result block by block instead (like it is done in the 1dhaar example)?

Could you include a basic parallelized raycaster in the SDK examples?

Use a shared memory variable for “hitFound”. Use thread 0 to initialize it to false.

Threads should ONLY write to it if they find a hit; if they do, they write true. After the write to it, __syncthreads(). Then have all threads check it and return if it is true, or continue if it is not.

Even though multiple threads will be writing the variable at once, the only possible value that they are writing is “true”, so if it is ever true, you know all threads should quit.

Let me know if this isn’t clear and I will write some pseudocode.

Mark

Something like this???

#define BLOCK_SIZE        1024
#define THREADS_PER_BLOCK 128

struct sRAY
{
    float3 orig, dir;
};

struct sTRIANGLE_VERTEX
{
    float x, y, z;
};

struct sTRIANGLE
{
    sTRIANGLE_VERTEX v[3];
};

__shared__ bool hitFound;

__global__ void InitHitFound ()
{
    hitFound = false;
}

__global__ void HitTestGPU ( const sTRIANGLE* tris, const sRAY* r, bool* res )
{
    if ( !hitFound )
    {
        const sTRIANGLE* tri = &tris[(blockIdx.x*BLOCK_SIZE) + threadIdx.x];

        // blah blah maths....

        if ( rayHitsTri )
        {
            hitFound = true;
            __syncthreads();
            *res = true;
        }
    }
}

void HitTest ( const sTRIANGLE* tris, const unsigned int nTris, const sRAY* r, bool* res )
{
    InitHitFound<<<1,1>>>(); // This will init the shared hitFound to false

    const unsigned int nBlocks = (unsigned int)(ceilf((float)nTris/(float)BLOCK_SIZE));
    HitTestGPU<<<nBlocks,THREADS_PER_BLOCK>>> (tris,r,res);
}

???

Shared data has the lifetime of a launch, hence you cannot use one launch to initialize data used in a subsequent launch. I think what Mark means is something like:

__shared__ int x;

__global__ void f(void)
{
    if (threadIdx.x == 0) {
        x = 0;
    }
    __syncthreads();

    // from here on, every thread can access x
}
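
Applied to your HitTestGPU kernel, that pattern might look roughly like the sketch below. This is only a sketch: rayHitsTriangle() is a placeholder for your intersection maths, nTris is assumed to be passed in for the bounds check, res is assumed to be cleared to false before the launch, and hitFound is marked volatile so the repeated reads inside the loop are not optimized away.

__device__ bool rayHitsTriangle ( const sTRIANGLE* tri, const sRAY* r )
{
    // placeholder for the real ray-triangle intersection maths
    return false;
}

__global__ void HitTestGPU ( const sTRIANGLE* tris, unsigned int nTris, const sRAY* r, bool* res )
{
    __shared__ volatile bool hitFound;          // per-block flag, lives only for this launch

    if ( threadIdx.x == 0 )
    {
        hitFound = false;                       // thread 0 initializes it
    }
    __syncthreads();                            // every thread now sees hitFound == false

    // Each block owns BLOCK_SIZE consecutive triangles; each thread strides through them.
    const unsigned int first = blockIdx.x * BLOCK_SIZE;
    const unsigned int last  = ( first + BLOCK_SIZE < nTris ) ? first + BLOCK_SIZE : nTris;

    for ( unsigned int i = first + threadIdx.x; i < last; i += blockDim.x )
    {
        if ( hitFound )                         // some thread in this block already hit: stop looping
        {
            break;
        }
        if ( rayHitsTriangle( &tris[i], r ) )
        {
            hitFound = true;                    // only ever written as true, so the race is benign
            *res     = true;                    // global result the host copies back
        }
    }
}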

But will that be efficient? The threads in each block will still execute the “if ( 0==threadIdx.x )” and the “if ( !hitFound )” checks for all 6M triangles… that just skips some maths inside the ray-triangle test branch… Wouldn’t a built-in instruction to abort the currently executing kernel (abort all the threads in the block, abort all the blocks) be more efficient?

Another option could be to do consecutive block tests on the host side, like:

void HitTest ( const sTRIANGLE* tris, const unsigned int nTris, const sRAY* r, bool* result )
{
    const unsigned int nBlocks = (unsigned int)(ceilf((float)nTris/(float)BLOCK_SIZE));
    bool hostResult = false;

    for ( unsigned int i = 0; i < nBlocks && !hostResult; ++i )
    {
        // test the next block of BLOCK_SIZE triangles
        HitTestGPU<<<1,THREADS_PER_BLOCK>>> ( tris + i*BLOCK_SIZE, r, result );

        // result lives in device memory, so copy it back before the host can test it
        cudaMemcpy( &hostResult, result, sizeof(bool), cudaMemcpyDeviceToHost );
    }
}

But I bet this will kill performance and lose the benefits of CUDA?
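
For reference, a rough sketch of combining the two ideas already in this thread: the global __device__ flag from the first post (so later blocks skip the maths) plus chunked launches (so the host can stop early). It is only a sketch: rayHitsTri stands in for the intersection maths, blocksPerChunk is an arbitrary pick, and cudaMemcpyToSymbol/cudaMemcpyFromSymbol are used to reset and read the flag between launches.

__device__ bool d_hitFound;   // global flag: checked by every block, only ever set to true

__global__ void HitTestChunkGPU ( const sTRIANGLE* tris, unsigned int n, const sRAY* r )
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if ( d_hitFound || idx >= n )               // hit already found, or past the end of this chunk
    {
        return;
    }

    const sTRIANGLE* tri = &tris[idx];
    // ... ray-triangle maths using tri and r ...
    bool rayHitsTri = false;                    // placeholder for the result of the maths

    if ( rayHitsTri )
    {
        d_hitFound = true;                      // benign race: the only value ever written is true
    }
}

void HitTest ( const sTRIANGLE* tris, const unsigned int nTris, const sRAY* r, bool* hit )
{
    bool h = false;
    cudaMemcpyToSymbol( d_hitFound, &h, sizeof(bool) );           // reset the flag on the device

    const unsigned int blocksPerChunk = 64;                       // arbitrary; tune for the hardware
    const unsigned int chunk   = blocksPerChunk * THREADS_PER_BLOCK;
    const unsigned int nChunks = (nTris + chunk - 1) / chunk;

    for ( unsigned int i = 0; i < nChunks && !h; ++i )
    {
        const unsigned int remaining = nTris - i*chunk;
        const unsigned int count     = ( remaining < chunk ) ? remaining : chunk;

        HitTestChunkGPU<<<blocksPerChunk, THREADS_PER_BLOCK>>>( tris + i*chunk, count, r );
        cudaMemcpyFromSymbol( &h, d_hitFound, sizeof(bool) );     // read the flag back between launches
    }

    *hit = h;   // hit is assumed to point to host memory here
}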