Right way to terminate threads: exit or something else?

Dear All,

My program uses a 3D grid of 3D blocks to simulate a 3D world in a bio application. However, the data array doesn't cover all blocks, so some threads have no useful work to do.
I see two options for handling these idle threads:

  1. add an if and a return at the very beginning of the kernel;
  2. extend the memory to cover all blocks in the grid and filter out the needed data afterwards.

What is your general advice for keeping performance?

As far as I know, 3D grids are not supported, so there is no need to terminate threads: your kernel will not launch…

To answer your question: you should really run a benchmark. It depends on quite a few parameters.

Thank you. Now I understand “Maximum sizes of each dimension of a grid: 65535 x 65535 x 1” for my Quadro FX 3700M.

So what is a better configuration, then? To emulate a 3D array of cells as a set of 3D blocks of maximum size in a 2D grid of corresponding size?

Pretty much. You can always have an invertible function to map any cell of a hypervolume to a scalar value. So, what you would do is convert the block index (in x and y) to an integer, and then convert that integer to the appropriate index (xv, yv, zv) in your 3D volume. Alternatively, if you have enough blocks in the (x,y) plane (a few hundred, say), then keeping the same blocks, and having each march in z would be a viable solution.
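
A minimal sketch of the index mapping described above, assuming a volume of nx * ny * nz blocks with x varying fastest. The names `Cell`, `flattenBlock`, `unflatten`, `flatten` and `recordBlockCells` are hypothetical, chosen only for illustration. The helpers are marked so that they compile both under nvcc and as plain host C++, which makes the round trip easy to check on the CPU.

```cpp
// Compiles as CUDA (nvcc) or as plain C++ for host-side checking.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct Cell { int x, y, z; };

// Flatten a 2D block index into one integer (row-major over the grid).
HD inline int flattenBlock(int bx, int by, int gridDimX) {
    return bx + gridDimX * by;
}

// Turn a linear index into (x, y, z) inside an nx * ny * nz volume,
// with x varying fastest.
HD inline Cell unflatten(int linear, int nx, int ny) {
    Cell c;
    c.x = linear % nx;
    c.y = (linear / nx) % ny;
    c.z = linear / (nx * ny);
    return c;
}

// Forward mapping, shown only to make the invertibility explicit.
HD inline int flatten(Cell c, int nx, int ny) {
    return c.x + nx * (c.y + ny * c.z);
}

#ifdef __CUDACC__
// Hypothetical kernel launched on a 2D grid: each block records the 3D
// position it stands for in a volume of nx * ny * nz blocks.
__global__ void recordBlockCells(Cell* out, int nx, int ny, int nz) {
    int b = flattenBlock(blockIdx.x, blockIdx.y, gridDim.x);
    if (b >= nx * ny * nz) return;  // surplus blocks in the 2D grid
    bool firstThread = threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0;
    if (firstThread) out[b] = unflatten(b, nx, ny);
}
#endif
```

The 2D grid only needs at least nx * ny * nz blocks in total; how they are split between gridDim.x and gridDim.y does not matter for the mapping.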

Is this mapping function expected to slow the threads down? Or do operations such as / and % not hurt performance much?

The idea is that you do this mapping at the beginning of your kernel code. You also only need to do it once per block. So something like this:

[codebox]__shared__ uint3 block_index;

// Only the first thread of the block computes the mapping.
if ((threadIdx.x == 0) &&
    (threadIdx.y == 0) &&
    (threadIdx.z == 0)) {
    block_index = …;
}
__syncthreads();
// here comes the normal code.[/codebox]

Please take a look at the beginning of my kernel function below:

__global__ void Kernel(char* SourceArray, char* TargetArray, int XSize, int YSize, int ZSize)
{
	__shared__ int BlockInGrid;
	if ( (threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) )
		BlockInGrid = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z); // Serial ID of a block in grid
	__syncthreads(); // Make BlockInGrid visible to all threads before it is read

	int ThreadInBlock = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z); // Serial ID of a thread in block
	int NumberOfThreadsInBlock = blockDim.x * blockDim.y * blockDim.z; // Number of threads in block

	int ThreadID = ThreadInBlock + NumberOfThreadsInBlock * BlockInGrid; // Serial ID of a thread in grid

	int x = ThreadID % XSize + 1;
	int y = (ThreadID % (XSize * YSize)) / XSize + 1;
	int z = ThreadID / (XSize * YSize) + 1;

	if ( (x > 6) || (y > 6) || (z > 6) ) ???

Now I want the threads that satisfy the last if condition to terminate (not execute any further code).

What would you recommend putting in place of ???

Putting a return; there is the simplest option. However, you should be careful about where you do this. Your example should be fine, but doing something like this:

[code]
while ( condition ){
    if ( (x > 6) || (y > 6) || (z > 6) ) return;
}
[/code]

will wreak havoc on some of the compiler optimizations that handle thread divergence if ‘condition’ ever causes divergence. Basically you are fine as long as the statement surrounding the ‘return’ is never divergent.
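
For comparison, here is a rough sketch of the safe pattern: the guard sits at the top level of the kernel, not inside a loop or another branch. It assumes the 1-based coordinates from the kernel posted above and compares against the array sizes instead of the hard-coded 6. The names `insideVolume` and `copyCells` and the copy in the last line are hypothetical placeholders for the real work.

```cpp
// Compiles as CUDA (nvcc) or as plain C++ for host-side checking.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// True when the 1-based cell (x, y, z) lies inside the data array.
HD inline bool insideVolume(int x, int y, int z,
                            int xSize, int ySize, int zSize) {
    return x >= 1 && x <= xSize &&
           y >= 1 && y <= ySize &&
           z >= 1 && z <= zSize;
}

#ifdef __CUDACC__
// Hypothetical kernel on a 2D grid of 3D blocks.
__global__ void copyCells(const char* source, char* target,
                          int xSize, int ySize, int zSize) {
    // Every thread computes its own serial ID, so no shared memory
    // and no barrier are needed here.
    int blockInGrid = blockIdx.x + gridDim.x * blockIdx.y;
    int threadInBlock = threadIdx.x
                      + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int id = threadInBlock + threadsPerBlock * blockInGrid;

    int x = id % xSize + 1;
    int y = (id % (xSize * ySize)) / xSize + 1;
    int z = id / (xSize * ySize) + 1;

    // Top-level guard: threads outside the data array stop here.
    if (!insideVolume(x, y, z, xSize, ySize, zSize)) return;

    target[id] = source[id];  // placeholder for the real per-cell work
}
#endif
```

If the kernel calls __syncthreads() after the guard, it is generally safer to wrap the work in the condition and let every thread reach the barrier than to return early.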

Thank you, Gregory!

Could you please clarify what you mean by divergence in your example?

This would be a simple example that would create the problem that I mentioned.

for ( int i = 0; i < 100; ++i ){
  bool finished = doWork();
  if ( threadIdx.x & 1 ) if( finished ) return;
}

The statement (threadIdx.x & 1) would always cause divergence within a warp. The return statement would prevent re-convergence.
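
One possible way around this, sketched under assumptions: replace the return with a flag that ends the loop, so every thread leaves through the same exit and reaches the code after the loop. The two-argument `doWork` below is a made-up stand-in for the real function, and `runLoop` and `loopKernel` are hypothetical names. Whether this actually helps depends on the compiler and the hardware, so it is worth benchmarking.

```cpp
// Compiles as CUDA (nvcc) or as plain C++ for host-side checking.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Made-up stand-in for doWork(): reports "finished" once the iteration
// counter reaches a per-thread limit.
HD inline bool doWork(int lane, int i) {
    return i >= lane % 4;
}

// Runs the loop for one thread and returns how many iterations it did.
// A finished odd thread clears `active` and leaves through the loop
// condition; there is no return inside the divergent branch.
HD inline int runLoop(int lane) {
    int iterations = 0;
    bool active = true;
    for (int i = 0; i < 100 && active; ++i) {
        bool finished = doWork(lane, i);
        ++iterations;
        if ((lane & 1) && finished) active = false;
    }
    return iterations;
}

#ifdef __CUDACC__
// Hypothetical kernel: every thread reaches the final store.
__global__ void loopKernel(int* iterationsOut) {
    int lane = threadIdx.x;
    iterationsOut[blockIdx.x * blockDim.x + lane] = runLoop(lane);
}
#endif
```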
