Right way to terminate threads... ...exit or something else?

AKazak · October 25, 2010, 11:26am

Dear All,

My program uses a 3D grid of 3D blocks to simulate a 3D world in a bio application. However, the data array doesn’t cover all blocks, so some threads have no data to work on.
I see two options for handling these idle threads:

1. add an IF and RETURN at the very beginning of the kernel;
2. extend the memory to cover all blocks in the grid and filter out the needed data afterwards.

What is your general advice for keeping performance up?

E.D_Riedijk · October 25, 2010, 11:40am

As far as I know, 3D grids are not supported, so there is no need to terminate threads: your kernel will not launch…

To answer your question: you should really run a benchmark. It depends on quite a few parameters.
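
A minimal sketch of such a benchmark, assuming a flat 1D array and a trivial placeholder workload. The names `withGuard`, `withPadding`, `paddedSize` and `timeBothVariants` are made up for illustration and are not from the original program. It times the two options from the question (guard and return versus padded memory) with CUDA events:

```cuda
#include <cuda_runtime.h>

// Round n up to a whole number of blocks (used for the padded variant).
__host__ __device__ inline int paddedSize(int n, int blockSize)
{
    return ((n + blockSize - 1) / blockSize) * blockSize;
}

// Variant 1: exact-size data, surplus threads return immediately.
__global__ void withGuard(char* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    data[i] += 1; // placeholder work
}

// Variant 2: data padded to a multiple of the block size, no guard needed.
__global__ void withPadding(char* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1; // placeholder work
}

// Times both variants with CUDA events; returns false if a CUDA call fails.
bool timeBothVariants(int n, int blockSize, float* guardMs, float* paddedMs)
{
    int padded = paddedSize(n, blockSize);
    int blocks = padded / blockSize;
    char* data = 0;
    if (cudaMalloc((void**)&data, padded) != cudaSuccess) return false;
    cudaMemset(data, 0, padded);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    withGuard<<<blocks, blockSize>>>(data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(guardMs, start, stop);

    cudaEventRecord(start, 0);
    withPadding<<<blocks, blockSize>>>(data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(paddedMs, start, stop);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return ok;
}
```

The first launch in a process usually carries start-up overhead, so it is probably worth calling `timeBothVariants` several times and ignoring the first result. The real kernel body should replace the placeholder work before drawing any conclusion.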

AKazak · October 25, 2010, 11:57am

Thank you. Now I understand the “Maximum sizes of each dimension of a grid: 65535 x 65535 x 1” line reported for my Quadro FX 3700M.

So what is a better configuration, then? Should I emulate the 3D array of cells as a set of maximum-size 3D blocks in a 2D grid of corresponding size?

YDD · October 25, 2010, 5:40pm

Pretty much. You can always have an invertible function to map any cell of a hypervolume to a scalar value. So, what you would do is convert the block index (in x and y) to an integer, and then convert that integer to the appropriate index (xv, yv, zv) in your 3D volume. Alternatively, if you have enough blocks in the (x,y) plane (a few hundred, say), then keeping the same blocks, and having each march in z would be a viable solution.
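
Both suggestions can be sketched roughly as follows. This is an illustration only: the helper names `foldBlockIndex` and `unfoldBlockIndex`, the kernel names, the parameters `blocksX` and `blocksY` (number of blocks the volume needs along x and y), the 0-based coordinates and the placeholder work are all assumptions, not part of the original program.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: fold a 2D block index (bx, by) into one integer.
__host__ __device__ inline unsigned int foldBlockIndex(unsigned int bx, unsigned int by,
                                                       unsigned int gridWidth)
{
    return bx + gridWidth * by;
}

// Hypothetical helper: unfold that integer into a 3D block coordinate for a
// volume that is blocksX by blocksY blocks wide (z takes what is left over).
__host__ __device__ inline uint3 unfoldBlockIndex(unsigned int id, unsigned int blocksX,
                                                  unsigned int blocksY)
{
    uint3 v;
    v.x = id % blocksX;
    v.y = (id / blocksX) % blocksY;
    v.z = id / (blocksX * blocksY);
    return v;
}

// Option 1: a 2D grid of 3D blocks emulates a 3D grid of blocks.
__global__ void emulated3DGrid(char* cells, int sizeX, int sizeY, int sizeZ,
                               unsigned int blocksX, unsigned int blocksY)
{
    uint3 b = unfoldBlockIndex(foldBlockIndex(blockIdx.x, blockIdx.y, gridDim.x),
                               blocksX, blocksY);
    int x = b.x * blockDim.x + threadIdx.x;
    int y = b.y * blockDim.y + threadIdx.y;
    int z = b.z * blockDim.z + threadIdx.z;
    if (x >= sizeX || y >= sizeY || z >= sizeZ) return; // thread has no cell
    cells[x + sizeX * (y + sizeY * z)] = 1;             // placeholder work
}

// Option 2: 2D blocks tile the (x, y) plane and each thread marches in z.
__global__ void marchInZ(char* cells, int sizeX, int sizeY, int sizeZ)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= sizeX || y >= sizeY) return;               // thread has no column
    for (int z = 0; z < sizeZ; ++z)
        cells[x + sizeX * (y + sizeY * z)] = 1;         // placeholder work
}
```

For option 1, one possible launch is `dim3 grid(blocksX * blocksY, blocksZ)`, provided `blocksX * blocksY` stays within the 65535 limit; any other 2D shape with at least `blocksX * blocksY * blocksZ` blocks should also work, since surplus blocks decode to a z beyond the volume and return.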

AKazak · October 25, 2010, 7:37pm

Is this mapping function expected to slow down thread performance? Or do operations such as / and % not hamper the process that much?

E.D_Riedijk · October 26, 2010, 5:16am

The idea is that you do this mapping at the beginning of your kernel code. You also only need to do it once per block. So something like below:

[codebox]__shared__ uint3 block_index;

// Only the first thread of the block computes the mapping.
if ((threadIdx.x == 0) &&
    (threadIdx.y == 0) &&
    (threadIdx.z == 0)) {
    block_index = …;
}

// Wait until block_index has been written before any thread reads it.
__syncthreads();

// here comes the normal code.[/codebox]

AKazak · November 5, 2010, 6:19pm

Please take a look at the beginning of my kernel function below:

__global__ void Kernel(char* SourceArray, char* TargetArray, int XSize, int YSize, int ZSize)
{
	__shared__ int BlockInGrid;

	if ( (threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) )
		BlockInGrid = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z); // Serial ID of a block in grid

	__syncthreads(); // Wait until BlockInGrid has been written before other threads read it

	int ThreadInBlock = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z); // Serial ID of a thread in block
	int NumberOfThreadsInBlock = blockDim.x * blockDim.y * blockDim.z; // Number of threads in block

	int ThreadID = ThreadInBlock + NumberOfThreadsInBlock * BlockInGrid; // Serial ID of a thread in grid

	// 1-based cell coordinates
	int x = ThreadID % XSize + 1;
	int y = (ThreadID % (XSize * YSize)) / XSize + 1;
	int z = ThreadID / (XSize * YSize) + 1;

	if ( (x > 6) || (y > 6) || (z > 6) ) ???

Now I want the threads that satisfy the last if condition to terminate (not execute any further code).

What would you recommend putting in place of ???

Gregory_Diamos · November 6, 2010, 8:21pm

Putting a return; is the simplest. However, you should be careful about where you do this. This example should be fine, but doing something like this:

[code]
while ( condition ){
    if ( (x > 6) || (y > 6) || (z > 6) ) return;
}

…
[/code]

will wreak havoc on some of the compiler optimizations that handle thread divergence if ‘condition’ ever causes divergence. Basically you are fine as long as the statement surrounding the ‘return’ is never divergent.
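
For the safe case, a minimal sketch of a top-level guard might look like this. It assumes a 2D grid of 3D blocks, 0-based coordinates (unlike the 1-based ones in the question) and a placeholder copy as the workload; `insideVolume` and `stepCells` are hypothetical names:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true when (x, y, z) is a real cell of the volume.
__host__ __device__ inline bool insideVolume(int x, int y, int z,
                                             int sizeX, int sizeY, int sizeZ)
{
    return x < sizeX && y < sizeY && z < sizeZ;
}

__global__ void stepCells(const char* source, char* target,
                          int sizeX, int sizeY, int sizeZ)
{
    // Serial thread ID in a 2D grid of 3D blocks.
    int block = blockIdx.x + gridDim.x * blockIdx.y;
    int threadInBlock = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int id = threadInBlock + blockDim.x * blockDim.y * blockDim.z * block;

    // 0-based cell coordinates recovered from the serial ID.
    int x = id % sizeX;
    int y = (id / sizeX) % sizeY;
    int z = id / (sizeX * sizeY);

    // Top-level guard: surplus threads leave before any loop is entered.
    if (!insideVolume(x, y, z, sizeX, sizeY, sizeZ)) return;

    target[id] = source[id]; // placeholder for the real update
}
```

Because the guard sits at the top of the kernel rather than inside a loop or another branch, the statement surrounding the `return` is not divergent, which is the condition described above.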

AKazak · November 6, 2010, 9:13pm

Thank you, Gregory!

Could you please clarify what you mean by divergence in your example?

Gregory_Diamos · November 6, 2010, 9:45pm

This would be a simple example that would create the problem that I mentioned.

for ( int i = 0; i < 100; ++i ){
  bool finished = doWork();
  if ( threadIdx.x & 1 ) if( finished ) return;
}

The statement (threadIdx.x & 1) would always cause divergence within a warp. The return statement would prevent re-convergence.
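
One possible way to restructure such a loop, offered only as a sketch, is to have a finished thread set a flag instead of returning from inside the divergent branch, so that every thread leaves the loop at the same place. Here `doWork(int)` is a hypothetical stand-in (the original `doWork()` is not shown), and `runLoop` and `flaggedLoopKernel` are made-up names:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for doWork(): reports "finished" from the 10th step on.
__host__ __device__ inline bool doWork(int step)
{
    return step >= 9;
}

// Same loop shape as above, but with a flag instead of an early return.
// Returns how many times doWork() actually ran for this thread.
__host__ __device__ inline int runLoop(unsigned int tid)
{
    bool done = false;
    int calls = 0;
    for (int i = 0; i < 100; ++i) {
        if (!done) {
            bool finished = doWork(i);
            ++calls;
            if ((tid & 1) && finished) done = true; // odd threads stop working
        }
    }
    return calls; // single exit point, reached by every thread
}

// callsPerThread must hold gridDim.x * blockDim.x ints.
__global__ void flaggedLoopKernel(int* callsPerThread)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    callsPerThread[tid] = runLoop(threadIdx.x);
}
```

The trade-off is that finished threads still step through the remaining iterations doing nothing, so whether this beats the early `return` depends on the compiler and hardware and is worth measuring.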
