My program uses a 3D grid of 3D blocks to simulate a 3D world in a bio application. However, the data array doesn't cover all blocks, so some threads have no work to do.
I see two options for handling these idle threads:
add an IF and RETURN at the very beginning of the kernel;
extend the memory to cover all blocks in the grid and filter out the needed data afterwards.
What is your general advice on keeping performance?
Pretty much. You can always have an invertible function to map any cell of a hypervolume to a scalar value. So, what you would do is convert the block index (in x and y) to an integer, and then convert that integer to the appropriate index (xv, yv, zv) in your 3D volume. Alternatively, if you have enough blocks in the (x,y) plane (a few hundred, say), then keeping the same blocks, and having each march in z would be a viable solution.
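A minimal sketch of the invertible mapping described above, assuming the volume is divided into `blocksX` by `blocksY` blocks in its x-y plane and the kernel is launched on a 2D grid. The helper names (`linearBlockId`, `blockCoordsInVolume`, `volumeCoordsToId`) and the parameters are hypothetical and not from the original posts.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: flatten a 2D block index (bx, by) into one integer.
__host__ __device__ inline int linearBlockId(int bx, int by, int gridDimX)
{
    return bx + gridDimX * by;
}

// Hypothetical helper: turn that integer into block coordinates (xv, yv, zv)
// of a volume that is blocksX by blocksY blocks in its x-y plane.
__host__ __device__ inline int3 blockCoordsInVolume(int id, int blocksX, int blocksY)
{
    int3 c;
    c.x = id % blocksX;
    c.y = (id / blocksX) % blocksY;
    c.z = id / (blocksX * blocksY);
    return c;
}

// Inverse direction: (xv, yv, zv) back to the scalar ID.
__host__ __device__ inline int volumeCoordsToId(int3 c, int blocksX, int blocksY)
{
    return c.x + blocksX * (c.y + blocksY * c.z);
}
```

Inside a kernel, one would call `blockCoordsInVolume(linearBlockId(blockIdx.x, blockIdx.y, gridDim.x), blocksX, blocksY)`. Blocks whose resulting `z` falls outside the volume simply have nothing to do.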
The idea is that you do this mapping at the beginning of your kernel code. You also only need to do it once for an entire block. So something like below:
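A minimal sketch of doing the mapping once per block, under assumptions that are not in the original posts: a 2D grid, a 4x4x4 thread block, and a placeholder copy as the per-cell work. The names `tileKernel`, `tileOrigin`, and `runTileKernel` are hypothetical. One thread computes the block's position in the volume, stores it in shared memory, and a `__syncthreads()` makes it visible before the other threads read it.

```cuda
#include <cuda_runtime.h>

// Sketch: each block of a 2D grid handles one tile of an nx*ny*nz volume.
__global__ void tileKernel(const char* src, char* dst, int nx, int ny, int nz)
{
    __shared__ int3 tileOrigin;  // first cell of this block's tile

    // Only one thread per block performs the block-level mapping.
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        int bx = blockDim.x, by = blockDim.y, bz = blockDim.z;
        int tilesX = (nx + bx - 1) / bx;
        int tilesY = (ny + by - 1) / by;
        int id = blockIdx.x + gridDim.x * blockIdx.y;  // serial block ID
        tileOrigin.x = (id % tilesX) * bx;
        tileOrigin.y = ((id / tilesX) % tilesY) * by;
        tileOrigin.z = (id / (tilesX * tilesY)) * bz;
    }
    __syncthreads();  // make tileOrigin visible to every thread in the block

    int x = tileOrigin.x + threadIdx.x;
    int y = tileOrigin.y + threadIdx.y;
    int z = tileOrigin.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;  // cell lies outside the data

    int i = x + nx * (y + ny * z);
    dst[i] = src[i];  // placeholder for the real per-cell work
}

// Hypothetical host wrapper: runs tileKernel on a host volume.
// Returns true if all CUDA calls succeeded.
bool runTileKernel(const char* hostSrc, char* hostDst, int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * nz;
    char *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc(&dSrc, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dDst, bytes) != cudaSuccess) { cudaFree(dSrc); return false; }
    cudaMemcpy(dSrc, hostSrc, bytes, cudaMemcpyHostToDevice);

    const int edge = 4;  // assumed block edge length (4x4x4 threads)
    dim3 block(edge, edge, edge);
    int tilesX = (nx + edge - 1) / edge;
    int tilesY = (ny + edge - 1) / edge;
    int tilesZ = (nz + edge - 1) / edge;
    dim3 grid(tilesX * tilesY, tilesZ);  // 2D grid covering all tiles
    tileKernel<<<grid, block>>>(dSrc, dDst, nx, ny, nz);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // The blocking copy also waits for the kernel to finish.
    ok = (cudaMemcpy(hostDst, dDst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    cudaFree(dSrc);
    cudaFree(dDst);
    return ok;
}
```

For a 6x6x6 volume this launches a 4x2 grid of blocks. Threads in the partially filled tiles fail the bounds check and exit after the barrier.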
Please take a look at the beginning of my kernel function below:
[code]
__global__ void Kernel(char* SourceArray, char* TargetArray, int XSize, int YSize, int ZSize)
{
    __shared__ int BlockInGrid;
    if ( (threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) )
        BlockInGrid = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z); // Serial ID of a block in grid
    __syncthreads(); // Make BlockInGrid visible to all threads before it is read
    int ThreadInBlock = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z); // Serial ID of a thread in block
    int NumberOfThreadsInBlock = blockDim.x * blockDim.y * blockDim.z; // Number of threads in block
    int ThreadID = ThreadInBlock + NumberOfThreadsInBlock * BlockInGrid; // Serial ID of a thread in grid
    int x = ThreadID % XSize + 1;
    int y = (ThreadID % (XSize * YSize)) / XSize + 1;
    int z = ThreadID / (XSize * YSize) + 1;
    if ( (x > 6) || (y > 6) || (z > 6) ) ???
[/code]
Now I want the threads that satisfy the last if condition to terminate (not execute the rest of the code).
Putting a return; is the simplest. However, you should be careful about where you do this. This example should be fine, but doing something like this:
[code]
while ( condition ){
    if ( (x > 6) || (y > 6) || (z > 6) ) return;
}
…
[/code]
will wreak havoc on some of the compiler optimizations that handle thread divergence if ‘condition’ ever causes divergence. Basically you are fine as long as the statement surrounding the ‘return’ is never divergent.
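One way to avoid a `return` inside possibly divergent code is to compute a per-thread flag once and wrap the work in it, so that every thread still reaches any later barrier. The following is a minimal sketch under assumptions not in the original posts: a 3D grid, a 4x4x4 thread block, and a placeholder copy through shared memory. The names `guardedKernel`, `active`, and `runGuardedKernel` are hypothetical.

```cuda
#include <cuda_runtime.h>

#define BLOCK_EDGE 4  // assumed block edge length (4x4x4 threads)

// Sketch: threads outside the data skip the work but stay alive for the barrier.
__global__ void guardedKernel(const char* src, char* dst, int nx, int ny, int nz)
{
    __shared__ char tile[BLOCK_EDGE * BLOCK_EDGE * BLOCK_EDGE];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // Evaluate the guard once, outside of any loop.
    bool active = (x < nx) && (y < ny) && (z < nz);
    int i = active ? x + nx * (y + ny * z) : 0;
    int t = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);

    if (active) tile[t] = src[i];
    __syncthreads();  // reached by all threads, active or not
    if (active) dst[i] = tile[t];  // placeholder for the real per-cell work
}

// Hypothetical host wrapper: runs guardedKernel on a host volume.
// Returns true if all CUDA calls succeeded.
bool runGuardedKernel(const char* hostSrc, char* hostDst, int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * nz;
    char *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc(&dSrc, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dDst, bytes) != cudaSuccess) { cudaFree(dSrc); return false; }
    cudaMemcpy(dSrc, hostSrc, bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK_EDGE, BLOCK_EDGE, BLOCK_EDGE);
    dim3 grid((nx + BLOCK_EDGE - 1) / BLOCK_EDGE,
              (ny + BLOCK_EDGE - 1) / BLOCK_EDGE,
              (nz + BLOCK_EDGE - 1) / BLOCK_EDGE);
    guardedKernel<<<grid, block>>>(dSrc, dDst, nx, ny, nz);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // The blocking copy also waits for the kernel to finish.
    ok = (cudaMemcpy(hostDst, dDst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    cudaFree(dSrc);
    cudaFree(dDst);
    return ok;
}
```

When the kernel has no barrier after the guard, the plain `return;` at the top is shorter and should behave the same. The flag form mainly matters once `__syncthreads()` or a loop follows the guard.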