intra-warp min value

Hiya,

How can I efficiently compute the minimum of the values that belong to one warp? For example, this is the unoptimized code:

#define NUM_THREADS_PER_BLOCK 128

__shared__ float value[NUM_THREADS_PER_BLOCK]; // this will be filled with data

if ( 0==(threadIdx.x%32) )
{
    unsigned int i;
    for ( i=threadIdx.x+1; i<threadIdx.x+32; ++i )
    {
        value[threadIdx.x] = fminf(value[i],value[threadIdx.x]);
    }
    // at this point the intra-warp minimum is stored in value[threadIdx.x]
}

This code is very slow: first because it is not parallel (one thread per warp does all the work), and second because of bank conflicts.

I’ve taken a look at the reduction example, but I don’t think it can be adapted efficiently to an intra-warp min() operation because of bank conflicts.

thx.

In your code you have 1 thread do 31 fminf’s and you have a very expensive % operator.

I would think it is possible with an easily modified reduction. Something along these lines; note that no ifs are needed at all, just the index of the first thread in the warp:

value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+16]);
value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+8]);
value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+4]);
value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+2]);
value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+1]);

index = (threadIdx.x / 32) * 32; // integer division already truncates, so no floor() is needed
// at this point value[index] has the intra-warp minimum value

But a % by a constant is supposed to be converted automatically by nvcc to (i&(n-1)) when the constant is a power of two, isn’t it?

But if threadIdx.x is the first thread in the warp, what happens, for example, to threadIdx.x+3?

I don’t know about the automatic conversion, but you could always check the generated PTX.
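On the % question: for unsigned values such as threadIdx.x, x % 32 and x & 31 always give the same result, so the mask can be written by hand instead of relying on the compiler. A minimal sketch; the helper names laneByModulo, laneByMask and warpBase are made up for illustration and do not come from the reduction example:

```cuda
// Hypothetical helpers, assuming a warp size of 32 and an unsigned thread index.

// Position of a thread inside its warp, written with the modulo operator.
__host__ __device__ inline unsigned int laneByModulo(unsigned int tid)
{
    return tid % 32u;
}

// Same value, written with a bit mask (valid because 32 is a power of two).
__host__ __device__ inline unsigned int laneByMask(unsigned int tid)
{
    return tid & 31u;
}

// Index of the first thread of the warp that tid belongs to.
__host__ __device__ inline unsigned int warpBase(unsigned int tid)
{
    return tid & ~31u;
}
```

Inside a kernel, warpBase(threadIdx.x) gives the same number as (threadIdx.x / 32) * 32.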

The code is just copied and pasted from the reduction example. You need to remember that all threads of a warp run at the same time, so in each step every thread reads its inputs before any thread writes its result.

So for example if there were only 4 threads per warp (to keep it simple, and I padded the array):

value = [1 1 1 1 1 1 1 1]
value[idx] += value[idx + 4]
value = [2 2 2 2 1 1 1 1]
value[idx] += value[idx + 2]
value = [4 4 3 3 1 1 1 1]
value[idx] += value[idx + 1]
value = [8 7 6 4 1 1 1 1]

You see that the first element is now the sum of all the initial elements? The rest of the array is modified too, but that does not matter, since you are only interested in the first element.
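The 4-thread walk-through can be reproduced with a toy kernel. This is only a sketch: the names toySum and runToySum are invented here, and the read and the write of each step are separated by __syncthreads() so that the result does not depend on the threads running in lockstep.

```cuda
#include <cuda_runtime.h>

// Toy kernel: 4 threads walk the padded 8-element array from the example.
__global__ void toySum(int *value)
{
    unsigned int idx = threadIdx.x;
    for (unsigned int offset = 4; offset > 0; offset /= 2) {
        int other = value[idx + offset]; // every thread reads first...
        __syncthreads();
        value[idx] += other;             // ...then every thread writes
        __syncthreads();
    }
}

// Hypothetical host helper: runs the kernel on eight 1s and copies the result back.
void runToySum(int out[8])
{
    int host[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    int *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    toySum<<<1, 4>>>(dev);
    cudaMemcpy(out, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

Calling runToySum(out) should leave 8 7 6 4 1 1 1 1 in out, matching the last line of the walk-through.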

That being said, you need to prevent the last 16 threads of each warp from running, otherwise they compare with values belonging to another warp. So the code should become:

index = (threadIdx.x / 32) * 32; // this value will be the same for all threads in a warp

if (threadIdx.x < index+16) {
  value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+16]);
  value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+8]);
  value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+4]);
  value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+2]);
  value[threadIdx.x] = fminf(value[threadIdx.x], value[threadIdx.x+1]);
}
// at this point value[index] has the intra-warp minimum value
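For reference, here is the same idea as a complete kernel. It is a sketch under assumptions that are not in the snippet above: the block size is a multiple of 32, the code is built with CUDA 9 or later, and __syncwarp() is placed between the read and the write of each step so that it does not rely on the threads of a warp running in lockstep. The names warpMinShared and runWarpMinShared are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define NUM_THREADS_PER_BLOCK 128

// Each warp reduces its own 32 floats of shared memory to a minimum.
// out must hold one float per warp.
__global__ void warpMinShared(const float *in, float *out)
{
    __shared__ float value[NUM_THREADS_PER_BLOCK];

    unsigned int tid   = threadIdx.x;
    unsigned int lane  = tid & 31u;   // position inside the warp
    unsigned int index = tid - lane;  // first thread of this warp
    unsigned int gid   = blockIdx.x * blockDim.x + tid;

    float mine = in[gid];
    value[tid] = mine;
    __syncwarp();

    for (unsigned int offset = 16; offset > 0; offset /= 2) {
        // Only the lower `offset` lanes read, so no read crosses into the next warp.
        float other = (lane < offset) ? value[tid + offset] : mine;
        __syncwarp();                 // all reads finish before anyone writes
        mine = fminf(mine, other);
        value[tid] = mine;
        __syncwarp();                 // writes are visible before the next step reads
    }

    if (lane == 0)
        out[gid / 32] = value[index]; // the warp's minimum
}

// Hypothetical host helper: one block, NUM_THREADS_PER_BLOCK inputs, one output per warp.
void runWarpMinShared(const float *hostIn, float *hostOut)
{
    const int numWarps = NUM_THREADS_PER_BLOCK / 32;
    float *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, NUM_THREADS_PER_BLOCK * sizeof(float));
    cudaMalloc(&devOut, numWarps * sizeof(float));
    cudaMemcpy(devIn, hostIn, NUM_THREADS_PER_BLOCK * sizeof(float), cudaMemcpyHostToDevice);
    warpMinShared<<<1, NUM_THREADS_PER_BLOCK>>>(devIn, devOut);
    cudaMemcpy(hostOut, devOut, numWarps * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}
```

Because the upper lanes reuse their own value instead of reading past the warp, no if around the whole reduction is needed in this variant.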

I tried it with 32 random values (using 32 threads per block) and that code does not seem to work: it does not leave the minimum value in value[0]. I think it must be repeated a few times until there are only 2 elements, or something like that.

fminf returns the minimum value; for the maximum you need to use fmaxf. And if you are running in emulation mode, you will need a __syncthreads() between the steps, since the threads will not actually be running in parallel.

Just check the code in the reduction example, the last kernel. It has EMUSYNC I believe.
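On GPUs and toolkits newer than the ones discussed in this thread (CUDA 9 or later), a different technique can be swapped in: the warp shuffle intrinsic __shfl_down_sync exchanges registers between lanes, so neither shared memory nor explicit synchronization is needed. A minimal sketch, assuming full warps of 32 active threads; warpMin, warpMinShuffle and runWarpMinShuffle are hypothetical names:

```cuda
#include <cuda_runtime.h>

// Minimum over the 32 lanes of a full warp; the result is valid in lane 0.
__device__ float warpMin(float v)
{
    for (int offset = 16; offset > 0; offset /= 2)
        v = fminf(v, __shfl_down_sync(0xffffffffu, v, offset));
    return v;
}

// Writes one minimum per warp; assumes blockDim.x is a multiple of 32.
__global__ void warpMinShuffle(const float *in, float *out)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float m = warpMin(in[gid]);
    if ((threadIdx.x & 31u) == 0)
        out[gid / 32] = m;
}

// Hypothetical host helper: one block of n threads (n a multiple of 32).
void runWarpMinShuffle(const float *hostIn, float *hostOut, int n)
{
    float *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devOut, (n / 32) * sizeof(float));
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    warpMinShuffle<<<1, n>>>(devIn, devOut);
    cudaMemcpy(hostOut, devOut, (n / 32) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}
```

Lanes whose source lane would fall outside the warp get their own value back from the shuffle, so only lane 0 is guaranteed to end up with the minimum of all 32 values.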