How can I find a minimum and broadcast it to all threads?
In one part of my program all threads use the same minimum value from shared memory. Right now every thread iterates over the shared data, which is redundant work: each thread finds the minimum with its own full sweep. Can somebody help me with some simple code?
__shared__ float testData[512];
// shared data updating
…
__syncthreads();
// every thread does the same sequential sweep to find the minimum
float min = testData[0];
for (int i = 0; i < 512; i++){
    if (min > testData[i])
        min = testData[i];
}
Use a shared memory parallel reduction, with one or more warps of threads doing the reduction to find the minimum. You can find some code that should give you some hints how to implement it in this thread (it is performing summation, but that can be easily changed to find a minimum).
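For example, a minimal per-block min reduction could look like the sketch below (this is only a sketch, not the code from that thread; it assumes blockDim.x == 512, one thread per element, and that testData may be overwritten during the reduction):
__shared__ float testData[512];
// ... fill testData as before ...
__syncthreads();
// Tree reduction: halve the number of active threads each step.
// Note: this overwrites testData; copy it to a scratch array first if the
// original values are still needed afterwards.
for (int s = 256; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        testData[threadIdx.x] = fminf(testData[threadIdx.x], testData[threadIdx.x + s]);
    __syncthreads();
}
float blockMin = testData[0];   // every thread now reads the same per-block minimum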
I think my problem is a bit special. What I want is not one global minimum over all of shared memory, but the individual minimum in shared memory for each block.
For example, I have a 160*80 matrix; because of a complex method and serial sweeping, I can define 10*1 threads in my block, line by line. Now what I need is the minimum of these 10 elements… I tried to do it with an inner loop, but I get the wrong results… How can I do it correctly and efficiently?
In the kernel:
__shared__ float C[10];
float minL;
for (int i = 0; i < matrix_H; i++){    // loop over the matrix rows
    C[ix] = d_ccube[matrix_W*i + ix];  // update the 10*1 elements
    __syncthreads();
    minL = C[0];
    for (unsigned int s = 0; s < 10; s++){
        if (minL > C[s])
            minL = C[s];
    }
    ...
The code I linked to does exactly that - it performs a parallel reduction on a shared memory array using one warp of threads. The example is summation (initially partial sums are written to shared memory which were computed from global memory), but the parallel reduction code is run on a per block basis and the result is a per block result:
__syncthreads();
// Use first warp of block to compute parallel reduction
if (threadIdx.x < 32) {
    #pragma unroll
    for (int i = 32; i < TPBR; i += 32)
        buff[threadIdx.x] += buff[threadIdx.x+i];
    buff[threadIdx.x] += buff[threadIdx.x+16];
    buff[threadIdx.x] += buff[threadIdx.x+8];
    buff[threadIdx.x] += buff[threadIdx.x+4];
    buff[threadIdx.x] += buff[threadIdx.x+2];
    buff[threadIdx.x] += buff[threadIdx.x+1];
}
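Changed to find a minimum instead of a sum, the same fragment might look like the following sketch (assuming buff holds floats and TPBR is the threads-per-block constant from the code above; use min() instead of fminf() for integer data):
__syncthreads();
// Use first warp of block to compute a parallel min reduction
if (threadIdx.x < 32) {
    // volatile so each partial result is re-read from shared memory rather than a register;
    // on Volta and newer GPUs also add __syncwarp() between the steps below
    volatile float *v = buff;
    for (int i = 32; i < TPBR; i += 32)
        v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + i]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 16]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 8]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 4]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 2]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 1]);
}
// buff[0] now holds the per-block minimum; after a __syncthreads() every thread can read it.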
Thanks for your code. There is still one question: if my block has two dimensions, how can I make sure that all the threads doing the minimum search are in the first warp?
Anyway, here is my code (changed following your hints), but with it I can't get the right result…
__global__ void testQmin(int *d_data, …){
    /*dim3 ca_threads(4, 80, 1);
    dim3 ca_grid(imgW/4, 1, 1);*/
    const int ix = threadIdx.x;
    const int iz = threadIdx.y;
    const int iy = blockIdx.x;
    int minL;
    __shared__ int L[320];
    for (int a_step = 0; a_step < imgH; a_step++){ // over image
        L[ix*80 + iz] = d_data[imgW*iz + iy*4 + ix - 1];
        __syncthreads();
        if ((ix*80 + iz) < 32) {
            for(int i = 32; i < 320; i += 32){
                if (L[ix*80 + iz] > L[ix*80 + iz + i])
                    L[ix*80 + iz] = L[ix*80 + iz + i];
            }
            if (L[ix*80 + iz] > L[ix*80 + iz + 16])
                L[ix*80 + iz] = L[ix*80 + iz + 16];
            if (L[ix*80 + iz] > L[ix*80 + iz + 8])
                L[ix*80 + iz] = L[ix*80 + iz + 8];
            if (L[ix*80 + iz] > L[ix*80 + iz + 4])
                L[ix*80 + iz] = L[ix*80 + iz + 4];
            if (L[ix*80 + iz] > L[ix*80 + iz + 2])
                L[ix*80 + iz] = L[ix*80 + iz + 2];
            if (L[ix*80 + iz] > L[ix*80 + iz + 1])
                L[ix*80 + iz] = L[ix*80 + iz + 1];
        }
Threads are numbered within a block in column-major order. It is explained in the programming guide, but for a 2D block the within-block thread index is just
tidx = threadIdx.x + blockDim.x * threadIdx.y
and the threads in the first warp are those with 0 <= tidx <= 31. As for your code, you might want to consider using the fminf() function. Apart from being a bit more elegant, it should compile into considerably faster running code.
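Putting both hints together, a minimal sketch for a 4*80 block could look like the following (the kernel name, the d_in/d_blockmin arrays and the contiguous load are made up purely for illustration; only the indexing and the reduction pattern matter here):
__global__ void blockMinExample(const int *d_in, int *d_blockmin)
{
    __shared__ int L[320];                                   // one element per thread of a 4*80 block
    const int tidx = threadIdx.x + blockDim.x * threadIdx.y; // linear within-block index, 0..319
    L[tidx] = d_in[blockIdx.x * 320 + tidx];                 // simple contiguous load, just for illustration
    __syncthreads();
    // Step 1: the first warp folds the 320 values down to 32 partial minima.
    if (tidx < 32) {
        for (int i = tidx + 32; i < 320; i += 32)
            L[tidx] = min(L[tidx], L[i]);
    }
    __syncthreads();
    // Step 2: tree-reduce the remaining 32 values; the explicit __syncthreads()
    // keeps this correct without relying on implicit warp synchronization.
    for (int s = 16; s > 0; s >>= 1) {
        if (tidx < s)
            L[tidx] = min(L[tidx], L[tidx + s]);
        __syncthreads();
    }
    if (tidx == 0)
        d_blockmin[blockIdx.x] = L[0];                       // per-block minimum
}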
I've changed (ix*80 + iz) to tidx in my program, but it still just takes the first value as the minimum from the first four values in the block-x direction… (my block here is 4*80). What could be the problem?
PS: Does a warp run on one SM or on one SP? If it runs on one SM (with 8 SPs), then as I understand it the parallelization could go wrong…