Broadcast for all threads

hi all,

How can I broadcast a minimum-search result to all threads?

In one part of my program, all threads in a block use the same minimum value from shared memory. At the moment every thread iterates over the shared data itself, which is redundant work: a single sweep by one thread would already find the minimum. Can somebody help me with some simple code?

__shared__ float testData[512];
// shared data updated here

__syncthreads();

float min = testData[0];

for (int i = 0; i < 512; i++) {
    if (min > testData[i])   // keep the smaller value
        min = testData[i];
}

Use a shared memory parallel reduction, with one or more warps of threads doing the reduction to find the minimum. You can find some code that should give you some hints how to implement it in this thread (it is performing summation, but that can be easily changed to find a minimum).

thx!

if I dont copy the data from global to shared, maybe is using atomic faster than coping+searching in shared memory?


I measured some performance stats on my GeForce GTX 285 a while back: http://forums.nvidia.com/index.php?showtopic=159217

Random reads directly from global memory: ~250 million per second.

Atomic functions: ~2.5 million per second on average.

NB the first figure is for random reads; with coalesced reads I expect you get a set of 16 values for something like the cost of one random read.


I think my problem is somewhat special here. What I want is not a global minimum over all of shared memory, but the individual minimum in shared memory for each block.

For example, I have a 160*80 matrix. Because of a complex method and serial sweeping, I can define a block of 10*1 threads working line by line. Now what I need is the minimum of these 10 elements. I tried to do it with an inner loop, but I get the wrong results. How can I do it correctly and efficiently?

in Kernel:

__shared__ float C[10];

float minL;

for (int i = 0; i < matrix_H; i++) {       // sweep over the matrix rows

    C[ix] = d_ccube[matrix_W*i + ix];      // update the 10*1 elements

    __syncthreads();

    minL = C[0];

    for (unsigned int s = 0; s < 10; s++) {
        if (minL > C[s])
            minL = C[s];
    }

    ...

}

The code I linked to does exactly that: it performs a parallel reduction on a shared memory array using one warp of threads. The example is summation (partial sums computed from global memory are first written to shared memory), but the reduction runs on a per-block basis and produces a per-block result:

__syncthreads();

	// Use first warp of block to compute parallel reduction
	if (threadIdx.x < 32) {
		#pragma unroll
		for(int i=32; i<TPBR; i+=32) buff[threadIdx.x] += buff[threadIdx.x+i];
		buff[threadIdx.x] += buff[threadIdx.x+16];
		buff[threadIdx.x] += buff[threadIdx.x+8];
		buff[threadIdx.x] += buff[threadIdx.x+4];
		buff[threadIdx.x] += buff[threadIdx.x+2];
		buff[threadIdx.x] += buff[threadIdx.x+1];
	}
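For reference, a sketch of the same per-block reduction adapted to a minimum; it assumes, as above, a shared array `buff` of `TPBR` floats (with `TPBR` a multiple of 32) and at least one full warp per block, and simply replaces each addition with `fminf()`:

```cuda
__syncthreads();

// Use first warp of block to reduce buff[0..TPBR-1] to its minimum.
// buff should be declared volatile for the warp-synchronous steps below.
if (threadIdx.x < 32) {
    #pragma unroll
    for (int i = 32; i < TPBR; i += 32)
        buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + i]);

    buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + 16]);
    buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + 8]);
    buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + 4]);
    buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + 2]);
    buff[threadIdx.x] = fminf(buff[threadIdx.x], buff[threadIdx.x + 1]);
    // buff[0] now holds the per-block minimum
}
```

After this, every thread in the block can read the result from buff[0] (after a __syncthreads()), which gives you the "broadcast" for free.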

hi avidday,

thx for your code. There is still one question: if my block has two dimensions, how can I make sure all the threads doing the minimum search are in the first warp?

Anyway, here is my code (changed following your hints), but I can't get the right result with it…

[codebox]__global__ void testQmin(int *d_data, …){

    /* dim3 ca_threads(4, 80, 1);
       dim3 ca_grid(imgW/4, 1, 1); */

    const int ix = threadIdx.x;
    const int iz = threadIdx.y;
    const int iy = blockIdx.x;

    int minL;
    __shared__ int L[320];

    for (int a_step = 0; a_step < imgH; a_step++){ // over image

        L[ix*80 + iz] = d_data[imgW*iz + iy*4 + ix - 1];

        __syncthreads();

        if ((ix*80 + iz) < 32) {

            for(int i = 32; i < 320; i += 32){
                if (L[ix*80 + iz] > L[ix*80 + iz + i])
                    L[ix*80 + iz] = L[ix*80 + iz + i];
            }

            if (L[ix*80 + iz] > L[ix*80 + iz + 16])
                L[ix*80 + iz] = L[ix*80 + iz + 16];
            if (L[ix*80 + iz] > L[ix*80 + iz + 8])
                L[ix*80 + iz] = L[ix*80 + iz + 8];
            if (L[ix*80 + iz] > L[ix*80 + iz + 4])
                L[ix*80 + iz] = L[ix*80 + iz + 4];
            if (L[ix*80 + iz] > L[ix*80 + iz + 2])
                L[ix*80 + iz] = L[ix*80 + iz + 2];
            if (L[ix*80 + iz] > L[ix*80 + iz + 1])
                L[ix*80 + iz] = L[ix*80 + iz + 1];
        }
    }
}[/codebox]

Threads are numbered within a block with threadIdx.x varying fastest. It is explained in the programming guide, but for a 2D block the flat "block" thread index is just

tidx = threadIdx.x + blockDim.x * threadIdx.y

and the threads in the first warp are those with 0 <= tidx <= 31. As for your code, you might want to consider using the fminf() function. Apart from being a bit more elegant, it should compile into considerably faster running code.

thx!

I've changed (ix*80 + iz) to tidx in my program, but it still just takes the first value as the minimum from the first four values in the block-x direction… (my block here is 4*80). What could the problem be?

PS. Does a warp run on one SM or on one SP? If on one SM (with 8 SPs), the parallelization could go wrong, as I understand it…