How can I find a minimum and broadcast it to all threads?
In one part of my program all threads use the same minimum value from shared memory. Right now every thread iterates over the shared data, which is redundant work: each thread finds the minimum with its own full sweep. Can somebody help me with some simple code?
__shared__ float testData[512];
// shared data updating
…
__syncthreads();
// every thread does the same sequential sweep to find the minimum
float min = testData[0];
for (int i = 0; i < 512; i++){
    if (min > testData[i])
        min = testData[i];
}
Use a shared memory parallel reduction, with one or more warps of threads doing the reduction to find the minimum. You can find some code that should give you some hints how to implement it in this thread (it is performing summation, but that can be easily changed to find a minimum).
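For example, a minimal per-block min reduction could look like the sketch below (this is only a sketch, not the code from that thread; it assumes blockDim.x == 512, one thread per element, and that testData may be overwritten during the reduction):
__shared__ float testData[512];
// ... fill testData as before ...
__syncthreads();
// Tree reduction: halve the number of active threads each step.
// Note: this overwrites testData; copy it to a scratch array first if the
// original values are still needed afterwards.
for (int s = 256; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        testData[threadIdx.x] = fminf(testData[threadIdx.x], testData[threadIdx.x + s]);
    __syncthreads();
}
float blockMin = testData[0];   // every thread now reads the same per-block minimum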
I think my problem is a bit special. What I want is not one global minimum over all of shared memory, but the individual minimum in shared memory for each block.
For example, I have a 160*80 matrix; because of a complex method and serial sweeping, I can define 10*1 threads in my block, line by line. Now what I need is the minimum of these 10 elements… I tried to do it with an inner loop, but I get the wrong results… How can I do it correctly and efficiently?
In the kernel:
__shared__ float C[10];
float minL;
for (int i = 0; i < matrix_H; i++){    // loop over the matrix rows
    C[ix] = d_ccube[matrix_W*i + ix];  // update the 10*1 elements
    __syncthreads();
    minL = C[0];
    for (unsigned int s = 0; s < 10; s++){
        if (minL > C[s])
            minL = C[s];
    }
    ...
The code I linked to does exactly that - it performs a parallel reduction on a shared memory array using one warp of threads. The example is summation (initially partial sums are written to shared memory which were computed from global memory), but the parallel reduction code is run on a per block basis and the result is a per block result:
__syncthreads();
// Use first warp of block to compute parallel reduction
if (threadIdx.x < 32) {
    #pragma unroll
    for (int i = 32; i < TPBR; i += 32)
        buff[threadIdx.x] += buff[threadIdx.x+i];
    buff[threadIdx.x] += buff[threadIdx.x+16];
    buff[threadIdx.x] += buff[threadIdx.x+8];
    buff[threadIdx.x] += buff[threadIdx.x+4];
    buff[threadIdx.x] += buff[threadIdx.x+2];
    buff[threadIdx.x] += buff[threadIdx.x+1];
}
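Changed to find a minimum instead of a sum, the same fragment might look like the following sketch (assuming buff holds floats and TPBR is the threads-per-block constant from the code above; use min() instead of fminf() for integer data):
__syncthreads();
// Use first warp of block to compute a parallel min reduction
if (threadIdx.x < 32) {
    // volatile so each partial result is re-read from shared memory rather than a register;
    // on Volta and newer GPUs also add __syncwarp() between the steps below
    volatile float *v = buff;
    for (int i = 32; i < TPBR; i += 32)
        v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + i]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 16]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 8]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 4]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 2]);
    v[threadIdx.x] = fminf(v[threadIdx.x], v[threadIdx.x + 1]);
}
// buff[0] now holds the per-block minimum; after a __syncthreads() every thread can read it.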
Thanks for your code. There is still one question: if my block has two dimensions, how can I make sure that all the threads doing the minimum search are in the first warp?
Anyway, here is my code (changed following your hints), but with it I can't get the right result…
__global__ void testQmin(int *d_data, …){
    /*dim3 ca_threads(4, 80, 1);
    dim3 ca_grid(imgW/4, 1, 1);*/
    const int ix = threadIdx.x;
    const int iz = threadIdx.y;
    const int iy = blockIdx.x;
    int minL;
    __shared__ int L[320];
    for (int a_step = 0; a_step < imgH; a_step++){ // over image
        L[ix*80 + iz] = d_data[imgW*iz + iy*4 + ix - 1];
        __syncthreads();
        if ((ix*80 + iz) < 32) {
            for(int i = 32; i < 320; i += 32){
                if (L[ix*80 + iz] > L[ix*80 + iz + i])
                    L[ix*80 + iz] = L[ix*80 + iz + i];
            }
            if (L[ix*80 + iz] > L[ix*80 + iz + 16])
                L[ix*80 + iz] = L[ix*80 + iz + 16];
            if (L[ix*80 + iz] > L[ix*80 + iz + 8])
                L[ix*80 + iz] = L[ix*80 + iz + 8];
            if (L[ix*80 + iz] > L[ix*80 + iz + 4])
                L[ix*80 + iz] = L[ix*80 + iz + 4];
            if (L[ix*80 + iz] > L[ix*80 + iz + 2])
                L[ix*80 + iz] = L[ix*80 + iz + 2];
            if (L[ix*80 + iz] > L[ix*80 + iz + 1])
                L[ix*80 + iz] = L[ix*80 + iz + 1];
        }
Threads are numbered within a block in column-major order. It is explained in the programming guide, but for a 2D block the within-block thread index is just
tidx = threadIdx.x + blockDim.x * threadIdx.y
and the threads in the first warp are those with 0 <= tidx <= 31. As for your code, you might want to consider using the fminf() function. Apart from being a bit more elegant, it should compile into considerably faster running code.
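Putting both hints together, a minimal sketch for a 4*80 block could look like the following (the kernel name, the d_in/d_blockmin arrays and the contiguous load are made up purely for illustration; only the indexing and the reduction pattern matter here):
__global__ void blockMinExample(const int *d_in, int *d_blockmin)
{
    __shared__ int L[320];                                   // one element per thread of a 4*80 block
    const int tidx = threadIdx.x + blockDim.x * threadIdx.y; // linear within-block index, 0..319
    L[tidx] = d_in[blockIdx.x * 320 + tidx];                 // simple contiguous load, just for illustration
    __syncthreads();
    // Step 1: the first warp folds the 320 values down to 32 partial minima.
    if (tidx < 32) {
        for (int i = tidx + 32; i < 320; i += 32)
            L[tidx] = min(L[tidx], L[i]);
    }
    __syncthreads();
    // Step 2: tree-reduce the remaining 32 values; the explicit __syncthreads()
    // keeps this correct without relying on implicit warp synchronization.
    for (int s = 16; s > 0; s >>= 1) {
        if (tidx < s)
            L[tidx] = min(L[tidx], L[tidx + s]);
        __syncthreads();
    }
    if (tidx == 0)
        d_blockmin[blockIdx.x] = L[0];                       // per-block minimum
}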
I've changed (ix*80 + iz) to tidx in my program, but it still just takes the first value as the minimum from the first four values in the block-x direction… (my block here is 4*80). What could be the problem?
PS: Does a warp run on one SM or on one SP? If it runs on one SM (with 8 SPs), then as I understand it the parallelization could go wrong…