what is difference between two kind ways of thread assignments?

I have two kind of threads assignment.

The value(dv_val) index is from 0~4095,and every location has different value.

Every thread has the exact positions to reduce the value(dv_val[loc]) until the dv_val[loc] is 0,

The program 0 is as following.

ex1: thread 0 is in charge of location 0,256,512,1024,…3840,

ex2: thread 1 is in charge of location 1,257,513,1025,…3841,

and every thread has 16(4096/256) locations to process.

[codebox]global void

program0(int* dv_val)

{

 int tid = threadIdx.x;

 for(int loc=tid;loc<(4096);loc+=256){

	 while(dv_val[loc]>=0)

      dv_val[loc]--;

 }

}[/codebox]

The program1 has a dv_boss[0](initialized to 0) which distrubutes location to every thread .

When one thread enters the while(1), dv_boss[0] will distribute a location to thread (loc = dv_boss[0]),

thread modifies location to save into boss(boss = i+1),

and the boss distributed new location to next thread next time.

The goal is that make any fast thread process more positions, so every thread has different

number of locations.

[codebox]global void

program1(int* dv_val,int* dv_boss)

{

 int tid = threadIdx.x;

 int loc;

 while(1){

      loc = dv_boss[0];

      dv_boss[0] = loc+1;

      if(loc>4095)break;

while(dv_val[loc]>=0)

      dv_val[loc]--;

 }

}[/codebox]

program0 spends 4.218 ms.

program1 spends 802.943.218 ms.

why does program1 spends more time than program0??

You really should read the programming guide carefully. The second program is full of races, threads are running in bunches called warps (of 32 threads), it will cause your second program to do everything 32 times (probably even more) (and there are many, many more problems with it)

What you are trying to do is reasonable. Presumably in your actual program you have more work to do than just decrementing memory.

But the way you are using dv_boss has a race condition, where multiple threads could load the value and then attempt to assign new conflicting values. Instead you should use atomicAdd().

Another problem has to do with how divergent threads behave. If some threads remain within the ‘while(dv_val[loc]>=0)’ loop then all the threads of the same warp will wait for the loop to finish before they proceed. Meaning if one thread has a lot of processing, then the 31 other threads of the same warp will wait for it to finish before going to get a new ‘boss’ value.

Try something along these lines:

int loc = atomicAdd(dv_boss, 1);

while (loc < 4096) {

	if (dv_val[loc] >= 0) {

		dv_val[loc]--;

	}

	else {

		loc = atomicAdd(dv_boss, 1);

	}

}

It should go without saying that making an atomic trip to memory is much slower than incrementing a register like you were doing with program0. But if the work being done is much greater than the cost of updating loc, then this is not a problem.

Thank you very much.

So this program1 has two problems:

(1)Race condition (all threads access dv_boss).

(2)while loop branch divergence (If some threads remain within loop ,some threads will

wait for the loop to finish.)

solution

(1)using atomicAdd

(2)Jamie K code

[codebox]int loc = atomicAdd(dv_boss, 1);

while (loc < 4096) {

 if (dv_val[loc] >= 0)

      dv_val[loc]--;

 else 

      loc = atomicAdd(dv_boss, 1);

}[/codebox]

I compare the program0, program1, and Jamie K code.

program0 4.218

program1 802.943

Jamie K code 32.619

Jamie K code is better than program1, because solving the two problem.

Jamie K code is worse than program0, because atomic operation willl slow performance.

But I try a new way, which just solves race condition program.

The code is as following:

[codebox]global void

program2(int* dv_val,int* dv_boss)

{

 int tid = threadIdx.x;

 int loc;

 while(1){

      loc = atomicAdd(dv_boss, 1);

      if(loc>4095)break;

while(dv_val[loc]>=0)

      dv_val[loc]--;

 }

}[/codebox]

I compare the program2 with other program.

program0 4.218

program1 802.943

Jamie K code 32.619

program2 5.193

It is reasonable that program2 is bettern than program1, but I don’t know why program2

is better than Jamie K code? And how do I implement my thought (which is faster threads process more locations ) and finally faster than program0?