I have two kinds of thread assignment.
The value array (dv_val) is indexed from 0 to 4095, and every location holds a different value.
Each thread is given specific locations and decrements dv_val[loc] until dv_val[loc] reaches 0.
In program0 the assignment is as follows.
ex1: thread 0 is in charge of locations 0, 256, 512, 768, … 3840,
ex2: thread 1 is in charge of locations 1, 257, 513, 769, … 3841,
and every thread has 16 (4096/256) locations to process.
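[codebox]__global__ void
program0(int* dv_val)
{
    int tid = threadIdx.x;
    // Each thread walks its own fixed locations: tid, tid+256, tid+512, ...
    for (int loc = tid; loc < 4096; loc += 256) {
        // Count this location's value down.
        while (dv_val[loc] >= 0)
            dv_val[loc]--;
    }
}[/codebox]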
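For anyone who wants to reproduce the timing, here is a minimal host-side sketch of how a kernel like program0 could be launched and timed with CUDA events. The launch configuration (one block of 256 threads) is an assumption inferred from the stride of 256, and the initial values (`i % 100`) and the `N` and `THREADS` macros are hypothetical placeholders, not taken from my real data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4096       // number of locations
#define THREADS 256  // assumed block size, matches the stride of 256

__global__ void program0(int* dv_val)
{
    int tid = threadIdx.x;
    // Each thread walks its own fixed locations: tid, tid+256, tid+512, ...
    for (int loc = tid; loc < N; loc += THREADS) {
        while (dv_val[loc] >= 0)
            dv_val[loc]--;
    }
}

int main()
{
    static int hv_val[N];
    for (int i = 0; i < N; ++i)
        hv_val[i] = i % 100;  // hypothetical initial values

    int* dv_val = nullptr;
    cudaMalloc(&dv_val, N * sizeof(int));
    cudaMemcpy(dv_val, hv_val, N * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the kernel, not the copies.
    cudaEventRecord(start);
    program0<<<1, THREADS>>>(dv_val);  // assumed: a single block
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("program0: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dv_val);
    return 0;
}
```

The first launch in a process also includes one-time setup cost, so a warm-up launch before the timed one may give a steadier number.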
Program1 has a dv_boss[0] (initialized to 0) which distributes locations to the threads.
When a thread enters the while(1) loop, it takes its location from dv_boss[0] (loc = dv_boss[0]),
then writes the next location back to the boss (dv_boss[0] = loc+1),
so that the boss hands out a new location to the next thread that asks.
The goal is to let any fast thread process more locations, so each thread ends up with a different
number of locations.
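[codebox]__global__ void
program1(int* dv_val, int* dv_boss)
{
    int tid = threadIdx.x;
    int loc;
    while (1) {
        loc = dv_boss[0];       // take the next location from the boss
        dv_boss[0] = loc + 1;   // store the following location back in the boss
        if (loc > 4095) break;  // no locations left
        // Count this location's value down.
        while (dv_val[loc] >= 0)
            dv_val[loc]--;
    }
}[/codebox]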
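One thing to note about the hand-out step: in program1 the read of dv_boss[0] and the write of loc+1 are two separate, non-atomic operations, so several threads can read the same loc before any of them writes back. For comparison, here is a minimal sketch of the same idea using `atomicAdd`, assuming the intent is that each location is claimed by exactly one thread. The kernel name `program1_atomic`, the single-block launch, and the initial values are hypothetical; I am not claiming anything about its timing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4096       // number of locations
#define THREADS 256  // assumed block size

__global__ void program1_atomic(int* dv_val, int* dv_boss)
{
    while (true) {
        // atomicAdd returns the old counter value and increments it in one
        // indivisible step, so no two threads receive the same location.
        int loc = atomicAdd(&dv_boss[0], 1);
        if (loc >= N) break;  // no locations left
        while (dv_val[loc] >= 0)
            dv_val[loc]--;
    }
}

int main()
{
    static int hv_val[N];
    for (int i = 0; i < N; ++i)
        hv_val[i] = i % 100;  // hypothetical initial values

    int *dv_val = nullptr, *dv_boss = nullptr;
    cudaMalloc(&dv_val, N * sizeof(int));
    cudaMalloc(&dv_boss, sizeof(int));
    cudaMemcpy(dv_val, hv_val, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dv_boss, 0, sizeof(int));  // boss starts at location 0

    program1_atomic<<<1, THREADS>>>(dv_val, dv_boss);  // assumed: one block
    cudaDeviceSynchronize();

    cudaMemcpy(hv_val, dv_val, N * sizeof(int), cudaMemcpyDeviceToHost);

    // The inner loop stops once a value drops below 0, so every processed
    // location should end at -1.
    int bad = 0;
    for (int i = 0; i < N; ++i)
        if (hv_val[i] != -1) ++bad;
    printf("locations not at -1: %d\n", bad);

    cudaFree(dv_val);
    cudaFree(dv_boss);
    return 0;
}
```

Every claim still goes through the single counter dv_boss[0], so the threads serialize on it. Handing out a chunk of several locations per `atomicAdd` would be one way to reduce that traffic, though I have not measured it.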
program0 takes 4.218 ms.
program1 takes 802.943 ms.
Why does program1 take so much more time than program0?