I am implementing a work-distribution mechanism.

I have a nested work list, like `{{w1, w2}, {w3, w4, w5, w6}, {w7}, {w8, w9, w10}}`.

The work distribution mechanism goes as follows:

1. Store the work amount of each sub-list in an array.
2. Compute the prefix sum of that array.
3. Each thread uses binary search on the prefix sums to find its work.

The whole code:

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK_SIZE 32

__device__ int test(int pSum) {
    volatile __shared__ int prefixsum[BLOCK_SIZE];
    int count = 0;
    prefixsum[threadIdx.x] = pSum;
    // Loop until every thread's remaining work drops to zero.
    while (__syncthreads_or(pSum > 0)) {
        // Binary search: find the first entry of prefixsum that is
        // greater than threadIdx.x. The result ends up in min.
        int min = 0;
        int max = BLOCK_SIZE - 1;
        while (min < max) {
            int mid = (min + max) / 2;
            if (prefixsum[mid] <= threadIdx.x)
                min = mid + 1;
            else
                max = mid;
        }
        __syncthreads();
        // If the target entry still holds at least BLOCK_SIZE work items
        // and every entry before it is exhausted, all threads should
        // take this branch together.
        if (prefixsum[min] >= BLOCK_SIZE && (min == 0 || prefixsum[min-1] == 0)) {
            count++;
            if (pSum > 0)
                pSum -= BLOCK_SIZE;
            prefixsum[threadIdx.x] = pSum;
            __syncthreads();
        }
        else {
            if (pSum >= BLOCK_SIZE)
                pSum -= BLOCK_SIZE;
            else
                pSum = 0;
            prefixsum[threadIdx.x] = pSum;
            __syncthreads();
        }
    }
    return count;
}

__device__ int D_Prefixsum[BLOCK_SIZE];
__device__ int D_Counts[BLOCK_SIZE];

__global__ void gpuMain() {
    D_Counts[threadIdx.x] = test(D_Prefixsum[threadIdx.x]);
}

void checkErr(cudaError_t err) {
    if (err != cudaSuccess) {
        printf("cuda error: %s\n", cudaGetErrorString(err));
        exit(1);
    }
}

int H_Prefixsum[BLOCK_SIZE];
int H_Counts[BLOCK_SIZE];

int main(int argc, char *argv[]) {
    srand(time(NULL));
    // Build a random, non-decreasing prefix-sum array on the host.
    int sum = 0;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        sum += rand() % (BLOCK_SIZE * 5);
        H_Prefixsum[i] = sum;
    }
    checkErr(cudaMemcpyToSymbol(D_Prefixsum, H_Prefixsum, sizeof(int) * BLOCK_SIZE));
    gpuMain<<<1, BLOCK_SIZE>>>();
    checkErr(cudaDeviceSynchronize());
    checkErr(cudaMemcpyFromSymbol(H_Counts, D_Counts, sizeof(int) * BLOCK_SIZE));
    for (int i = 0; i < BLOCK_SIZE; i++) {
        printf("%d ", H_Counts[i]);
    }
    printf("\n");
    return 0;
}
```

Each thread calls test(), and *pSum* is that thread's prefix-sum entry.

The binary search is the inner `while (min < max)` loop; its result ends up in the variable *min*, which I will call the target entry.

After the binary search, each thread updates its prefix-sum entry (to remove the finished work).

The problem is that the control flow should not diverge at the `if` following the binary search (the `prefixsum[min] >= BLOCK_SIZE && ...` test).

This is because if the first nonzero entry of *prefixsum* is at least BLOCK_SIZE, then all threads must find the same target entry and take the same branch.

So *count* should be the same among all threads.

However, when I run the code on a GTX 750 Ti, I get different *count* values among threads.

The CUDA version is 6.5.

I also tried it on a GTX 770, and there the code runs as expected (the same *count* among all threads).

Moreover, after removing the two `__syncthreads()` calls inside the if/else branches, I get the correct result on the GTX 750 Ti.

So is this a bug, or am I misusing something?