I have the code structure below, which uses cooperative groups.
The number of thread blocks is smaller than the total computation size, so the code uses a grid-stride loop to cover the entire problem.
Each block performs a different number of computations (size).
When the total number of blocks of computation (totalNumberOfLs) is a multiple of the number of thread blocks launched, the program runs without error.
However, when it is not a multiple of the number of thread blocks launched, the execution returns garbage results.
All the threads in a block execute the same computations, so what is the mistake here?
    while (timeStep < limit) {
        for (int l = blockIdx.x; l < totalNumberOfLs; l += gridDim.x) {
            int size;
            if (threadIdx.x == 0) {
                size = sizeStruct[l];
            }
            if (threadIdx.x < size) {
                //load to shared memory
            }
            __syncthreads();
            if (threadIdx.x < size) {
                //computation
            }
        }
        coopGrid.sync();
        for (int l2 = blockIdx.x; l2 < totalNumberOfLs; l2 += gridDim.x) {
            if (threadIdx.x == 0) {
                //load to shared memory
            }
            __syncthreads();
            if (threadIdx.x == 0) {
                //computation
            }
        }
        coopGrid.sync();
        timeStep++;
    }
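
Two things in the structure above look suspicious, and the sketch below shows one possible way to rule them out. First, `size` is a per-thread local that only thread 0 assigns, so every other thread compares `threadIdx.x` against an uninitialized value. Second, nothing separates the computation of one grid-stride iteration from the shared-memory load of the next, which only matters for blocks that run more than one iteration. This is not confirmed as the cause of the garbage results. The kernel body, `kMaxSize`, `sharedSize`, `tile`, `data`, and `runDemo` are all hypothetical stand-ins, and the example assumes a device that supports cooperative launch.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int kMaxSize = 64;  // hypothetical upper bound on sizeStruct[l]

// Hypothetical stand-in for the real work: each time step adds 1 to the
// first sizeStruct[l] elements of work item l.
__global__ void stepKernel(const int* sizeStruct, float* data,
                           int totalNumberOfLs, int limit) {
    cg::grid_group coopGrid = cg::this_grid();
    __shared__ int sharedSize;        // size broadcast from thread 0
    __shared__ float tile[kMaxSize];  // per-block staging buffer

    for (int timeStep = 0; timeStep < limit; ++timeStep) {
        for (int l = blockIdx.x; l < totalNumberOfLs; l += gridDim.x) {
            if (threadIdx.x == 0) sharedSize = sizeStruct[l];
            __syncthreads();              // make sharedSize visible to the block
            const int size = sharedSize;  // every thread now has a defined size

            if (threadIdx.x < size)       // load to shared memory
                tile[threadIdx.x] = data[l * kMaxSize + threadIdx.x];
            __syncthreads();
            if (threadIdx.x < size)       // computation
                data[l * kMaxSize + threadIdx.x] = tile[threadIdx.x] + 1.0f;
            __syncthreads();  // finish with tile/sharedSize before the next l reuses them
        }
        coopGrid.sync();  // grid-wide barrier between time steps
    }
}

// Launches with a grid size that does NOT divide totalNumberOfLs and checks the result.
bool runDemo() {
    int totalNumberOfLs = 7, numBlocks = 3, limit = 4;
    std::vector<int> hSizes(totalNumberOfLs);
    for (int l = 0; l < totalNumberOfLs; ++l) hSizes[l] = 1 + (l * 5) % kMaxSize;

    int* dSizes = nullptr;
    float* dData = nullptr;
    const size_t dataBytes = sizeof(float) * totalNumberOfLs * kMaxSize;
    if (cudaMalloc(&dSizes, sizeof(int) * totalNumberOfLs) != cudaSuccess) return false;
    if (cudaMalloc(&dData, dataBytes) != cudaSuccess) return false;
    cudaMemcpy(dSizes, hSizes.data(), sizeof(int) * totalNumberOfLs,
               cudaMemcpyHostToDevice);
    cudaMemset(dData, 0, dataBytes);

    // Grid-wide sync requires a cooperative launch rather than <<<...>>>.
    void* args[] = {&dSizes, &dData, &totalNumberOfLs, &limit};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)stepKernel, dim3(numBlocks), dim3(kMaxSize), args, 0, nullptr);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<float> hData(totalNumberOfLs * kMaxSize);
    cudaMemcpy(hData.data(), dData, dataBytes, cudaMemcpyDeviceToHost);
    cudaFree(dSizes);
    cudaFree(dData);
    if (err != cudaSuccess) return false;

    // Elements below size were incremented once per time step; the rest stay 0.
    for (int l = 0; l < totalNumberOfLs; ++l)
        for (int i = 0; i < kMaxSize; ++i) {
            const float expected = (i < hSizes[l]) ? float(limit) : 0.0f;
            if (hData[l * kMaxSize + i] != expected) return false;
        }
    return true;
}

int main() { return runDemo() ? 0 : 1; }
```

A simpler variant, if the extra global reads are acceptable, would be to have every thread read `sizeStruct[l]` directly instead of broadcasting it through shared memory. In both variants the grid-stride loop condition depends only on `blockIdx.x`, so all threads of a block run the same number of iterations and the `__syncthreads()` calls inside the loop stay uniform. Older CUDA toolkits may also need relocatable device code (`-rdc=true`) for grid-wide synchronization.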