Hey,
The program takes a search string, a number of blocks, and a number of threads per block, and searches a file for the string in parallel. As originally written for us, every thread writes its result to global memory and the host then iterates over all of the results. Our task is to perform a shared memory reduction within each thread block and then add up the per-block results globally to save time. Rather than post the entire file, I'll post the snippets I added to do the reduction. I left the global memory writes in for now so I could compare the two totals. The total from global memory is always correct, but the total from the shared memory reduction varies with the number of blocks and threads. For example, it works for 4 blocks with 4 threads per block but not for 4 blocks with 16 threads per block.
// At the top
extern __shared__ int sdata[];

// After the number of matches has been found
matchArray[idx] = numMatches;
sdata[tid] = numMatches;
__syncthreads();

for (unsigned int s = 1; s < blockDim.x; s *= 2)
{
    if (tid % (2 * s) == 0)
    {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

if (tid == 0)
{
    blockArray[blockIdx.x] = sdata[0];
}
After this reduction, two separate loops on the host add up the values in each array:
cudaMemcpy(hostMatchArray, matchArray, sizeof(int) * numThreads, cudaMemcpyDeviceToHost);
cudaMemcpy(hostBlockArray, blockArray, sizeof(int) * nBlocks, cudaMemcpyDeviceToHost);

int total = 0;
int total2 = 0;

for (i = 0; i < numThreads; i++)
{
    total += hostMatchArray[i];
}

for (i = 0; i < nBlocks; i++)
{
    total2 += hostBlockArray[i];
}
For 4 blocks and 16 threads per block I get 7 matches from the global memory total (which is correct) but 3 from the per-block results added up. I assume I'm doing something wrong that should be easy to see, but I'm missing it. I also tried simply having the thread with tid == 0 iterate over shared memory, but got exactly the same results as the reduction. If I need to post the entire file, I can. Any help is greatly appreciated.
Edit: It only seems to break for power-of-2 thread counts per block greater than 8. I've tried various combinations of blocks and threads per block using 2, 4, and 8, and all of them work. But any number of blocks with 16 or 32 threads per block gives me an incorrect number of matches.
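For reference, the pattern described above (a per-block shared memory reduction followed by a host-side sum of the per-block results) can be sketched in a self-contained form. This is only a sketch under stated assumptions, not the original file: the kernel name `blockReduceSketch`, the host helper `sumBlockCounts`, and the stand-in count of 1 per thread are all hypothetical. It also adds a bounds check that the snippets above do not have. One thing it makes explicit is that an `extern __shared__` array gets its size from the third kernel launch parameter; the snippets above do not show how the original kernel is launched.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread contributes a stand-in count of 1
// (the real kernel would store its own numMatches here).
__global__ void blockReduceSketch(int *blockArray)
{
    extern __shared__ int sdata[];   // size comes from the third launch parameter
    unsigned int tid = threadIdx.x;

    sdata[tid] = 1;
    __syncthreads();

    // Interleaved-addressing reduction, same shape as the loop in the post.
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        // Added bounds check (not in the original snippet) so that
        // non-power-of-two block sizes never read past the array.
        if (tid % (2 * s) == 0 && tid + s < blockDim.x)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 publishes the block's total.
    if (tid == 0)
    {
        blockArray[blockIdx.x] = sdata[0];
    }
}

// Hypothetical host helper: launches the kernel and sums the per-block
// results. Returns -1 if any CUDA call fails.
int sumBlockCounts(int nBlocks, int threadsPerBlock)
{
    int *blockArray = nullptr;
    if (cudaMalloc(&blockArray, sizeof(int) * nBlocks) != cudaSuccess)
    {
        return -1;
    }

    // One int of dynamic shared memory per thread in the block.
    size_t sharedBytes = sizeof(int) * threadsPerBlock;
    blockReduceSketch<<<nBlocks, threadsPerBlock, sharedBytes>>>(blockArray);

    std::vector<int> hostBlockArray(nBlocks);
    cudaError_t err = cudaMemcpy(hostBlockArray.data(), blockArray,
                                 sizeof(int) * nBlocks, cudaMemcpyDeviceToHost);
    cudaFree(blockArray);
    if (err != cudaSuccess)
    {
        return -1;
    }

    int total = 0;
    for (int v : hostBlockArray)
    {
        total += v;
    }
    return total;
}
```

With this sketch, `sumBlockCounts(4, 16)` should come out to 64 on a working device, since each of the 64 threads contributes exactly 1. Comparing that against the same call with a smaller shared memory size in the launch is one way to check whether the launch configuration matters for the mismatch described above.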