Hi there,
I am trying to get my head around the example ‘reduction.cpp’, and there is one issue I don’t understand.
To be specific, I will refer to the ‘reduce2’ implementation.
I will be working with large arrays, so I will have to invoke the kernel more than twice (i.e. the grid dimension on the first iteration is larger than the maximum number of threads per block). Otherwise I wouldn’t have a problem.
During the recursion, each kernel launch should take the previous array of intermediate sums as ‘g_idata’ and generate a new, reduced array of gridDim.x values:
…
__syncthreads();
…
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
// This block finished so write to output array, may as well do this for threadIdx.x==0
…
The kernel should then be invoked repeatedly until only one block is left.
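To make the number of launches concrete, here is a minimal host-side sketch in plain C++. It is not the sample's code: the name countPasses is my own, and it assumes that every pass uses the same block size and that each block reduces one element per thread, as I understand ‘reduce2’ to do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of reduction.cpp): counts how many kernel
// launches are needed to reduce n elements to a single value, assuming each
// block of `threads` threads reduces `threads` consecutive elements.
std::size_t countPasses(std::size_t n, std::size_t threads) {
    assert(threads >= 2);  // with fewer threads the element count never shrinks
    std::size_t passes = 0;
    while (n > 1) {
        n = (n + threads - 1) / threads;  // one partial sum per block
        ++passes;
    }
    return passes;
}
```

Under these assumptions, 2^20 elements with 256 threads per block shrink to 4096, then 16, then 1 value, so three launches are needed, and the second launch already runs more than one block.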
What I don’t understand is how one can use a single array for the intermediate values.
To the best of my knowledge, the execution order of each block in the grid is undefined.
If one uses the same array pointer for input and output, and overwrites the first gridDim.x elements, isn’t it possible that other blocks (with a lower blockIdx.x) will no longer get the right values?
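To illustrate the scenario I have in mind, here is a small host-side model in plain C++. It is only a sketch: inPlacePass and its parameters are hypothetical names of mine, the blocks are run one after another in a chosen order rather than concurrently, and I assume block b reads the `threads` consecutive elements starting at b*threads before writing its sum to element b.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of one in-place pass (hypothetical, not the sample's code).
// Block b sums data[b*threads .. (b+1)*threads) and then writes the result to
// data[b], mirroring g_odata[blockIdx.x] = sdata[0] with g_idata == g_odata.
// `order` lists the block indices in the order in which they are run.
std::vector<int> inPlacePass(std::vector<int> data, std::size_t threads,
                             const std::vector<std::size_t>& order) {
    assert(threads > 0);
    const std::size_t n = data.size();
    for (std::size_t b : order) {
        int sum = 0;
        for (std::size_t i = b * threads; i < (b + 1) * threads && i < n; ++i)
            sum += data[i];   // load phase: reads the shared input/output array
        data[b] = sum;        // store phase: may clobber another block's input
    }
    data.resize(order.size());  // keep only the one-per-block partial sums
    return data;
}
```

With eight ones and threads = 4, running the blocks in the order 0, 1 gives {4, 4}, but the order 1, 0 gives {7, 4}, because block 1 has already overwritten data[1] by the time block 0 reads it.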
Looking at the call in reduction.cpp, I see in the loop:
…
reduce(s, threads, blocks, kernel, d_odata, d_odata);
…
so the same pointer is used for input and output. Can someone please help me understand why this does not cause the problem I mentioned above? Or did I misunderstand, and the blocks actually run in sequence? Sorry if I am just not seeing something simple.
Many thanks, MT