reduction.cpp - block execution order

Hi there,

I am trying to get my head around the example ‘reduction.cpp’, and there is one issue I don’t understand.
To be specific, I will refer to the ‘reduce2’ implementation.

I will be working with large arrays, so I have to invoke the kernel more than twice (i.e. the grid dimension on the first invocation is larger than the maximum number of threads per block, so the second pass still leaves more than one partial sum). Otherwise I wouldn’t have a problem.
On each iteration of the recursion, one passes the previous array of intermediate sums in as ‘g_idata’ and generates a new, reduced array of gridDim.x values:


__syncthreads();

if (tid == 0) g_odata[blockIdx.x] = sdata[0];
// This block finished so write to output array, may as well do this for threadIdx.x==0
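(For reference, here is roughly the shape of the reduce2-style kernel I mean; this is a simplified, non-templated sketch from memory, so details may differ from the actual SDK source:)

__global__ void reduce2_like(const float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread loads one element from global to shared memory
    sdata[tid] = (i < n) ? g_idata[i] : 0.0f;
    __syncthreads();

    // tree-style reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // this block is finished, so write its result to the output array
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}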

The kernel should then be invoked repeatedly until only one block is left.

What I don’t understand is how one can use a single array for the intermediate values.

To the best of my knowledge, the execution order of each block in the grid is undefined.

If one uses the same array pointer for input and output, and overwrites the first gridDim.x elements, isn’t it possible that other blocks (with a lower blockIdx.x) will no longer get the right values?

Looking at the call in reduction.cpp, I see in the loop:


reduce(s, threads, blocks, kernel, d_odata, d_odata);

so the same pointer is used for input and output. Can someone please help me understand why this does not cause the problem I mentioned above? Or did I misunderstand, and the blocks actually run in sequence? Sorry if I am just not seeing something simple.

Many thanks, MT

We’ve written our own version of this that finds not only the sum but also the min and max of an array. The key is that each thread block saves only one set of results, regardless of how many threads it contains, and that result is written to a global array element arrayptr[blockIdx.y][blockIdx.x]. That way, no two thread blocks write to the same location (remember, only one thread per block, the one with threadIdx.y == 0 and threadIdx.x == 0, writes its results back to global memory).

The second kernel call then loads that same array, except there is now only one block (gridDim.x = 1, gridDim.y = 1), and its thread dimensions are the same as the grid dimensions (number of blocks) of the previous pass. The intermediate results are now loaded as arrayptr[threadIdx.y][threadIdx.x]. This gets quite confusing, and the first pass has to loop over the entire array being processed with a small number of thread blocks (which, again, defines the number of threads in the next pass, which has only one thread block).
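To make the indexing concrete, here is a stripped-down sketch of the two passes (sum only, a 16 x 16 grid of 16 x 16 blocks, with illustrative names such as partial; our real code also tracks min/max and loops over the rest of the large array, which I gloss over here):

#define BX 16  // block side length; also the grid side length in this sketch

// pass 1: launched as a (BX x BX) grid of (BX x BX) blocks;
// assumes width == BX*BX for brevity, so every pixel is covered once
__global__ void pass1(const float *in, float *partial, int width)
{
    __shared__ float sdata[BX * BX];
    int t = threadIdx.y * BX + threadIdx.x;

    int x = blockIdx.x * BX + threadIdx.x;
    int y = blockIdx.y * BX + threadIdx.y;
    sdata[t] = in[y * width + x];
    __syncthreads();

    // tree reduction within the block, as in the reduction example
    for (int s = (BX * BX) / 2; s > 0; s >>= 1) {
        if (t < s) sdata[t] += sdata[t + s];
        __syncthreads();
    }

    // exactly one value per block, written to a unique location
    if (t == 0) partial[blockIdx.y * gridDim.x + blockIdx.x] = sdata[0];
}

// pass 2: launched as a single (BX x BX) block, i.e. thread dimensions
// equal to the grid dimensions of pass 1
__global__ void pass2(const float *partial, float *out)
{
    __shared__ float sdata[BX * BX];
    int t = threadIdx.y * BX + threadIdx.x;

    sdata[t] = partial[t];
    __syncthreads();

    for (int s = (BX * BX) / 2; s > 0; s >>= 1) {
        if (t < s) sdata[t] += sdata[t + s];
        __syncthreads();
    }

    if (t == 0) out[0] = sdata[0];
}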

Dear Matt,

many thanks for your reply.
However, you still seem to be describing the case where the reduction is complete after two loops (i.e. the intermediate array fits into one block after the first loop).

Given that one reduces the array length by a factor of nTPB (#threads per block, say 512) in each loop, this appears to place an upper bound of N = 512*512 = 262,144 input elements on what can be handled in two loops.

For longer arrays, one must iterate the loop more than twice.
If one reuses a single intermediate array several times, overwriting a portion of the lowest-index elements each time, it seems there is a chance of overwriting elements before they are used: e.g. the block with blockIdx.x == 1 executes before the block with blockIdx.x == 0 and overwrites the array element at position 1 before block 0 has copied the previous value into its shared-memory array.

I only raise this issue, as the example in reduction.cpp does seem to reuse a single array.

In any case, I have now written a version which allocates up to two intermediate arrays (of sizes N/nTPB and N/(nTPB*nTPB)) and swaps the pointers between subsequent kernel calls (avoiding the array copies I was wary of). I use a couple of temporary pointer variables and a not-so-beautiful set of if-statements to handle passing in the initial array, toggling between the intermediate arrays, and writing the final answer into a one-element output array.
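The host side looks roughly like this (a simplified sketch with illustrative names, calling a reduce2-style kernel with signature (in, out, n) like the one sketched in my first post; my real code is tangled up with the GPUmat/MEX interface):

// reduce N floats in d_in down to a single value in d_out[0],
// ping-ponging between two intermediate buffers d_tmp[0] and d_tmp[1]
void reduce_all(const float *d_in, float *d_out, unsigned int N,
                float *d_tmp[2], unsigned int nTPB)
{
    const float *src = d_in;
    int toggle = 0;

    while (N > 1) {
        unsigned int blocks = (N + nTPB - 1) / nTPB;
        float *dst = (blocks == 1) ? d_out : d_tmp[toggle];

        reduce2_like<<<blocks, nTPB, nTPB * sizeof(float)>>>(src, dst, N);

        src = dst;      // this pass's output is the next pass's input
        toggle ^= 1;    // swap intermediate buffers between passes
        N = blocks;     // one partial sum per block remains
    }
}

The last pass writes straight into the one-element output array, so input and output never alias and no extra copies are needed.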

Assuming my point here is valid, I can post the code if anyone cares to see it. (I would have done so already, but the code is part of a non-standard setup for GPUmat and a Matlab MEX file.)

Still, I notice you are using 2D block indices (unlike the reduction.cpp example). Does this provide an elegant solution for arrays longer than nTPB*nTPB?

Regards, MT

Actually, the reason we wrote our version was that we needed to operate on images much larger than the allowed number of threads/block and blocks/grid (say a 40k x 40k pixel array). You can only launch a maximum of 65535 thread blocks per grid dimension, and if we had, say, 16 x 16 thread blocks, that's 256 x 65535 = 16.8 M pixels. Even at the maximum of 1024 threads/block (on a 2.x device), that only gets you to about 67 M pixels. In order to handle the 100M+ pixels we had, we re-arranged how the operation was done, so that it only requires two passes, regardless of the data size.

Our method actually uses very few blocks, say a 16 x 16 grid of blocks. This is mostly driven by the thread-block requirements, which work well for 16 x 16 arrangements, and the grid and block dimensions have to be the same size because of the interim results. Assuming these dimensions (just as an example; we've made it variable in the actual code):

A thread figures out its position in the large array, and loads that pixel’s value:
threadIdx[0,0], blockIdx[0,0] = large array[0,0]
threadIdx[15,15], blockIdx[0,0] = large array[15,15]
threadIdx[0,0], blockIdx[0,1] = large array[0,16]
threadIdx[15,15], blockIdx[15,15] = large array[255,255]
Now we loop over the X and Y dimensions in each thread, incrementing the position in the large array by blockDim.x*gridDim.x in the X direction and blockDim.y*gridDim.y in the Y direction. At each point we add that pixel's value to the current thread's running sum, and we keep going until we reach the end of the large array in each dimension. At that point the entire sum of the big array is spread across the individual threads (65536 of them in this case). We then 'reduce' the problem within each thread block just like the example does, and save these per-block intermediate results back to global memory. Finally, we call the second pass and do the final reduction from one thread block down to just one thread.
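In code, the first pass then looks roughly like this (again only a sketch of the sum, with illustrative names; it replaces the single load in the earlier pass1 sketch with a strided accumulation loop):

// pass 1 with the serial accumulation loop: the grid only covers the
// top-left (gridDim*blockDim) tile, and each thread strides across the
// rest of the (width x height) image, summing as it goes
__global__ void pass1_strided(const float *in, float *partial,
                              int width, int height)
{
    extern __shared__ float sdata[];
    int t = threadIdx.y * blockDim.x + threadIdx.x;

    int x0 = blockIdx.x * blockDim.x + threadIdx.x;
    int y0 = blockIdx.y * blockDim.y + threadIdx.y;

    // serial accumulation, striding by blockDim*gridDim in each dimension
    float sum = 0.0f;
    for (int y = y0; y < height; y += blockDim.y * gridDim.y)
        for (int x = x0; x < width; x += blockDim.x * gridDim.x)
            sum += in[y * width + x];

    sdata[t] = sum;
    __syncthreads();

    // then reduce within the block as before
    for (int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1) {
        if (t < s) sdata[t] += sdata[t + s];
        __syncthreads();
    }

    if (t == 0) partial[blockIdx.y * gridDim.x + blockIdx.x] = sdata[0];
}

With a 16 x 16 grid of 16 x 16 blocks this would be launched as pass1_strided<<<dim3(16,16), dim3(16,16), 16*16*sizeof(float)>>>(in, partial, width, height), followed by the single-block second pass to collapse the 256 partial sums.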

Hi Matt,

thanks for taking the time to follow up my query again. I guess I understand now how you implemented your code, the key phrase being “we re-arranged how the operation was done”.

If I understand correctly, you have added an additional for-loop in the kernel, that performs a serial-add, before reducing the sum for a given block.
To be explicit (and flattening everything down to 1D block and grid dimensions): if your large array has N elements, and you choose a grid size gridDim and block size blockDim, each thread first sums up a set of Npresum = N/(gridDim*blockDim) elements (striding across the large array in global memory in steps of gridDim*blockDim).

By choosing blockDim >= gridDim, you are assured that the intermediate-sum array for the second kernel invocation fits into a single block.
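(As an illustrative example with made-up numbers: for N = 10^8, gridDim = 256 and blockDim = 512, each thread pre-sums Npresum ≈ 10^8/(256*512) ≈ 763 elements, the first pass leaves 256 partial sums, and since 256 <= 512 those fit into a single block on the second pass.)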

So, when you say “A thread figures out its position in the large array, and loads that pixel’s value”, would it be more precise to say
“A thread first figures out its starting position in the large array and performs a (2D) for-loop, adding up a sparsely distributed set of elements from the large array…”?

I hadn’t thought of this possibility, mostly because I was stuck on the idea from the example codes that all array elements are first copied into a shared-memory array before any adds take place, and that a given thread block only works on a contiguous set of data in the large array.

But for sure, I see that without adding this additional for-loop (either here, or outside the kernel), one would always be limited to <100 M elements.

Can you point to any of the SDK examples where this kind of thing is illustrated?

Kind regards, MT

P.S. I still don’t know if there is a bug in reduction.cpp, as per the OP!

Just read on to the reduction #7 example in the reduction whitepaper.