REDUCE algorithm

Hey people, once again. I'm still in the early research stage with CUDA.
I've looked into some of the examples provided on the website, and one really useful one caught my eye:

it's the simple reduce example.

The code is very nice, compact, and above all fast, with its Θ(log N) depth, which makes it desirable to use in my future programs. (Although it only works on arrays whose size is a power of two, i.e. 2^x for x = 0, 1, 2, …, there are a lot of workarounds for this issue.)
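The sample itself is a CUDA kernel, but the tree-reduction idea behind that Θ(log N) bound can be sketched sequentially on the CPU (the function name here is my own, not from the sample):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of the tree reduction the CUDA sample performs in shared
// memory: each step halves the number of active elements, so a
// power-of-two size N finishes in log2(N) steps. On the GPU the inner
// loop runs in parallel across threads, which is where Theta(log N) comes
// from; here it is serialized for clarity.
int treeReduce(std::vector<int> data) {
    std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0); // power-of-two sizes only
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            data[i] += data[i + stride]; // pairwise partial sums
        }
    }
    return data[0]; // the final sum ends up in element 0
}
```

The power-of-two assert mirrors the sample's restriction: each halving step assumes every element has a partner at `i + stride`.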

One thing I still don't get, though:
I went a little further and tweaked the code to work with 2D arrays, producing an output array that holds the sum of each row's elements (i.e. row Y of the input produces output[Y]).
I adjusted the code a bit, but it produces incorrect results: more precisely, it skips all the odd-numbered rows of my 2D array.

Pseudo code:

  1. pick the row
  2. throw it into shared memory
  3. reduce until you get the desired sum
  4. write the result to the output array at the index corresponding to the row
  5. repeat until you run out of rows
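The steps above can be modeled on the CPU like this (all names are made up for illustration, and this is a sequential sketch of what each block would do, not the actual kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the per-row reduction: each "block" handles one row,
// copies it into a local buffer (standing in for shared memory),
// tree-reduces it, and writes the row sum at the row's index in the
// output. As with the 1D sample, the row width must be a power of two.
std::vector<int> rowReduce(const std::vector<std::vector<int>>& input) {
    std::vector<int> output;
    for (const auto& row : input) {      // step 1: pick the row
        std::vector<int> shared = row;   // step 2: copy into "shared memory"
        std::size_t w = shared.size();
        assert(w > 0 && (w & (w - 1)) == 0);
        for (std::size_t stride = w / 2; stride > 0; stride /= 2) {
            for (std::size_t i = 0; i < stride; ++i)
                shared[i] += shared[i + stride]; // step 3: reduce to the row sum
        }
        output.push_back(shared[0]);     // step 4: store at the row's index
    }                                    // step 5: repeat for every row
    return output;
}
```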

The kernel invocation is:
dim3 dimB (W); // where W is the width of the 2D array
dim3 dimG (H/dimB.x); // where H is the height of the 2D array
// therefore
reduce <<<dimG,dimB>>> (device1, device2, W);

Does anyone have any idea why my kernel is not behaving correctly?

When I started debugging, the data in shared memory seemed to be inconsistent, skipping every odd row for some reason.

I'm really sorry to bump this thread, but after looking more closely at the original reduce function on the CPU with the same properties, I've located the mistake, and now it works perfectly!

CUDA is fun :)