Erroneous cudppMultiScan() output when rowPitch is not a power of two

Hello,

I have been using the cudppMultiScan() function from the CUDPP library. However, I cannot get the correct result when I use a rowPitch value that is not a power of two.

I have written some simple code to demonstrate this problem. For example, if the dimension is 6, I get the following output:

1	2	3	4	1	2
3	4	1	2	3	4
1	2	3	4	1	2
3	4	1	2	3	4
5	6	3	4	5	6
7	8	1	2	3	4

As can be seen, the second and subsequent scans start too early in the array each time. If I run this code multiple times, the numbers change from run to run, which suggests a race condition on the write-back. From experimenting with the parameters, I suspect the problem lies in the rowPitch parameter: for example, adding 1 to it changes nothing, but adding 2 shifts the scan start position by 4. I am currently using the array width as the pitch.
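To make the expected result and the suspected failure concrete, here is a host-only sketch of a multi-row inclusive sum scan over a linear buffer. It is not CUDPP code. The function names and the rowStart parameter are hypothetical, and the idea that the effective row start is the pitch rounded down to a multiple of 4 is only an assumption inferred from the observations above (adding 1 to the pitch changes nothing, adding 2 shifts the start by 4). The model also writes rows in order, whereas on the GPU the overlapping writes would race.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of a multi-row inclusive sum scan over a linear buffer.
// Row r is read from and written to offset r * rowStart.
// NOT CUDPP code: rowStart is a hypothetical parameter used for illustration.
std::vector<int> modelMultiScan(const std::vector<int>& in,
                                std::size_t width, std::size_t rows,
                                std::size_t rowStart)
{
    assert(rowStart > 0);
    std::vector<int> out(in.size(), 0);
    for (std::size_t r = 0; r < rows; ++r) {
        int sum = 0;
        for (std::size_t c = 0; c < width; ++c) {
            std::size_t idx = r * rowStart + c;
            if (idx >= in.size()) break;   // stay inside the buffer
            sum += in[idx];
            out[idx] = sum;                // later rows overwrite earlier ones
        }
    }
    return out;
}

// ASSUMPTION (unconfirmed): the effective row start is the pitch
// rounded down to a multiple of 4 elements.
std::size_t roundDownTo4(std::size_t pitch)
{
    return (pitch / 4) * 4;
}
```

With a 6 x 6 array of ones and rowStart = 6, every row comes out as 1 2 3 4 5 6, which is the expected result. With rowStart = roundDownTo4(6) = 4, the first 26 elements are 1 2 3 4 repeated six times followed by 5 6, which matches the first 26 values of the output above. The model never writes the remaining elements.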

I am using linear memory to store the array because I need to perform a series of transposes on non-square arrays. I therefore cannot allocate 2D memory, because the pitches of the source and transposed arrays may not match. The online documentation advises using 2D memory but does not state that it is required.
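For reference, a pitch can also be chosen by hand while staying in linear memory, so that an array and its transpose each get their own pitch computed by the same rule. The sketch below shows only the index arithmetic. The helper names are hypothetical, and the alignment of 4 elements used in the note after it is an assumption for illustration, not a documented CUDPP requirement.

```cpp
#include <cassert>
#include <cstddef>

// Round a row width (in elements) up to the next multiple of `align`.
std::size_t paddedPitch(std::size_t width, std::size_t align)
{
    assert(align > 0);
    return ((width + align - 1) / align) * align;
}

// Linear index of element (row, col) in a buffer with the given pitch.
std::size_t linearIndex(std::size_t row, std::size_t col, std::size_t pitch)
{
    return row * pitch + col;
}

// Number of elements to allocate for a rows x width array stored with padding.
std::size_t paddedElementCount(std::size_t rows, std::size_t width,
                               std::size_t align)
{
    return rows * paddedPitch(width, align);
}
```

For a 10 x 6 array (10 rows of 6 elements) and an alignment of 4, the pitch would be 8 and the buffer 80 elements; its 6 x 10 transpose would use a pitch of 12 and 72 elements. A transpose kernel would then need both pitches as separate arguments.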

Some simple sample code to generate the above output is as follows:

#include <cstdlib>
#include <iostream>
#include <vector>
#include "cudpp.h"

using namespace std;

int main(int argc, char **argv)
{
    // dimension of the array, taken from the command line
    int dim = (argc > 1) ? atoi(argv[1]) : 0;
    if (dim <= 0)
    {
        cerr << "usage: " << argv[0] << " <dim>" << endl;
        return 1;
    }

    int *d_in, *d_out;                  // device pointers

    // DIM x DIM input array, filled with ones
    vector<int> source(dim * dim, 1);
    size_t mem_size = source.size() * sizeof(int);

    // device buffers
    cudaMalloc((void**) &d_in, mem_size);
    cudaMalloc((void**) &d_out, mem_size);

    // copy source array from host to device
    cudaMemcpy(d_in, &source[0], mem_size, cudaMemcpyHostToDevice);

    // scan plan: forward inclusive integer sum, rowPitch = array width (dim)
    CUDPPHandle scanPlan;
    CUDPPConfiguration config = { CUDPP_SCAN, CUDPP_ADD, CUDPP_INT, CUDPP_OPTION_FORWARD | CUDPP_OPTION_INCLUSIVE };
    cudppPlan(&scanPlan, config, dim*dim, dim, dim);

    // do the scan: dim rows of dim elements each
    cudppMultiScan(scanPlan, d_out, d_in, dim, dim);

    // copy data back (the host array is reused for the result)
    cudaThreadSynchronize();
    cudaMemcpy(&source[0], d_out, mem_size, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    // display it, one row per line
    cout << endl;
    for (int i = 0; i < dim; i++)
    {
        for (int j = 0; j < dim; j++)
        {
            cout << source[i*dim+j] << "\t";
        }
        cout << endl;
    }

    // clean up
    cudppDestroyPlan(scanPlan);
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}