Hello,
I have been using the cudppMultiScan() function as part of the CUDPP library. However, I am not able to get the correct result when I use a rowPitch value that is not a power of two.
I have written some simple code to demonstrate this problem. For example, with a dimension of 6 and an input of all ones, every row of an inclusive add-scan should read 1 2 3 4 5 6, but I get the following output:
1 2 3 4 1 2
3 4 1 2 3 4
1 2 3 4 1 2
3 4 1 2 3 4
5 6 3 4 5 6
7 8 1 2 3 4
As can be seen, the second and subsequent scans start too early in the array each time. If I run the code several times, the numbers change from run to run, which suggests a race condition on the writeback. After experimenting with the parameters, I suspect the problem lies in the rowPitch parameter: for example, adding 1 to it changes nothing, but adding 2 shifts the scan start position by 4. I am currently using the array width as the pitch.
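To make the expected result concrete, here is a minimal host-side sketch of what a multi-row inclusive add-scan with a row pitch should compute. It is plain C++ and not CUDPP code. The function name rowScanReference and its parameter order are hypothetical, chosen only to mirror the numElements / numRows / rowPitch terminology.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of CUDPP): inclusive add-scan
// of each row, where consecutive rows start rowPitch elements apart.
std::vector<int> rowScanReference(const std::vector<int>& in,
                                  std::size_t numElements,
                                  std::size_t numRows,
                                  std::size_t rowPitch)
{
    // A row must fit inside its pitch, and the buffer must hold all rows.
    assert(rowPitch >= numElements);
    assert(in.size() >= numRows * rowPitch);

    std::vector<int> out(in.size(), 0);
    for (std::size_t r = 0; r < numRows; ++r) {
        int sum = 0;  // the running sum restarts at the beginning of every row
        for (std::size_t c = 0; c < numElements; ++c) {
            sum += in[r * rowPitch + c];
            out[r * rowPitch + c] = sum;
        }
    }
    return out;
}
```

Calling rowScanReference(in, 6, 6, 6) on 36 ones gives 1 2 3 4 5 6 in every row, which is the result I would compare the device output against.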
I am using linear memory to store the array because I need to conduct a series of transposes with non-square arrays. Therefore, I cannot allocate 2D memory because the pitches may not match for the transpose. In the online documentation, using 2D memory is advised, but not stated as necessary.
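To clarify what I mean by pitch versus width in linear memory, here is a small host-side sketch. The helper names paddedPitch and toPitched are hypothetical, and the alignment of 16 elements used in the note below is only an illustrative assumption, not a documented CUDPP requirement. Whether padding the rows like this would avoid the problem is something I have not established.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round a row width up to the next multiple of
// `alignment` elements. The alignment value is an illustrative assumption.
std::size_t paddedPitch(std::size_t width, std::size_t alignment)
{
    assert(alignment > 0);
    return ((width + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: copy a dense rows x width array into a buffer whose
// rows start `pitch` elements apart. Padding elements are set to zero.
std::vector<int> toPitched(const std::vector<int>& dense,
                           std::size_t rows,
                           std::size_t width,
                           std::size_t pitch)
{
    assert(pitch >= width);
    assert(dense.size() >= rows * width);

    std::vector<int> padded(rows * pitch, 0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < width; ++c)
            padded[r * pitch + c] = dense[r * width + c];
    return padded;
}
```

With a width of 6 and an assumed alignment of 16, paddedPitch returns 16, so element (r, c) lives at index r*16 + c instead of r*6 + c. In my code the pitch is simply the width, so the two layouts coincide.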
Some simple sample code to generate the above output is as follows:
#include <iostream>
#include <cstdlib> // atoi
#include "cudpp.h"
using namespace std;
// main()
int main (int argc, char **argv)
{
int *d_in, *d_out; // device pointers
int dim = atoi(argv[1]); // dimension of the array
// DIM x DIM input array
int source[dim*dim];
for (int i = 0; i < dim*dim; i++)
source[i] = 1;
size_t mem_size = dim*dim*sizeof(int);
// device buffers
cudaMalloc( (void**) &d_in, mem_size );
cudaMalloc( (void**) &d_out, mem_size );
// copy source array from host to device
cudaMemcpy( (void*) d_in, (void*) source, mem_size, cudaMemcpyHostToDevice );
// scanPlan
CUDPPHandle scanPlan;
CUDPPConfiguration config = { CUDPP_SCAN, CUDPP_ADD, CUDPP_INT, CUDPP_OPTION_FORWARD | CUDPP_OPTION_INCLUSIVE };
cudppPlan(&scanPlan, config, dim*dim, dim, dim);
// do the scan
cudppMultiScan(scanPlan, d_out, d_in, dim, dim);
// copy data back
cudaThreadSynchronize();
cudaMemcpy( (void*) source, (void*) d_out, mem_size, cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
// display it
cout << endl;
for (int i = 0; i < dim; i++)
{
for (int j = 0; j < dim; j++)
{
cout << source[i*dim+j] << "\t";
}
cout << endl;
}
return 0;
}