The following function *sequentially* computes a prefix sum (exclusive scan) over each of `dim` arrays of integers, using CUB's `cub::DeviceScan::ExclusiveSum`.

```
// dataPtr points to the start of `dim` contiguous arrays of integers, each of
// length `size`, for which per-array prefix sums are to be computed.
// Example (dim = 3): [x1, x2 ... xN, y1, y2 ... yN, z1, z2 ... zN]
// radixPtr points to an output buffer with the same layout.
void func( int* radixPtr , int* dataPtr , short size , short dim )
{
    void* tmpStorage = 0;
    size_t offset = 0 , tmpStorageSize = 0;
    // With tmpStorage == NULL, this call only queries the required temp storage
    // size; since every scan has the same length, one query suffices for all.
    cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr , radixPtr , size );
    allocateDeviceMemory( &tmpStorage , tmpStorageSize ); // wrapper for cudaMalloc()
    // Sequential: one scan per array. TODO: parallelize
    for( short i = 0; i < dim; ++i )
    {
        cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr + offset , radixPtr + offset , size );
        offset += size;
    }
    cudaFree( tmpStorage );
}
```

Since each array is independent, it ought to be possible to compute the prefix sums in parallel rather than in series. How would I go about doing this? I know streams are one option; what I'd like to know is whether there are others.
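For concreteness, here is the kind of streams-based version I have in mind (an untested sketch; `funcStreams` is a hypothetical name, and I'm assuming each concurrent scan needs its own temp storage buffer):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

void funcStreams( int* radixPtr , int* dataPtr , short size , short dim )
{
    // Query the per-scan temp storage requirement once; all scans have the same length
    size_t tmpStorageSize = 0;
    cub::DeviceScan::ExclusiveSum( (void*)0 , tmpStorageSize , dataPtr , radixPtr , size );

    std::vector<cudaStream_t> streams( dim );
    std::vector<void*> tmpStorage( dim );
    for( short i = 0; i < dim; ++i )
    {
        cudaStreamCreate( &streams[i] );
        // Separate temp storage per stream, so concurrent scans don't clobber each other
        cudaMalloc( &tmpStorage[i] , tmpStorageSize );
    }

    // Launch one scan per array, each on its own stream
    size_t offset = 0;
    for( short i = 0; i < dim; ++i )
    {
        cub::DeviceScan::ExclusiveSum( tmpStorage[i] , tmpStorageSize ,
                                       dataPtr + offset , radixPtr + offset ,
                                       size , streams[i] );
        offset += size;
    }

    for( short i = 0; i < dim; ++i )
    {
        cudaStreamSynchronize( streams[i] );
        cudaFree( tmpStorage[i] );
        cudaStreamDestroy( streams[i] );
    }
}
```

This relies on the scans being small enough that several can actually execute concurrently on the device, which is part of why I'm wondering about alternatives (e.g. a single batched/segmented operation instead of many small launches).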