Problem
Consider a cubic data distribution (assumed to be of float type) with dimensions Nx, Ny, and Nz along the x, y, and z axes, respectively, and let totN = Nx * Ny * Nz.
Scattered data
Suppose that around 30% of the totN data elements, randomly distributed throughout the domain, need to be manipulated with the help of their neighbors.
Data compression
To avoid data pollution, the new data will be compressed into a new array. The index mapping for this array is formed by performing a prefix sum on a state array (of bool type).
As a result, the new data will be stored contiguously.
A naive implementation
// a fake function to simulate a data manipulation depending on neighbors
__device__
void mathematical(float& origin, float xm, float xp, float ym, float yp, float zm, float zp)
{
origin = xm + 2.f*xp - 3.f*ym + 4.f*yp - 5.f*zm + 6.f*zp;
}
// do the sampling in parallel
// Assume that 'state', 'globaldata', and 'prefixsum' have been computed and are ready.
__global__
void parallel_sample(int Nx, int Ny, int Nz, const bool* state, const float* globaldata, const int* prefixsum, float* destdata)
{
// target data
int originid = threadIdx.x + blockDim.x*blockIdx.x;
// around 30% of the data have a state of true.
if( (originid >= (Nx*Ny*Nz)) || (!state[originid]) ) return;
// data is stored in x, y and z order: id = x + y*Nx + z*Nx*Ny
// note: these neighbor indices are not guarded against the domain boundary
int xmid = originid - 1;
int xpid = originid + 1;
int ymid = originid - Nx;
int ypid = originid + Nx;
int zmid = originid - Nx*Ny;
int zpid = originid + Nx*Ny;
float srcdata = globaldata[originid];
mathematical(srcdata, globaldata[xmid], globaldata[xpid], globaldata[ymid], globaldata[ypid], globaldata[zmid], globaldata[zpid]);
// compact all elements with state == true into the new array
destdata[prefixsum[originid]] = srcdata;
}
Some observations
- The target data elements are randomly distributed, meaning that they may or may not be adjacent to one another. This is why a new data array is needed to extract the results and avoid data pollution.
- Since the data elements are in random positions, global memory reads may not be coalesced, which could be a performance bottleneck. One idea is to use shared memory: first load all global data into shared memory, then perform the mathematical manipulation on the shared data to guarantee coalesced reads. But I am not sure whether the tradeoff between memory capacity and efficiency is acceptable.
- Might other strategies also help, such as steering loads through the L1/read-only caches with the __restrict__ keyword, or adjusting build options?
Thank you for your suggestions.