Scattered memory reads, neighbor manipulation, and writes in 3D cubic data

Problem

Consider a cubic data distribution (assumed to be of float type) with dimensions Nx, Ny, and Nz along the x, y, and z axes, respectively, and let totN = Nx * Ny * Nz.

Scattered data

Suppose that around 30% of the totN data elements, randomly distributed throughout the domain, need to be manipulated with the help of their neighbors.

Data compression

To avoid data pollution, the new data will be compressed into a new array. The index mapping for this array is formed by performing a prefix sum on a state array (of bool type).

As a result, the new data will be stored contiguously.
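For example, assuming an exclusive prefix sum: with state = {1, 0, 1, 1, 0}, prefixsum = {0, 1, 1, 2, 3}, so the three elements with state == true are written contiguously to destdata[0], destdata[1], and destdata[2].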

A naive implementation

// a fake function to simulate a data manipulation depending on neighbors

__device__
void mathematical(float& origin, float xm, float xp, float ym, float yp, float zm, float zp )
{
    origin = xm + 2.f*xp - 3.f*ym + 4.f*yp - 5.f*zm + 6.f*zp;
}

// perform the sampling in parallel
// Assume that 'state', 'globaldata', and 'prefixsum' have been computed and are ready.

__global__
void parallel_sample(int Nx, int Ny, int Nz, const bool* state, const float* globaldata, const int* prefixsum, float* destdata)
{
    // target data
    int originid = threadIdx.x + blockDim.x*blockIdx.x; 
    
    // around 30% of the data have a state of true.
    if( (originid >= (Nx*Ny*Nz)) || (!state[originid]) ) return;

    // data is stored in x, y and z order: id = x + y*Nx + z*Nx*Ny
    int xmid = originid - 1;
    int xpid = originid+1;
    int ymid = originid - Nx;
    int ypid = originid + Nx;
    int zmid = originid - Nx*Ny;
    int zpid = originid + Nx*Ny;

    float srcdata = globaldata[originid];
    
    mathematical(srcdata, globaldata[xmid], globaldata[xpid], globaldata[ymid], globaldata[ypid], globaldata[zmid], globaldata[zpid]);

    // compress all data with state == true into the new, contiguous array
    destdata[prefixsum[originid]] = srcdata;
}
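For completeness, here is one way the host-side setup could look. This is my sketch, not part of the original post; it uses Thrust's exclusive_scan to build the mapping, and run_parallel_sample and the d_* pointer names are placeholders.

// Host-side sketch: compute the exclusive prefix sum of 'state' and launch the
// kernel. All pointers are assumed to be device pointers.

#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void run_parallel_sample(int Nx, int Ny, int Nz,
                         const bool* d_state, const float* d_globaldata,
                         int* d_prefixsum, float* d_destdata)
{
    int totN = Nx * Ny * Nz;

    // prefixsum[i] = number of true elements before i,
    // i.e. the compacted output slot of element i
    thrust::device_ptr<const bool> state_ptr(d_state);
    thrust::device_ptr<int>        scan_ptr(d_prefixsum);
    thrust::exclusive_scan(state_ptr, state_ptr + totN, scan_ptr, 0);

    int block = 256;
    int grid  = (totN + block - 1) / block;
    parallel_sample<<<grid, block>>>(Nx, Ny, Nz, d_state, d_globaldata,
                                     d_prefixsum, d_destdata);
}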

Some observations

  1. The target data elements are randomly distributed, meaning that they may or may not be adjacent to one another. This is why a new data array is needed to extract the results and avoid data pollution.

  2. Since the data elements are in random positions, global memory reads may not be coalesced, which could be a performance bottleneck.

One idea is to use shared memory: first load the global data into shared memory, then perform the mathematical manipulation on the shared data to guarantee coalesced reads. But I am not sure whether the tradeoff between shared-memory capacity and efficiency is acceptable.

  3. Might other strategies also help, such as using the L1/constant caches via the __restrict__ keyword or adjusting build options?

Thank you for your suggestions.

Constant memory only works well if all threads in a warp read the same data.

The texture path can be an option, either with texture instructions or __ldg().
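For illustration, a sketch (not the original code) of the __ldg() variant; declaring the pointers const __restrict__ often lets the compiler emit these read-only loads by itself:

__global__
void parallel_sample_ldg(int Nx, int Ny, int Nz,
                         const bool* __restrict__ state,
                         const float* __restrict__ globaldata,
                         const int* __restrict__ prefixsum,
                         float* __restrict__ destdata)
{
    int originid = threadIdx.x + blockDim.x * blockIdx.x;
    if ((originid >= Nx * Ny * Nz) || (!state[originid])) return;

    // neighbor loads go through the read-only (texture) data path via __ldg()
    float srcdata = globaldata[originid];
    mathematical(srcdata,
                 __ldg(&globaldata[originid - 1]),
                 __ldg(&globaldata[originid + 1]),
                 __ldg(&globaldata[originid - Nx]),
                 __ldg(&globaldata[originid + Nx]),
                 __ldg(&globaldata[originid - Nx * Ny]),
                 __ldg(&globaldata[originid + Nx * Ny]));

    destdata[prefixsum[originid]] = srcdata;
}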

I do not see full use of the prefix sum in the code you showed. The thread IDs do not map to the prefix sum; instead you return when state is false. You only use it for storing the result.

Such problems are difficult to optimize. You can only put small 3D arrays into shared memory, e.g. 32 x 32 x 16 x 4 bytes = 64 KiB. You have to handle border pixels in all directions and use overlapping blocks in shared memory.

You have to fight bank conflicts.

Probably either trusting the caches or using the texture commands to load the surrounding pixels is the best way.
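To illustrate the overlapping-tile idea, here is a rough sketch under the assumption that each block handles a dense BX x BY x BZ sub-volume and that borders are simply clamped; the block sizes and boundary handling are placeholders, not from the original post.

// Launched with dim3 block(BX, BY, BZ) and a grid covering the whole volume.
#define BX 8
#define BY 8
#define BZ 8

__global__
void parallel_sample_smem(int Nx, int Ny, int Nz,
                          const bool* state, const float* globaldata,
                          const int* prefixsum, float* destdata)
{
    __shared__ float tile[BZ + 2][BY + 2][BX + 2]; // 10*10*10*4 B = 4000 B

    int x = blockIdx.x * BX + threadIdx.x;
    int y = blockIdx.y * BY + threadIdx.y;
    int z = blockIdx.z * BZ + threadIdx.z;

    // cooperative load of the tile including a one-element halo; each thread
    // loads the tile elements whose linear index is congruent to its own index
    int tid      = threadIdx.x + BX * (threadIdx.y + BY * threadIdx.z);
    int nthreads = BX * BY * BZ;
    for (int i = tid; i < (BX + 2) * (BY + 2) * (BZ + 2); i += nthreads) {
        int tx = i % (BX + 2);
        int ty = (i / (BX + 2)) % (BY + 2);
        int tz = i / ((BX + 2) * (BY + 2));
        int gx = (int)blockIdx.x * BX + tx - 1;
        int gy = (int)blockIdx.y * BY + ty - 1;
        int gz = (int)blockIdx.z * BZ + tz - 1;
        gx = min(max(gx, 0), Nx - 1);  // clamped borders (placeholder handling)
        gy = min(max(gy, 0), Ny - 1);
        gz = min(max(gz, 0), Nz - 1);
        tile[tz][ty][tx] = globaldata[gx + gy * Nx + gz * Nx * Ny];
    }
    __syncthreads();

    if (x >= Nx || y >= Ny || z >= Nz) return;
    int originid = x + y * Nx + z * Nx * Ny;
    if (!state[originid]) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;
    float result = tile[tz][ty][tx];
    mathematical(result,
                 tile[tz][ty][tx - 1], tile[tz][ty][tx + 1],
                 tile[tz][ty - 1][tx], tile[tz][ty + 1][tx],
                 tile[tz - 1][ty][tx], tile[tz + 1][ty][tx]);
    destdata[prefixsum[originid]] = result;
}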

As you said, constant memory may not be suitable here, since threads within a warp might not access the same data because the memory addresses can be far apart.

The prefix sum is used solely to map the compressed results.

Because each thread requires data from its neighbors, performing in-place result updates would cause data pollution.

But if some threads within a warp return early, you lose their math throughput and memory bandwidth; they still occupy those resources.

Full warps returning would be okay.

So one should better use the prefix sum also to assign a point to each thread, in addition to using it for the storage location of the result; a sketch of this follows below.
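A sketch of that approach (the indexlist array, the build kernel, and n_true are illustrative additions, not from the original post): first scatter the original indices into a compact list using the prefix sum, then launch only as many threads as there are true elements, so whole warps do useful work.

__global__
void build_indexlist(int totN, const bool* state, const int* prefixsum, int* indexlist)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= totN || !state[i]) return;
    indexlist[prefixsum[i]] = i;       // compacted slot -> original index
}

__global__
void parallel_sample_compact(int Nx, int Ny, int n_true,
                             const int* indexlist, const float* globaldata,
                             float* destdata)
{
    int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k >= n_true) return;           // only full warps beyond n_true exit together

    int originid = indexlist[k];       // every active thread does real work
    float result = globaldata[originid];
    mathematical(result,
                 globaldata[originid - 1],       globaldata[originid + 1],
                 globaldata[originid - Nx],      globaldata[originid + Nx],
                 globaldata[originid - Nx * Ny], globaldata[originid + Nx * Ny]);
    destdata[k] = result;              // compacted slot and thread index coincide
}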

If the prefix sum is used for each thread, there might still be a problem with coalesced reads. Even though data with a state of true is compressed, the memory addresses from which they fetch data can still be far apart, which may degrade performance.

Yes, the reads are still difficult.

One possibility would be to divide the shared memory into its 32 banks and let each thread read only from its assigned bank, and use the threads on independent data. For storing, you can re-sort.

So the shared memory would contain 32 small 3D arrays.

32 x 16 x 16 x 3 x 4 bytes = 96 KiB.

One would move that 3D block along the direction with length 3.
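A very rough sketch of that layout, under the assumptions implied by the numbers above (16 x 16 x 3 sub-arrays of floats, one per lane; this needs the >48 KiB dynamic shared-memory opt-in via cudaFuncAttributeMaxDynamicSharedMemorySize and hardware with at least 96 KiB of shared memory per block): the sub-arrays are interleaved so that sub-array 'lane' only ever touches shared-memory bank 'lane'.

#define SUB_X 16
#define SUB_Y 16
#define SUB_Z 3

__device__ __forceinline__
float& sub_elem(float* smem, int lane, int x, int y, int z)
{
    // element i of sub-array 'lane' lives at word i*32 + lane, and
    // (i*32 + lane) % 32 == lane, so each lane stays in its own bank
    int i = x + SUB_X * (y + SUB_Y * z);
    return smem[i * 32 + lane];
}

__global__
void banked_kernel()                  // real parameters omitted in this sketch
{
    extern __shared__ float smem[];   // 32 * 16 * 16 * 3 * 4 B = 96 KiB at launch
    int lane = threadIdx.x & 31;      // each lane owns one of the 32 sub-arrays

    // fill sub-array 'lane' with one 16x16x3 window of the volume, process it,
    // then slide the window along the length-3 direction and repeat
    sub_elem(smem, lane, 0, 0, 0) = 0.f;   // placeholder access
}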