Efficient 2D memory access for row-based operations on a contiguous array

I’m experienced with C/C++ optimization but relatively new to CUDA programming. I’m trying to port a multi-threaded CPU-based algorithm to the GPU and am running into poor performance that seems related to memory access patterns. The single-threaded version of the CPU code I’m looking to port looks like this:

// 'in' and 'out' are of length 'numCols'
void applyToRow(const std::complex<float>* in,
                size_t numCols,
                const Info& info,
                std::complex<float>* out)
{
    double current = info.start;
    double delta = info.delta;
    for (size_t ii = 0; ii < numCols; ++ii)
    {
        out[ii] = std::complex<float>(in[ii].real() * (float)current, in[ii].imag() * (float)current);
        current *= delta;
    }
}

// 'in' and 'out' are numRows * numCols contiguous memory
// 'info' is of length numRows (i.e. varies per row)
void applyAll(const std::complex<float>* in, 
              size_t numRows,
              size_t numCols,
              const Info* info,
              std::complex<float>* out)
{
    for (size_t row = 0; row < numRows; ++row)
    {
        applyToRow(in, numCols, info[row], out);
        in += numCols;
        out += numCols;
    }
}

Let’s simplify to a toy problem to illustrate the performance issue I’m having. I have a 2D array numRows x numCols stored in contiguous memory on the order of numRows = 8000, numCols = 10000. For this toy problem, let’s say life is simple and each pixel can be computed completely independently and that the input buffer has been allocated on the device. That means I can do something like this:

__global__
void ideal(size_t len, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t ii = start; ii < len; ii += stride)
    {
        buffer[ii] *= 10.0f;
    }
}

Then I call this like…

const size_t BLOCK_SIZE = 256;
const size_t len = numRows * numCols;
const size_t numBlocks1D = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

ideal<<<numBlocks1D, BLOCK_SIZE>>>(len, buffer);

This runs fine. Now let’s say that I want to refactor this to treat it as a 2D array where a unit of work is performing the operation across all cols of an entire row. There’d be no reason to do this in the toy problem, but there is in the real problem where I have ‘info’ varying per row…

So, my initial approach was:

__global__
void twoDSimple(size_t numRows, size_t numCols, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        float* rowPtr = buffer + row * numCols;
        for (size_t col = 0; col < numCols; ++col)
        {
            rowPtr[col] *= 10.0f;
        }
    }
}

Then calling like…

const size_t numBlocks2D = (numRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
twoDSimple<<<numBlocks2D, BLOCK_SIZE>>>(numRows, numCols, buffer);

This implementation is horrendously slow compared to the 1D version for any settings of numRows and numCols, including when numCols is a nice multiple of 1024. My intuition from CPU programming is that this is a nice memory access pattern, since I’m giving each thread contiguous memory to work on, but I suspect something horrible is happening from a memory coalescing perspective. So, I can flip the indexing around and still have each thread work on a single row, but have each “row” stride down the array instead of across it:

__global__
void twoDFlipIndexing(size_t numRows, size_t numCols, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        for (size_t col = 0, idx = row; col < numCols; ++col, idx += numRows)
        {
            buffer[idx] *= 10.0f;
        }
    }
}

Calling it the exact same way as the previous call:

const size_t numBlocks2D = (numRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
twoDFlipIndexing<<<numBlocks2D, BLOCK_SIZE>>>(numRows, numCols, buffer);

This performs significantly better, with performance approaching the original 1D indexing implementation. I assume what’s happening is that adjoining threads are now all pulling from contiguous memory that is one element apart instead of numCols apart, even though my inner for loop is now striding by a large numRows value? Is this the recommended access pattern for this kind of problem? Are there any links someone can point me to discussing this? I can find plenty of general material on efficient memory access but nothing directly speaking to this, which I’d think is a common data access pattern.

This approach is annoying in that I have to redo my indexing everywhere, but it is otherwise doable if it’s really the most efficient way to do this.

Would I be better off allocating each row in a separate cudaMalloc() to avoid contiguous memory altogether? That would be problematic in other areas of my application where the image size changes… right now I’m allocating one large ping/pong buffer set to the max size and reusing it throughout processing.

Coming back to the original non-toy problem briefly, the alternative is to implement it as a 1D operation too, but now the ‘current *= delta’ line becomes a pow():

__global__
void applyToRow(const std::complex<float>* in,
                size_t len,
                Info info,
                std::complex<float>* out)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t ii = start; ii < len; ii += stride)
    {
        const double current = info.start * pow(info.delta, ii);
        out[ii] = std::complex<float>(in[ii].real() * (float)current, in[ii].imag() * (float)current);
    }
}
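
For context, I’m calling this once per row, roughly like the following sketch (the loop and ‘numBlocksRow’ here are simplifications; ‘info’ is the host-side array of per-row parameters):

const size_t numBlocksRow = (numCols + BLOCK_SIZE - 1) / BLOCK_SIZE;
for (size_t row = 0; row < numRows; ++row)
{
    // one launch per row; each launch gets that row's Info by value
    applyToRow<<<numBlocksRow, BLOCK_SIZE>>>(in + row * numCols,
                                             numCols,
                                             info[row],
                                             out + row * numCols);
}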

This call takes longer than twoDFlipIndexing() (but is faster than twoDSimple()).

Thanks for the design advice.
-Adam

Yes, you’re exploring coalesced vs. uncoalesced access on the GPU, and the difference can be monumental.

You generally want adjacent threads in a warp to access adjacent locations in memory. This can be readily accomplished if the index uses (amongst other things, perhaps) threadIdx.x without any multiplicative factors on it.
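
As a toy illustration of what that means in terms of the index expression:

// coalesced: consecutive threads in a warp touch consecutive elements
buffer[blockIdx.x * blockDim.x + threadIdx.x] *= 10.0f;

// uncoalesced: consecutive threads are numCols elements apart
buffer[(blockIdx.x * blockDim.x + threadIdx.x) * numCols] *= 10.0f;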

Coalescing on the GPU is covered in many places; the following presentation goes through it in detail:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

The coverage starts on slide 25, with emphasis on slides 36-45.

For 2D problems, you want a given thread to run down a column, not across a row. Adjacent threads run down adjacent columns.
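
Applied to your toy kernel, keeping the row-major storage as-is, that pattern would look something like this (untested sketch):

__global__
void twoDColumnPerThread(size_t numRows, size_t numCols, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    // each thread owns one or more columns and walks down them; on every
    // iteration of the inner loop a warp touches a contiguous run of one
    // row, so the accesses coalesce even though the data stays row-major
    for (size_t col = start; col < numCols; col += stride)
    {
        for (size_t row = 0; row < numRows; ++row)
        {
            buffer[row * numCols + col] *= 10.0f;
        }
    }
}

// launched over columns rather than rows:
const size_t numBlocksCols = (numCols + BLOCK_SIZE - 1) / BLOCK_SIZE;
twoDColumnPerThread<<<numBlocksCols, BLOCK_SIZE>>>(numRows, numCols, buffer);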

The profilers will also readily highlight the difference between the two cases, for example by asking for the gld_efficiency and gst_efficiency metrics for your kernel(s).
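
For example, with the command-line profiler (substituting your own executable name):

nvprof --metrics gld_efficiency,gst_efficiency ./myApp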

Thanks!

OK, this now makes sense - the pictures in that presentation make what was going on pretty clear. This also implies that I should be adding a few pad columns to the end of each row to hit a multiple of 128 bytes, so that I’m pulling in one cache line rather than two.
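
If I go that route, I assume cudaMallocPitch() would handle the padding for me rather than my computing it by hand - something like this sketch:

float* buffer = nullptr;
size_t pitchBytes = 0;
// pitch comes back as the padded row width in bytes, aligned to whatever
// the device prefers, so every row starts on a nicely aligned boundary
cudaMallocPitch((void**)&buffer, &pitchBytes, numCols * sizeof(float), numRows);
const size_t pitchElems = pitchBytes / sizeof(float);
// element (row, col) then lives at buffer[row * pitchElems + col]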

After I do several operations similar to this, I then want to take a 1D FFT across each row, which I do via cufftPlanMany(). Right now, with my data organized so that a “logical row” of the image is in contiguous memory, the stride between each input and output element is 1 and I can pass NULL in for the embed parameters. With this change, my data will be organized so that a “logical row” of the image has a stride of numRows between each element. I can represent that in the cufftPlanMany() call, but I’m wondering if there will be a significant performance penalty for doing so.
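
Concretely, I’m picturing something like this rough sketch (assuming the transposed layout, so a logical row’s elements are numRows apart in memory and consecutive rows start one element apart; ‘d_in’ and ‘d_out’ are just placeholder names for the device buffers):

int n[1] = { (int)numCols };        // FFT length = one logical row
cufftHandle plan;

// advanced data layout: element stride of numRows within a row,
// distance 1 between the starts of consecutive rows, numRows batches
cufftPlanMany(&plan, 1, n,
              n, (int)numRows, 1,   // inembed, istride, idist
              n, (int)numRows, 1,   // onembed, ostride, odist
              CUFFT_C2C, (int)numRows);

cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);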

I’m not sure if it will make much difference in the batched case. It should be more or less trivial to compare row-wise vs. column-wise FFTs.

However, the data transformation might not be necessary. It may be possible to arrange a group of threads to work cooperatively on a row - but I don’t know exactly what your row-wise calculation is, currently. In any event you would only need to consider this if you found CUFFT was significantly slower in the column-wise direction.
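
On the toy problem, “cooperatively” could be as simple as one block per row, with the threads of that block striding across the row together, e.g.:

__global__
void twoDBlockPerRow(size_t numRows, size_t numCols, float* buffer)
{
    // one block per row (grid-stride over rows); within a row the threads
    // of the block stride across the columns, so each warp reads and
    // writes a contiguous, coalesced chunk of that row
    for (size_t row = blockIdx.x; row < numRows; row += gridDim.x)
    {
        float* rowPtr = buffer + row * numCols;
        for (size_t col = threadIdx.x; col < numCols; col += blockDim.x)
        {
            rowPtr[col] *= 10.0f;
        }
    }
}

// e.g. twoDBlockPerRow<<<numRows, BLOCK_SIZE>>>(numRows, numCols, buffer);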

That’s encouraging regarding the expected FFT performance.

The single-threaded version of the CPU code I’m porting amounts to:

// 'in' and 'out' are of length 'numCols'
void applyToRow(const std::complex<float>* in,
                size_t numCols,
                const Info& info,
                std::complex<float>* out)
{
    double current = info.start;
    double delta = info.delta;
    for (size_t ii = 0; ii < numCols; ++ii)
    {
        out[ii] = std::complex<float>(in[ii].real() * (float)current, in[ii].imag() * (float)current);
        current *= delta;
    }
}

// 'in' and 'out' are numRows * numCols contiguous memory
// 'info' is of length numRows (i.e. varies per row)
void applyAll(const std::complex<float>* in, 
              size_t numRows,
              size_t numCols,
              const Info* info,
              std::complex<float>* out)
{
    for (size_t row = 0; row < numRows; ++row)
    {
        applyToRow(in, numCols, info[row], out);
        in += numCols;
        out += numCols;
    }
}

So my initial naive attempt at doing this in CUDA was:

// 'in' is numRows x numCols
// 'out' is numRows x numCols
// 'info' is numRows
__global__
void applyAll(const std::complex<float>* in,
              const Info* info,
              size_t numRows,
              size_t numCols,
              std::complex<float>* out)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        double current = info[row].start;
        const double delta = info[row].delta;
        const size_t offset = row * numCols;
        const std::complex<float>* inPtr = in + offset;
        std::complex<float>* outPtr = out + offset; 
        
        for (size_t col = 0; col < numCols; ++col)
        {
            outPtr[col] = std::complex<float>(inPtr[col].real() * (float)current,
                                              inPtr[col].imag() * (float)current);
            current *= delta;
        }
    }
}

const size_t numBlocks2D = (numRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
applyAll<<<numBlocks2D, BLOCK_SIZE>>>(in, info, numRows, numCols, out);

This of course suffers from the memory coalescing problems you pointed out, so I’ll be changing it to index differently. In order for threads to be able to work cooperatively on a row, I’d need to be able to compute that ‘current’ value (which varies per row and col) more cheaply than brute-forcing it with ‘pow(delta, col)’. I’d looked into some examples of parallel prefix sum/scan, including Mark Harris’s presentation; this seems like a related calculation to what I’m trying to do with ‘current’, but I couldn’t quite get it to apply to this. I’m certainly interested in any ideas you have here.
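
For reference, what I mean by “index differently” is essentially the twoDFlipIndexing pattern applied to this kernel - storing the image transposed so that element (row, col) lives at col * numRows + row. A rough sketch of what I’m planning (I’ve swapped std::complex for the cuFloatComplex type from cuComplex.h on the device side; Info is the same start/delta struct as above):

#include <cuComplex.h>

__global__
void applyAllFlipped(const cuFloatComplex* in,
                     const Info* info,
                     size_t numRows,
                     size_t numCols,
                     cuFloatComplex* out)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        double current = info[row].start;
        const double delta = info[row].delta;
        // adjacent threads handle adjacent rows, which are adjacent in
        // memory in the transposed layout, so the accesses coalesce and
        // each thread still keeps its cheap running 'current *= delta'
        for (size_t col = 0, idx = row; col < numCols; ++col, idx += numRows)
        {
            out[idx] = make_cuFloatComplex(cuCrealf(in[idx]) * (float)current,
                                           cuCimagf(in[idx]) * (float)current);
            current *= delta;
        }
    }
}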

Adam