I’m experienced in C/C++ optimizations but relatively new to CUDA programming. I’m trying to port a multi-threaded CPU-based algorithm to the GPU and running into poor performance that seems related to memory access patterns. The single-threaded version of the CPU code I’m looking to port looks like this:
// 'in' and 'out' are of length 'numCols'
void applyToRow(const std::complex<float>* in,
                size_t numCols,
                const Info& info,
                std::complex<float>* out)
{
    double current = info.start;
    const double delta = info.delta;
    for (size_t ii = 0; ii < numCols; ++ii)
    {
        out[ii] = std::complex<float>(in[ii].real() * (float)current,
                                      in[ii].imag() * (float)current);
        current *= delta;
    }
}
// 'in' and 'out' are numRows * numCols contiguous memory
// 'info' is of length numRows (i.e. varies per row)
void applyAll(const std::complex<float>* in,
              size_t numRows,
              size_t numCols,
              const Info* info,
              std::complex<float>* out)
{
    for (size_t row = 0; row < numRows; ++row)
    {
        applyToRow(in, numCols, info[row], out);
        in += numCols;
        out += numCols;
    }
}
Let’s simplify to a toy problem to illustrate the performance issue I’m having. I have a 2D array numRows x numCols stored in contiguous memory on the order of numRows = 8000, numCols = 10000. For this toy problem, let’s say life is simple and each pixel can be computed completely independently and that the input buffer has been allocated on the device. That means I can do something like this:
__global__
void ideal(size_t len, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t ii = start; ii < len; ii += stride)
    {
        buffer[ii] *= 10.0f;
    }
}
Then I call this like…
const size_t BLOCK_SIZE = 256;
const size_t len = numRows * numCols;
const size_t numBlocks1D = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
ideal<<<numBlocks1D, BLOCK_SIZE>>>(len, buffer);
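In case the measurement methodology matters, I'm timing the kernels roughly along these lines with CUDA events (a minimal sketch with error checking omitted):

```cuda
// Timing sketch using CUDA events, which measure work on the GPU
// rather than just the host-side launch overhead.
cudaEvent_t startEvent, stopEvent;
cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent);
ideal<<<numBlocks1D, BLOCK_SIZE>>>(len, buffer);
cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);

float ms = 0.0f;
cudaEventElapsedTime(&ms, startEvent, stopEvent);

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);
```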
This runs fine. Now let’s say that I want to refactor this to treat it as a 2D array where a unit of work is performing the operation across all cols of an entire row. There’d be no reason to do this in the toy problem, but there is in the real problem where I have ‘info’ varying per row…
So, my initial approach was:
__global__
void twoDSimple(size_t numRows, size_t numCols, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        float* rowPtr = buffer + row * numCols;
        for (size_t col = 0; col < numCols; ++col)
        {
            rowPtr[col] *= 10.0f;
        }
    }
}
Then I call it like…
const size_t numBlocks2D = (numRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
twoDSimple<<<numBlocks2D, BLOCK_SIZE>>>(numRows, numCols, buffer);
This implementation is horrendously slower for any settings of numRows and numCols, including when numCols is a nice multiple of 1024. My intuition from CPU programming is that this is a nice memory access pattern since I’m giving each thread contiguous memory to work on, but I think something horrible is happening from a memory coalescing perspective. So, I can flip the indexing around and still have each thread work on a single row, but instead have a “row” stride down instead of across:
__global__
void twoDFlipIndexing(size_t numRows, size_t numCols, float* buffer)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t row = start; row < numRows; row += stride)
    {
        for (size_t col = 0, idx = row; col < numCols; ++col, idx += numRows)
        {
            buffer[idx] *= 10.0f;
        }
    }
}
Calling it the exact same way as the previous call:
const size_t numBlocks2D = (numRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
twoDFlipIndexing<<<numBlocks2D, BLOCK_SIZE>>>(numRows, numCols, buffer);
This performs significantly better, with performance approaching the original 1D indexing implementation. I assume what's happening is that adjacent threads are now pulling from adjacent memory locations (one element apart instead of numCols apart), even though each thread's inner loop now strides by a large numRows value?

Is this the recommended access pattern for this kind of problem? Are there any links someone can point me to discussing this? I can find plenty of general material on efficient memory access, but nothing directly addressing what I'd think is a common way to access data. This approach is annoying in that I have to redo my indexing everywhere, but it's otherwise doable if it really is the most efficient way to do this.

Would I be better off allocating each row in a separate cudaMalloc() to avoid contiguous memory altogether? That would be problematic in other areas of my application where the image size changes; right now I'm allocating one large ping/pong buffer set to the max size and reusing it throughout processing.
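For the record, here's a sketch of the other layout-preserving alternative I've been wondering about, if I understand coalescing correctly: keep row-major storage but map consecutive threads to consecutive columns of the same row, using a 2D grid so each image row gets its own row of blocks (untested beyond paper):

```cuda
// Hypothetical variant: keep row-major storage, but have threads within
// a warp touch consecutive columns of the same row, so each warp's
// loads/stores hit contiguous memory. Rows are covered by blockIdx.y.
__global__
void twoDRowMajor(size_t numRows, size_t numCols, float* buffer)
{
    const size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t row = blockIdx.y;
    if (row < numRows && col < numCols)
    {
        buffer[row * numCols + col] *= 10.0f;
    }
}

// Launched with one grid row per image row:
// dim3 grid((numCols + BLOCK_SIZE - 1) / BLOCK_SIZE, numRows);
// twoDRowMajor<<<grid, BLOCK_SIZE>>>(numRows, numCols, buffer);
```

This would let per-row state (like 'info') be looked up from blockIdx.y without changing the memory layout, but I haven't measured it against twoDFlipIndexing().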
Coming back to the original non-toy problem briefly, the alternative is to implement it as a 1D operation too, but now the current *= delta line becomes a pow():
__global__
void applyToRow(const std::complex<float>* in,
                size_t len,
                Info info,
                std::complex<float>* out)
{
    const size_t start = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t stride = blockDim.x * gridDim.x;
    for (size_t ii = start; ii < len; ii += stride)
    {
        const double current = info.start * pow(info.delta, (double)ii);
        out[ii] = std::complex<float>(in[ii].real() * (float)current,
                                      in[ii].imag() * (float)current);
    }
}
This call takes longer than twoDFlipIndexing() (but is faster than twoDSimple()).
Thanks in advance for any design advice.
-Adam