Optimizing device memory operations.

Pretty sure this is a classic CUDA optimization problem and there’s someone here who lives for this sort of thing. I know I would if I were any good at it :)

This is running on the Jetson TX2.

Trying to optimize memory fetch operations as they’re a huge bottleneck in my code. Below is a simplified version of the code that I’m trying to optimize. I haven’t run this exact code and I make no guarantees that it’s free of typographical errors, but it should illustrate the concept.

Here is the unoptimized concept.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

//CHECK() is the usual error-checking wrapper around CUDA calls, something like:
#define CHECK(call) do { cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        printf("CUDA error %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); } } while (0)

#define WIDTH 2048
#define HEIGHT 1024
#define DEPTH 256

__global__ void MapValues(uint32_t *offsetArray, uint32_t *mappingArray, uint32_t *destinationArray)
{
    unsigned x = blockDim.x*blockIdx.x+threadIdx.x;
    unsigned y = blockDim.y*blockIdx.y+threadIdx.y;
    if (x >= WIDTH || y >= HEIGHT) return;

    //Per-pixel offset selects which of the DEPTH entries to fetch for this (x, y)
    uint32_t offset = offsetArray[y*WIDTH+x];
    uint32_t mappedValue = mappingArray[y*WIDTH*DEPTH+x*DEPTH+offset];
    destinationArray[y*WIDTH+x] = mappedValue;
}

int main()
{
    uint32_t *offsetArray;
    uint32_t *mappingArray;
    uint32_t *destinationArray;

    CHECK(cudaMalloc(&offsetArray, sizeof(uint32_t)*WIDTH*HEIGHT));
    CHECK(cudaMalloc(&mappingArray, sizeof(uint32_t)*WIDTH*HEIGHT*DEPTH));
    CHECK(cudaMalloc(&destinationArray, sizeof(uint32_t)*WIDTH*HEIGHT));

    //offset and mapping arrays get populated here. 

    dim3 blockSize (32, 16);
    dim3 blockCount ((WIDTH+blockSize.x-1)/blockSize.x, (HEIGHT+blockSize.y-1)/blockSize.y);
    for (int i = 0; i < 500; i++)
    {
        //Do some setup
        MapValues<<<blockCount, blockSize>>>(offsetArray, mappingArray, destinationArray);
        CHECK(cudaDeviceSynchronize());
        //Do something with the results
    }
    //clean up
}

There are other computations that take place inside the __global__ function, but these are the guts of it. Execution time for the device code is around 12 seconds. If I hardcode mappedValue, execution time drops to 1.5 seconds; if I then hardcode offset as well, execution time stays about the same, so that last 1.5 seconds is spent performing the write.
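
To be clear about what I mean by "hardcoding", the timing experiment just swaps the gather for a value the compiler can't remove, in a copy of the kernel roughly like this (the exact expression is arbitrary):

__global__ void MapValuesNoGather(uint32_t *offsetArray, uint32_t *destinationArray)
{
    unsigned x = blockDim.x*blockIdx.x+threadIdx.x;
    unsigned y = blockDim.y*blockIdx.y+threadIdx.y;
    if (x >= WIDTH || y >= HEIGHT) return;

    uint32_t offset = offsetArray[y*WIDTH+x];    //offsetArray read still happens
    uint32_t mappedValue = offset;               //"hardcoded": no mappingArray gather
    destinationArray[y*WIDTH+x] = mappedValue;   //runtime drops from ~12 s to ~1.5 s
}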

The mappingArray lookups are doing that unfortunate uncoalesced thing: neighboring threads in a warp are DEPTH elements (1 KB) apart, so each gather is scattered across many memory transactions and most of every cache line fetched goes unused. The offsetArray accesses aren’t the problem, since they’re contiguous across the warp and likely coming out of L1 cache with close to 100% utilization. I’ve tried flattening the launch to one dimension with block.x = 256 and the performance change is negligible.
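
For reference, the flattened version looks roughly like this; the gather indexing works out to the same addresses, so the 1 KB stride between neighboring threads doesn’t go away:

__global__ void MapValuesFlat(uint32_t *offsetArray, uint32_t *mappingArray, uint32_t *destinationArray)
{
    unsigned idx = blockDim.x*blockIdx.x+threadIdx.x;    //0 .. WIDTH*HEIGHT-1
    if (idx >= WIDTH*HEIGHT) return;

    uint32_t offset = offsetArray[idx];
    destinationArray[idx] = mappingArray[(size_t)idx*DEPTH+offset];    //same addresses as the 2-D version
}

//Launched as:
//MapValuesFlat<<<(WIDTH*HEIGHT+255)/256, 256>>>(offsetArray, mappingArray, destinationArray);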

I don’t think it’s worth porting this to shared memory: it would change the way I’m scanning the arrays, require shuffling data around on the CPU between kernel launches, and make the code very, very difficult to read.

I have also tried batching loads into local arrays to get some software prefetching (several loads in flight before their results are needed), but I’m not seeing any performance improvement, so I’m wondering whether this needs to be coded in a particular way for the overlap to actually happen.

#define PREFETCH_SIZE 8

__global__ void MapValues(uint32_t *offsetArray, uint32_t *mappingArray, uint32_t *destinationArray)
{
    unsigned x = blockDim.x*blockIdx.x+threadIdx.x;
    //Each thread handles PREFETCH_SIZE consecutive rows, so the grid only needs 1/PREFETCH_SIZE of the blocks in y (see the launch below)
    unsigned y = (blockDim.y*blockIdx.y+threadIdx.y)*PREFETCH_SIZE;
    if (x >= WIDTH || y >= HEIGHT) return;

    uint32_t offsets[PREFETCH_SIZE];
    uint32_t mappedValues[PREFETCH_SIZE];

    //Issue all of the offset loads first so several can be in flight at once
    for (int i = 0; i < PREFETCH_SIZE; i++)
    {
        offsets[i] = offsetArray[(y+i)*WIDTH+x];
    }

    //Then issue the dependent gathers
    for (int i = 0; i < PREFETCH_SIZE; i++)
    {
        mappedValues[i] = mappingArray[(y+i)*WIDTH*DEPTH+x*DEPTH+offsets[i]];
    }

    //Finally write the results out
    for (int i = 0; i < PREFETCH_SIZE; i++)
    {
        destinationArray[(y+i)*WIDTH+x] = mappedValues[i];
    }
}
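
The launch for this variant shrinks the grid in y accordingly (HEIGHT is a multiple of blockSize.y*PREFETCH_SIZE, so the division is exact):

dim3 blockSize (32, 16);
dim3 blockCount ((WIDTH+blockSize.x-1)/blockSize.x, HEIGHT/(blockSize.y*PREFETCH_SIZE));
MapValues<<<blockCount, blockSize>>>(offsetArray, mappingArray, destinationArray);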

The above code is based on this example, but perhaps going through small local arrays rather than scalar temporaries (they can spill to local memory instead of staying in registers) is defeating the overlap I’d expect from batching the loads.

temp = array[0];
for (i = 0; i < N-1; i++) {
    temp2 = array[i+1];
    sum += temp;
    temp = temp2;
}
sum += temp;

Any tips are appreciated. This already runs about 10x faster than the best optimization I managed on my i7 desktop, but the full computation still takes many hours, and I think I’m leaving a lot of performance on the table with how I’m handling memory operations. I’m hoping it’s possible to stage large chunks of data in faster cache memory and read from that directly, or maybe there’s something really stupid I’m doing that’s creating this bottleneck.
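
To make the "faster cache" idea concrete, this is roughly what I had in mind, assuming the read-only data cache (__ldg / const __restrict__) is the relevant mechanism on the TX2's Pascal SM (I haven't measured whether it helps):

__global__ void MapValuesReadOnly(const uint32_t* __restrict__ offsetArray,
                                  const uint32_t* __restrict__ mappingArray,
                                  uint32_t* __restrict__ destinationArray)
{
    unsigned x = blockDim.x*blockIdx.x+threadIdx.x;
    unsigned y = blockDim.y*blockIdx.y+threadIdx.y;
    if (x >= WIDTH || y >= HEIGHT) return;

    //__ldg routes these loads through the read-only (texture) cache path
    uint32_t offset = __ldg(&offsetArray[y*WIDTH+x]);
    uint32_t mappedValue = __ldg(&mappingArray[(size_t)(y*WIDTH+x)*DEPTH+offset]);
    destinationArray[y*WIDTH+x] = mappedValue;
}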