Shared memory issue with large processing

Hey all,

I’ve been fighting this for a while, and have been unable to resolve a problem that seems linked to shared memory.

I am attempting to generate a lens flare effect using the entire image, so essentially I am trying to apply every pixel from my input image to every pixel in my output image. The code inside my __global__ function looks like this.

__global__
void apply_sun_flare_v01_d( float * in, int inWidth, int inHeight,
                            float * out, int outWidth, int outHeight )
{
    float sum[3] = { 0.0f, 0.0f, 0.0f };
    float norm = 0.0f;

    int x = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    int y = blockIdx.y * BLOCK_HEIGHT + threadIdx.y;

    for( int blockRow = 0; blockRow < inHeight; blockRow += BLOCK_HEIGHT )
    {
        int readY = blockRow + threadIdx.y;

        for( int blockCol = 0; blockCol < inWidth; blockCol += BLOCK_WIDTH )
        {
            int readX = blockCol + threadIdx.x;
            int index = ( readY * inWidth + readX ) * 3;

            __shared__ float inPix[BLOCK_HEIGHT][BLOCK_WIDTH][3];

            __syncthreads();

            // load the array of pixel colours ( R, G, B )
            inPix[threadIdx.y][threadIdx.x][0] = in[index+0];
            inPix[threadIdx.y][threadIdx.x][1] = in[index+1];
            inPix[threadIdx.y][threadIdx.x][2] = in[index+2];

            __syncthreads();

            // accumulate every pixel of this tile into the thread's output pixel,
            // weighted by its distance from ( x, y )
            for( int row = 0; row < BLOCK_HEIGHT; row++ )
            {
                for( int col = 0; col < BLOCK_WIDTH; col++ )
                {
                    float diffInOut;
                    diffInOut  = 1.0f - ( fabs( (float)( x - blockCol - col ) ) / (float)inWidth );
                    diffInOut *= 1.0f - ( fabs( (float)( y - blockRow - row ) ) / (float)inHeight );

                    sum[0] += inPix[row][col][0] * diffInOut;
                    sum[1] += inPix[row][col][1] * diffInOut;
                    sum[2] += inPix[row][col][2] * diffInOut;

                    norm += diffInOut;
                }
            }
            __syncthreads();
        }
    }

    // write the normalised result for this thread's output pixel
    out[(y * outWidth + x) * 3 + 0] = sum[0] / norm;
    out[(y * outWidth + x) * 3 + 1] = sum[1] / norm;
    out[(y * outWidth + x) * 3 + 2] = sum[2] / norm;
}

(I have updated the code so it uses shared memory and compiles.)

The input and output images are stored as RGB float arrays of size
inputSize = inWidth * inHeight * 3 * sizeof(float)
outputSize = outWidth * outHeight * 3 * sizeof(float)

Where the in/out width is currently 1920 and the in/out height is currently 1080.
BLOCK_HEIGHT and BLOCK_WIDTH are both 8, although I have also tried 16.

The code as it is shown right now will either return nothing but zeros in the output buffer, or completely lock up my system.

The code will run successfully if I either reduce the range of the for loops:

for( int inRow = 0; inRow < BLOCK_HEIGHT*2; inRow += BLOCK_HEIGHT )
for( int inCol = 0; inCol < BLOCK_WIDTH*2; inCol += BLOCK_WIDTH )

or simply set

int inRow = by * BLOCK_HEIGHT + ty;
int inCol = bx * BLOCK_WIDTH + tx;

Or if I remove the copy from input memory to shared memory and write dummy values instead:

inPix[ty][tx][0] = ty;
inPix[ty][tx][1] = tx;
inPix[ty][tx][2] = 0;

So, somehow the two simply aren’t meshing well with each other. If anyone has a suggestion on what I am doing wrong here I would really appreciate it.

Thanks,
Brian!

Ok… I have an idea about what is going wrong. I get the error message:

error code 6
the launch timed out and was terminated

Which means my real problem is most likely a built-in timeout. It seems that the program stops at around 10 seconds. So how do I increase the limit… and do I want to?

Brian!

Maybe it would be a better idea to make it multi-pass, or improve the algorithm. 10 s seems too long for a lens flare; you could use a parallel scan or something.

I haven’t checked the code, but the time limit is 5 seconds, not 10.

Split a kernel that runs longer than 5 seconds into many smaller launches.
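Roughly something like this (untested sketch, just to illustrate the idea; ROWS_PER_LAUNCH, d_sumBuf and d_normBuf are made-up names, and the shared-memory tiling is left out to keep it short): each launch only walks a strip of the input rows and accumulates into running sum/weight buffers, then you normalise at the end.

__global__ void flare_partial_d( const float * in, int inWidth, int inHeight,
                                 int rowStart, int rowEnd,
                                 float * sumBuf, float * normBuf,
                                 int outWidth, int outHeight )
{
    int x = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    int y = blockIdx.y * BLOCK_HEIGHT + threadIdx.y;
    if( x >= outWidth || y >= outHeight )
        return;

    float sum[3] = { 0.0f, 0.0f, 0.0f };
    float norm = 0.0f;

    // only the input rows [rowStart, rowEnd) are visited in this launch
    for( int row = rowStart; row < rowEnd; row++ )
    {
        for( int col = 0; col < inWidth; col++ )
        {
            float w  = 1.0f - fabsf( (float)( x - col ) ) / (float)inWidth;
                  w *= 1.0f - fabsf( (float)( y - row ) ) / (float)inHeight;

            int i = ( row * inWidth + col ) * 3;
            sum[0] += in[i+0] * w;
            sum[1] += in[i+1] * w;
            sum[2] += in[i+2] * w;
            norm   += w;
        }
    }

    // accumulate the partial results across launches
    // ( sumBuf and normBuf must be cudaMemset to zero before the first launch )
    int o = y * outWidth + x;
    sumBuf[o*3+0] += sum[0];
    sumBuf[o*3+1] += sum[1];
    sumBuf[o*3+2] += sum[2];
    normBuf[o]    += norm;
}

Host side:

dim3 block( BLOCK_WIDTH, BLOCK_HEIGHT );
dim3 grid( outWidth / BLOCK_WIDTH, outHeight / BLOCK_HEIGHT );

for( int rowStart = 0; rowStart < inHeight; rowStart += ROWS_PER_LAUNCH )
{
    int rowEnd = rowStart + ROWS_PER_LAUNCH;
    if( rowEnd > inHeight )
        rowEnd = inHeight;

    flare_partial_d<<< grid, block >>>( d_in, inWidth, inHeight,
                                        rowStart, rowEnd,
                                        d_sumBuf, d_normBuf,
                                        outWidth, outHeight );
    cudaThreadSynchronize();   // each launch finishes well under the watchdog limit
}

// finally divide d_sumBuf by d_normBuf (small separate kernel) to get the output image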

There is one algorithmic problem I see: you need a __syncthreads() before you start writing to shared memory, otherwise some thread might finish its calculation and load the next value into shared memory before another thread got to read it. Of course, I guess that really isn’t a problem because each thread tx,ty only reads a single shared memory location: the same one it wrote at tx,ty.

Your code is clearly memory bound, and none of your memory accesses are coalesced. Fix that and your performance will improve by a factor of 20 or more. Using a 2D texture would also be an option, though full coalescing should be possible in your code.
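For the tile load in particular, something along these lines (untested sketch, reusing your blockRow/blockCol loop variables and assuming the image dimensions are multiples of the block size) lets consecutive threads read consecutive floats instead of each thread fetching its own strided pixel:

__shared__ float inPix[BLOCK_HEIGHT * BLOCK_WIDTH * 3];

int tid       = threadIdx.y * BLOCK_WIDTH + threadIdx.x;  // flat thread index within the block
int blockSize = BLOCK_WIDTH * BLOCK_HEIGHT;

__syncthreads();

// each tile row is BLOCK_WIDTH*3 consecutive floats in global memory;
// stride over them with the whole block so neighbouring threads read neighbouring addresses
for( int r = 0; r < BLOCK_HEIGHT; r++ )
{
    int rowBase = ( ( blockRow + r ) * inWidth + blockCol ) * 3;
    for( int i = tid; i < BLOCK_WIDTH * 3; i += blockSize )
        inPix[ r * BLOCK_WIDTH * 3 + i ] = in[ rowBase + i ];
}

__syncthreads();

// in the inner loops, index it as inPix[ ( row * BLOCK_WIDTH + col ) * 3 + channel ]

Because of the * 3 in the addressing, the reads can start at unaligned offsets, so on the very oldest hardware this is only partially coalesced; padding the rows or going through a texture would get you the rest of the way.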

Lastly, let me get the numbers straight: inWidth = inHeight = 1920 and outWidth = outHeight = 1080?
So 1080x1080 threads each read 1920x1920 pixels at 12 bytes each. Someone correct me if I’m wrong, but my calculator says this is 48,000 GB to read! At the full device bandwidth of 70 GB/s, you are looking at a ~700 s run time.

I would suggest a more clever use of shared memory where you read a block of the image into shared memory ONCE, then apply that entire piece of the image to all of the pixels handled by the threads of the block. This would reduce the number of global memory reads by a factor of the block size. With a 16x16 block, that would bring the kernel down to about 2.6 s. Still long, but much more manageable.

Coalesced memory makes sense. I’ll give that a shot. Actually, it probably makes much more sense for me to process this image with a 1D warp rather than 2D, right?

I calculated it out to around 25 MB on input (1920 x 1080 x 12 bytes) and 25 MB output.

It seems a bit intensive for a lens flare, but the goal is to generate the effect using all bright incoming pixels of an HDR image, rather than from a point light source. I’ll probably need to scale back the idea a bit.

Thanks,

Brian!

You probably need the 2d blocks because of the limitations on block indices. I’m not sure though. Other than these limitations, 2d indexing is really just a convenience.
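If you do go 1D, the mapping back to pixel coordinates is just arithmetic. A sketch (untested; flare_1d_d and THREADS_PER_BLOCK are made-up names and the per-pixel work is elided):

#define THREADS_PER_BLOCK 256

__global__ void flare_1d_d( const float * in, int inWidth, int inHeight,
                            float * out, int outWidth, int outHeight )
{
    int pixel = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    if( pixel >= outWidth * outHeight )
        return;

    int x = pixel % outWidth;   // recover the 2D coordinates from the flat index
    int y = pixel / outWidth;

    // ... same per-pixel flare accumulation as before, using x and y ...
}

// launch:
// int numPixels = outWidth * outHeight;
// int numBlocks = ( numPixels + THREADS_PER_BLOCK - 1 ) / THREADS_PER_BLOCK;   // 8100 blocks for 1920x1080
// flare_1d_d<<< numBlocks, THREADS_PER_BLOCK >>>( d_in, inWidth, inHeight, d_out, outWidth, outHeight );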

I don’t know too much about the lens flare effect or how you are implementing it; I just read your kernel code. You seem to have each thread handling a single pixel of the output image => 1920x1080 threads. Each thread then has a double for loop that reads every pixel from the input. So you are reading all 25 MB of the input 1920x1080 times. That is where I got the huge number from.

I dunno, but your code is making zero sense to me. E.g., why do you read a value into shared memory and then never use it again? You never use blockIdx, so each block ends up doing the same thing and writing the same result (to the same spot in memory). And what the hell is up with the undeclared variables ‘x’ and ‘y’ in “xIndex = x * 3; yIndex = y * outWidth * 3;”?

Don’t post code that doesn’t compile and ask us what’s wrong with it. Not cool.