Shared memory issue with large processing

Hey all,

I’ve been fighting this for a while, and have been unable to resolve a problem that seems linked to shared memory.

I am attempting to generate a lens flare effect using the entire image, so essentially I am trying to apply every pixel from my input image to every pixel in my output image. The code inside my __global__ function looks like this.

__global__
void apply_sun_flare_v01_d( float * in, int inWidth, int inHeight,
                            float * out, int outWidth, int outHeight )
{
    float sum[3] = { 0.0f, 0.0f, 0.0f };
    float norm = 0.0f;

    int x = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    int y = blockIdx.y * BLOCK_HEIGHT + threadIdx.y;

    for( int blockRow = 0; blockRow < inHeight; blockRow += BLOCK_HEIGHT )
    {
        int readY = blockRow + threadIdx.y;

        for( int blockCol = 0; blockCol < inWidth; blockCol += BLOCK_WIDTH )
        {
            int readX = blockCol + threadIdx.x;
            int index = ( readY * inWidth + readX ) * 3;

            __shared__ float inPix[BLOCK_HEIGHT][BLOCK_WIDTH][3];

            __syncthreads();

            // load the array of pixel colours ( R, G, B )
            inPix[threadIdx.y][threadIdx.x][0] = in[index+0];
            inPix[threadIdx.y][threadIdx.x][1] = in[index+1];
            inPix[threadIdx.y][threadIdx.x][2] = in[index+2];

            __syncthreads();

            // accumulate every pixel of this tile into the thread's output pixel,
            // weighted by its distance from ( x, y )
            for( int row = 0; row < BLOCK_HEIGHT; row++ )
            {
                for( int col = 0; col < BLOCK_WIDTH; col++ )
                {
                    float diffInOut;
                    diffInOut  = 1.0f - ( fabs( (float)( x - blockCol - col ) ) / (float)inWidth );
                    diffInOut *= 1.0f - ( fabs( (float)( y - blockRow - row ) ) / (float)inHeight );

                    sum[0] += inPix[row][col][0] * diffInOut;
                    sum[1] += inPix[row][col][1] * diffInOut;
                    sum[2] += inPix[row][col][2] * diffInOut;

                    norm += diffInOut;
                }
            }
            __syncthreads();
        }
    }

    // write the normalised result for this thread's output pixel
    out[(y * outWidth + x) * 3 + 0] = sum[0] / norm;
    out[(y * outWidth + x) * 3 + 1] = sum[1] / norm;
    out[(y * outWidth + x) * 3 + 2] = sum[2] / norm;
}

(I have updated the code so it uses shared memory and compiles.)

The input and output images are stored as RGB float arrays of size
inputSize = inWidth * inHeight * 3 * sizeof(float)
outputSize = outWidth * outHeight * 3 * sizeof(float)

Where the in/out width is currently 1920 and the in/out height is currently 1080.
BLOCK_HEIGHT and BLOCK_WIDTH are both 8, although I have also tried 16.

The code as it is shown right now will either return nothing but zeros in the output buffer, or completely lock up my system.

The code will run successfully if I either reduce the range of the for loops:

for( int inRow = 0; inRow < BLOCK_HEIGHT*2; inRow += BLOCK_HEIGHT )
for( int inCol = 0; inCol < BLOCK_WIDTH*2; inCol += BLOCK_WIDTH )

or simply set

int inRow = by * BLOCK_HEIGHT + ty;
int inCol = bx * BLOCK_WIDTH + tx;

Or if I remove the copy from input memory to shared memory and write dummy values instead:

inPix[ty][tx][0] = ty;
inPix[ty][tx][1] = tx;
inPix[ty][tx][2] = 0;

So, somehow the two simply aren’t meshing well with each other. If anyone has a suggestion on what I am doing wrong here I would really appreciate it.

Thanks,
Brian!

Ok… I have an idea about what is going wrong. I get the error message:

error code 6
the launch timed out and was terminated

Which means my real problem is most likely a built-in timeout. It seems that the program stops at around 10 seconds. So how do I increase the limit… and do I want to?

Brian!

Maybe it would be a better idea to make it multi-pass, or improve the algorithm. 10 s seems too long for a lens flare; you could use a parallel scan or something.

I haven’t checked the code, but the time limit is 5 seconds, not 10.

Split a kernel that runs longer than 5 seconds into many smaller launches.
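Roughly something like this (untested sketch, just to illustrate the idea; ROWS_PER_LAUNCH, d_sumBuf and d_normBuf are made-up names, and the shared-memory tiling is left out to keep it short): each launch only walks a strip of the input rows and accumulates into running sum/weight buffers, then you normalise at the end.

__global__ void flare_partial_d( const float * in, int inWidth, int inHeight,
                                 int rowStart, int rowEnd,
                                 float * sumBuf, float * normBuf,
                                 int outWidth, int outHeight )
{
    int x = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    int y = blockIdx.y * BLOCK_HEIGHT + threadIdx.y;
    if( x >= outWidth || y >= outHeight )
        return;

    float sum[3] = { 0.0f, 0.0f, 0.0f };
    float norm = 0.0f;

    // only the input rows [rowStart, rowEnd) are visited in this launch
    for( int row = rowStart; row < rowEnd; row++ )
    {
        for( int col = 0; col < inWidth; col++ )
        {
            float w  = 1.0f - fabsf( (float)( x - col ) ) / (float)inWidth;
                  w *= 1.0f - fabsf( (float)( y - row ) ) / (float)inHeight;

            int i = ( row * inWidth + col ) * 3;
            sum[0] += in[i+0] * w;
            sum[1] += in[i+1] * w;
            sum[2] += in[i+2] * w;
            norm   += w;
        }
    }

    // accumulate the partial results across launches
    // ( sumBuf and normBuf must be cudaMemset to zero before the first launch )
    int o = y * outWidth + x;
    sumBuf[o*3+0] += sum[0];
    sumBuf[o*3+1] += sum[1];
    sumBuf[o*3+2] += sum[2];
    normBuf[o]    += norm;
}

Host side:

dim3 block( BLOCK_WIDTH, BLOCK_HEIGHT );
dim3 grid( outWidth / BLOCK_WIDTH, outHeight / BLOCK_HEIGHT );

for( int rowStart = 0; rowStart < inHeight; rowStart += ROWS_PER_LAUNCH )
{
    int rowEnd = rowStart + ROWS_PER_LAUNCH;
    if( rowEnd > inHeight )
        rowEnd = inHeight;

    flare_partial_d<<< grid, block >>>( d_in, inWidth, inHeight,
                                        rowStart, rowEnd,
                                        d_sumBuf, d_normBuf,
                                        outWidth, outHeight );
    cudaThreadSynchronize();   // each launch finishes well under the watchdog limit
}

// finally divide d_sumBuf by d_normBuf (small separate kernel) to get the output image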

There is one algorithmic problem I see: you need a __syncthreads() before you start writing to shared memory, otherwise some thread might finish its calculation and load the next value into shared memory before another thread got to read it. Of course, I guess that really isn’t a problem because each thread tx,ty only reads a single shared memory location: the same one it wrote at tx,ty.

Your code is clearly memory bound, and none of your memory accesses are coalesced. Fix that and your performance will improve by a factor of 20 or more. Using a 2D texture would also be an option, though full coalescing should be possible in your code.
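For the tile load in particular, something along these lines (untested sketch, reusing your blockRow/blockCol loop variables and assuming the image dimensions are multiples of the block size) lets consecutive threads read consecutive floats instead of each thread fetching its own strided pixel:

__shared__ float inPix[BLOCK_HEIGHT * BLOCK_WIDTH * 3];

int tid       = threadIdx.y * BLOCK_WIDTH + threadIdx.x;  // flat thread index within the block
int blockSize = BLOCK_WIDTH * BLOCK_HEIGHT;

__syncthreads();

// each tile row is BLOCK_WIDTH*3 consecutive floats in global memory;
// stride over them with the whole block so neighbouring threads read neighbouring addresses
for( int r = 0; r < BLOCK_HEIGHT; r++ )
{
    int rowBase = ( ( blockRow + r ) * inWidth + blockCol ) * 3;
    for( int i = tid; i < BLOCK_WIDTH * 3; i += blockSize )
        inPix[ r * BLOCK_WIDTH * 3 + i ] = in[ rowBase + i ];
}

__syncthreads();

// in the inner loops, index it as inPix[ ( row * BLOCK_WIDTH + col ) * 3 + channel ]

Because of the * 3 in the addressing, the reads can start at unaligned offsets, so on the very oldest hardware this is only partially coalesced; padding the rows or going through a texture would get you the rest of the way.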

Lastly, let me get the numbers straight: inWidth = inHeight = 1920 and outWidth = outHeight = 1080?
So 1080x1080 threads each read 1920x1920 pixels at 12 bytes each. Someone correct me if I’m wrong, but my calculator says this is 48,000 GB to read! At the full device bandwidth of 70 GB/s, you are looking at a ~700 s run time.

I would suggest a more clever use of shared memory where you read a block of the image into shared memory ONCE, then apply that entire piece of the image to all of the pixels handled by the threads of the block. This would reduce the number of global memory reads by a factor of the block size. With a 16x16 block, that would bring the kernel down to about 2.6 s. Still long, but much more manageable.

Coalesced memory makes sense. I’ll give that a shot. Actually, it probably makes much more sense for me to process this image with a 1D warp rather than 2D, right?

I calculated it out to around 25 MB on input (1920 x 1080 x 12 bytes) and 25 MB output.

It seems a bit intensive for a lens flare, but the goal is to generate the effect using all bright incoming pixels of an HDR image, rather than from a point light source. I’ll probably need to scale back the idea a bit.

Thanks,

Brian!

You probably need the 2d blocks because of the limitations on block indices. I’m not sure though. Other than these limitations, 2d indexing is really just a convenience.
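If you do go 1D, the mapping back to pixel coordinates is just arithmetic. A sketch (untested; flare_1d_d and THREADS_PER_BLOCK are made-up names and the per-pixel work is elided):

#define THREADS_PER_BLOCK 256

__global__ void flare_1d_d( const float * in, int inWidth, int inHeight,
                            float * out, int outWidth, int outHeight )
{
    int pixel = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    if( pixel >= outWidth * outHeight )
        return;

    int x = pixel % outWidth;   // recover the 2D coordinates from the flat index
    int y = pixel / outWidth;

    // ... same per-pixel flare accumulation as before, using x and y ...
}

// launch:
// int numPixels = outWidth * outHeight;
// int numBlocks = ( numPixels + THREADS_PER_BLOCK - 1 ) / THREADS_PER_BLOCK;   // 8100 blocks for 1920x1080
// flare_1d_d<<< numBlocks, THREADS_PER_BLOCK >>>( d_in, inWidth, inHeight, d_out, outWidth, outHeight );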

I don’t know too much about the lens flare effect or how you are implementing it; I just read your kernel code. You seem to have each thread handling a single pixel of the output image => 1920x1080 threads. Each thread then has a double for loop that reads every pixel from the input. So you are reading all 25 MB of the input 1920x1080 times. That is where I got the huge number from.

I dunno, but your code is making zero sense to me. E.g., why do you read a value into shared memory and then never use it again? You never use blockIdx, so each block ends up doing the same thing and writing the same result (to the same spot in memory). And what the hell is up with the undeclared variables ‘x’ and ‘y’ in “xIndex = x * 3; yIndex = y * outWidth * 3;”?

Don’t post code that doesn’t compile and ask us what’s wrong with it. Not cool.