Hey all,
I’ve been fighting this for a while and have been unable to resolve a problem that seems to be linked to shared memory.
I am attempting to generate a lens flare effect using the entire image, so essentially I am trying to apply every pixel from my input image to every pixel in my output image. The code inside my __global__ function looks like this:
__global__
void apply_sun_flare_v01_d( float * in, int inWidth, int inHeight,
                            float * out, int outWidth, int outHeight )
{
    float sum[3] = { 0.0f, 0.0f, 0.0f };
    float norm = 0.0f;
    // output pixel handled by this thread
    int x = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    int y = blockIdx.y * BLOCK_HEIGHT + threadIdx.y;
    // walk over the whole input image, one BLOCK_HEIGHT x BLOCK_WIDTH tile at a time
    for( int blockRow = 0; blockRow < inHeight; blockRow += BLOCK_HEIGHT )
    {
        int readY = blockRow + threadIdx.y;
        for( int blockCol = 0; blockCol < inWidth; blockCol += BLOCK_WIDTH )
        {
            // column offset uses threadIdx.x (was threadIdx.y)
            int readX = blockCol + threadIdx.x;
            int index = ( readY * inWidth + readX ) * 3;
            __shared__ float inPix[BLOCK_HEIGHT][BLOCK_WIDTH][3];
            __syncthreads();
            // load the array of pixel colours ( R, G, B )
            inPix[threadIdx.y][threadIdx.x][0] = in[index+0];
            inPix[threadIdx.y][threadIdx.x][1] = in[index+1];
            inPix[threadIdx.y][threadIdx.x][2] = in[index+2];
            __syncthreads();
            // accumulate every pixel of the tile, weighted by distance to ( x, y )
            for( int row = 0; row < BLOCK_HEIGHT; row++ )
            {
                for( int col = 0; col < BLOCK_WIDTH; col++ )
                {
                    float diffInOut;
                    diffInOut = 1.0f - ( fabs( (float)( x - blockCol - col ) ) / (float)inWidth );
                    diffInOut *= 1.0f - ( fabs( (float)( y - blockRow - row ) ) / (float)inHeight );
                    sum[0] += inPix[row][col][0] * diffInOut;
                    sum[1] += inPix[row][col][1] * diffInOut;
                    sum[2] += inPix[row][col][2] * diffInOut;
                    norm += diffInOut;
                }
            }
            __syncthreads();
        }
    }
    // write the normalised weighted average
    out[(y * outWidth + x) * 3 + 0] = sum[0] / norm;
    out[(y * outWidth + x) * 3 + 1] = sum[1] / norm;
    out[(y * outWidth + x) * 3 + 2] = sum[2] / norm;
}
(I have updated the code so that it uses shared memory and compiles.)
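A slow CPU reference can be handy for checking the kernel's output on a tiny image. The sketch below is one reading of the kernel above, not code from the original project: `flare_reference` is a hypothetical name, it assumes the input and output images are the same size, and it drops the shared-memory tiling because the result should not depend on it.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: same weights as the kernel, no tiling.
// `in` holds inW * inH RGB float triples; the result holds outW * outH triples.
// Assumes outW <= inW and outH <= inH so every weight stays positive.
std::vector<float> flare_reference(const std::vector<float>& in, int inW, int inH,
                                   int outW, int outH) {
    std::vector<float> out(static_cast<std::size_t>(outW) * outH * 3, 0.0f);
    for (int y = 0; y < outH; ++y) {
        for (int x = 0; x < outW; ++x) {
            float sum[3] = {0.0f, 0.0f, 0.0f};
            float norm = 0.0f;
            for (int row = 0; row < inH; ++row) {
                for (int col = 0; col < inW; ++col) {
                    // Linear falloff in |dx| and |dy|, as in the kernel.
                    float w = 1.0f - std::fabs(static_cast<float>(x - col)) / static_cast<float>(inW);
                    w *= 1.0f - std::fabs(static_cast<float>(y - row)) / static_cast<float>(inH);
                    const std::size_t src = (static_cast<std::size_t>(row) * inW + col) * 3;
                    sum[0] += in[src + 0] * w;
                    sum[1] += in[src + 1] * w;
                    sum[2] += in[src + 2] * w;
                    norm += w;
                }
            }
            const std::size_t dst = (static_cast<std::size_t>(y) * outW + x) * 3;
            out[dst + 0] = sum[0] / norm;
            out[dst + 1] = sum[1] / norm;
            out[dst + 2] = sum[2] / norm;
        }
    }
    return out;
}
```

Because each output pixel is a normalised weighted average, a uniformly coloured input should come back unchanged, which makes a quick sanity check for both this function and the kernel. It is only practical for very small images, since the cost grows with the square of the pixel count.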
The input and output images are stored as RGB arrays of size
inputSize = inWidth * inHeight * 3 * sizeof(float)
outputSize = inWidth * inHeight * 3 * sizeof(float)
The input and output widths are currently both 1920, and the heights are both 1080.
BLOCK_HEIGHT and BLOCK_WIDTH are both 8, although I have also tried 16.
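For reference, the arithmetic behind those sizes can be sketched on the host side. The helper names below (`blocks_needed`, `needs_bounds_guard`, `rgb_buffer_bytes`) are hypothetical and not part of the original code; the sketch only assumes the grid is sized by rounding up, which the post does not actually state.

```cpp
#include <cstddef>

// Hypothetical helper: blocks needed to cover `size` pixels, rounding up.
constexpr int blocks_needed(int size, int block) {
    return (size + block - 1) / block;
}

// Hypothetical helper: true when the rounded-up grid has threads past the
// image edge, so indices derived from threadIdx can fall outside the image.
constexpr bool needs_bounds_guard(int size, int block) {
    return blocks_needed(size, block) * block != size;
}

// Bytes in one RGB float image, matching inputSize / outputSize above.
constexpr std::size_t rgb_buffer_bytes(int width, int height) {
    return static_cast<std::size_t>(width) * height * 3 * sizeof(float);
}
```

With a block size of 8, both 1920 and 1080 divide evenly (240 by 135 blocks). With 16, the height does not: 68 blocks cover 1088 rows, so the tile loop's `blockRow + threadIdx.y` can reach rows past 1079 and read beyond the end of the input buffer unless the kernel checks its indices.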
The code as shown right now will either return nothing but zeros in the output buffer or completely lock up my system.
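It may help to put a number on how much work the full loops ask for in a single launch. The sketch below uses a hypothetical helper, `total_inner_iterations`, which just multiplies the number of output pixels by the number of input pixels each of them visits.

```cpp
#include <cstdint>

// Hypothetical helper: innermost-loop iterations for one full kernel launch,
// i.e. (output pixels) x (input pixels visited per output pixel).
constexpr std::uint64_t total_inner_iterations(int inW, int inH, int outW, int outH) {
    return static_cast<std::uint64_t>(inW) * static_cast<std::uint64_t>(inH)
         * static_cast<std::uint64_t>(outW) * static_cast<std::uint64_t>(outH);
}
```

For a 1920 x 1080 input and output this is 4,299,816,960,000 iterations, roughly 4.3e12. One possible explanation for the symptoms, not confirmed here, is that a launch this long exceeds the display driver's watchdog limit on a GPU that also drives the screen; checking the error returned after the launch would be one way to find out.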
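The code will run successfully if I reduce the number of loop iterations (these snippets still use the variable names from my earlier version of the code, with bx, by, tx and ty short for blockIdx.x, blockIdx.y, threadIdx.x and threadIdx.y):
for( int inRow = 0; inRow < BLOCK_HEIGHT * 2; inRow += BLOCK_HEIGHT )
for( int inCol = 0; inCol < BLOCK_WIDTH * 2; inCol += BLOCK_WIDTH )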
or simply set
int inRow = by * BLOCK_HEIGHT + ty;
int inCol = bx * BLOCK_WIDTH + tx;
or if I remove the copy from the input buffer into shared memory and fill the shared array with dummy values instead:
inPix[ty][tx][0] = ty;
inPix[ty][tx][1] = tx;
inPix[ty][tx][2] = 0;
So somehow the full-image loops and the copy from the input buffer into shared memory aren’t meshing well with each other. If anyone has a suggestion about what I am doing wrong here, I would really appreciate it.
Thanks,
Brian!