I am currently working at running a Sobel filter on the GPU for use in a larger algorithm. I’m currently trying to write the result of the Sobel filter to a global array to verify that I have this stage working, but I’m running into some problems I can’t understand. The code I’m working with is below:
// Indexes that make thread references easier.
#define GLOBAL_X (blockDim.x*blockIdx.x + threadIdx.x)
#define GLOBAL_Y (blockDim.y*blockIdx.y + threadIdx.y)
// Pixel fetching placeholders.
#define PIX00 tex2D(tex, GLOBAL_X - 1, GLOBAL_Y - 1)
#define PIX01 tex2D(tex, GLOBAL_X + 0, GLOBAL_Y - 1)
#define PIX02 tex2D(tex, GLOBAL_X + 1, GLOBAL_Y - 1)
#define PIX10 tex2D(tex, GLOBAL_X - 1, GLOBAL_Y + 0)
#define PIX11 tex2D(tex, GLOBAL_X + 0, GLOBAL_Y + 0)
#define PIX12 tex2D(tex, GLOBAL_X + 1, GLOBAL_Y + 0)
#define PIX20 tex2D(tex, GLOBAL_X - 1, GLOBAL_Y + 1)
#define PIX21 tex2D(tex, GLOBAL_X + 0, GLOBAL_Y + 1)
#define PIX22 tex2D(tex, GLOBAL_X + 1, GLOBAL_Y + 1)
__global__ void
edge_kernel(unsigned char *result, unsigned char *absolute, int w, int h,
float fScale)
{
short Horz, Vert, Sum;
if(GLOBAL_X<w && GLOBAL_Y<h)
{
Horz = PIX02 + 2*PIX12 + PIX22 - PIX00 - 2*PIX10 - PIX20;
Vert = PIX00 + 2*PIX01 + PIX02 - PIX20 - 2*PIX21 - PIX22;
Sum = (short) (fScale*(abs((int)Horz)+abs((int)Vert)));
if(Sum<0)
{
absolute[w*GLOBAL_Y + GLOBAL_X] = 0x00;
} else if(Sum>0xff)
{
absolute[w*GLOBAL_Y + GLOBAL_X] = 0xff;
} else
{
absolute[w*GLOBAL_Y + GLOBAL_X] = (unsigned char) Sum;
}
}
result[w*GLOBAL_Y + GLOBAL_X] = absolute[w*GLOBAL_Y + GLOBAL_X];
}
Originally I had the line writing to the result array within the outermost conditional, but when I copied the array back to the host it would contain only zeros. After discovering this, I made plans to remove the conditional entirely (it’s only there now to make sure I don’t write out of the image bounds, which I can also avoid by padding the image dimensions to multiples of my block dimensions). However, if I remove the outermost conditional around the remaining code, I get the same problem of only zeros being copied to the return array. I have tried throwing in __syncthreads() before the result assignment, but that didn’t work. After fiddling around with the code, I have come to realize that zeros are written to the result array whenever the register count per thread exceeds 16. My block dimensions are 32x16, which limits me to 16 registers per thread without going to local memory since the GeForce 9500M GS I’m running this on only has 8192 registers available per block. Strangely, local memory doesn’t seem to be a very good explanation either, since when I decreased my block dimensions to 16x16 the same problem occurred).
When I examine the ptxas output, this is what I get:
Compiling including the outermost conditional:
2>ptxas info : Compiling entry function '_Z10lpf_kernelPhS_iij' for 'sm_10'
2>ptxas info : Used 0 registers, 20+16 bytes smem
2>ptxas info : Compiling entry function '_Z11edge_kernelPhS_iijf' for 'sm_10'
2>ptxas info : Used 16 registers, 24+16 bytes smem, 8 bytes cmem[1]
Compiling after removing the outermost conditional:
2>ptxas info : Compiling entry function '_Z10lpf_kernelPhS_iij' for 'sm_10'
2>ptxas info : Used 0 registers, 20+16 bytes smem
2>ptxas info : Compiling entry function '_Z11edge_kernelPhS_iijf' for 'sm_10'
2>ptxas info : Used 18 registers, 24+16 bytes smem, 8 bytes cmem[1]
My questions are, why does this arrangement of statements inside and outside the outermost conditional change my register count, and why are my results changing when the register count goes above 16 even when I ought to be able to support 32 registers per thread with block dimensions of 16x16?
I am running Windows 7 64-bit with Visual Studio 2008, using the CUDA SDK/Toolkit 4.0 and compiling using the Debug Win32 configuration. As I mentioned before, my card is the GeForce 9500M GS.
If you need any more info to help me with the problem let me know.
Thanks.