Newbish shared memory problem

I am trying to learn how to use shared memory for image filtering operations (e.g. Sobel 3x3), and i am having some trouble. My kernel keeps exiting on from what I can tell is an access violation (this is in EmuDebug mode), but the compiler fails to break where the exception was thrown. It gets well into a few thread blocks worth of execution before it crashes. The input is a 1024x768 floating point mono image. How do i get the compiler to break on and access violation exception in a kernel? Can anyone see some fundamental thing wrong with this code that could cause this?

#define BLOCK_SIZE 16

texture<float, 2, cudaReadModeElementType> texData;

#define IDX(p,x,y) (IMUL(p,y) + x)

global void conv_sobel_sharedmem_dx_f(float* d_data, int d_pitch, int width, int height)
//x y adress within the image corresponding to the thread
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x - 1;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y - 1;
//texture coord into the image
const float itx = (float)ix + 0.5f;
const float ity = (float)iy + 0.5f;

if(ix >= width || iy >= height)

//shared memory for the entire image bloack and the apron
__shared__ float shared_data[BLOCK_SIZE * BLOCK_SIZE];

//just for clarity
const int spitch = blockDim.x;
const int sx = threadIdx.x;
const int sy = threadIdx.y;

//load into shared memory
shared_data[IDX(spitch,sx,sy)] = tex2D(texData,itx,ity);


if((threadIdx.x == 0) || (threadIdx.x == blockDim.x-1) || 
   (threadIdx.y == 0) || (threadIdx.y == blockDim.y-1))

d_data[d_pitch*iy+ix] = (shared_data[IDX(spitch,sx-1,sy-1)] * -1.0 + 
	                shared_data[IDX(spitch,sx  ,sy-1)] *  1.0 + 
			shared_data[IDX(spitch,sx+1,sy-1)] * -2.0 +
			shared_data[IDX(spitch,sx-1,sy+1)] *  2.0 +
	                shared_data[IDX(spitch,sx  ,sy+1)] * -1.0 +
			shared_data[IDX(spitch,sx+1,sy+1)] *  1.0)/9.0f;


gvECode gvSobelImageSharedMem_f(gvImage* pDest, gvImage* pSrc, int dir)
gvECode ret;



//image block is 14x14 for each thread block of 16x16
int2 grid_size;
grid_size.x = iDivUp(pDest->width, (BLOCK_SIZE - 2));
grid_size.y = iDivUp(pDest->height, (BLOCK_SIZE - 2));

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(grid_size.x, grid_size.y);

conv_sobel_sharedmem_dx_f<<<grid,threads>>>((float*)(pDest->pixels), pDest->pitch, pDest->width, pDest->height);

// check if kernel execution generated and error
	CUT_CHECK_ERROR("Kernel execution failed");

CUDA_SAFE_CALL( cudaThreadSynchronize() );

return GV_SUCCESS;


:( :(

Ah ha, you can’t call __syncthreads in any kind of branch! Moving the first if to after the sync seems to fix the problem. The function still doesn’t work but at least it doesn’t crash.

From what I’ve seen while debugging my code (always EmuDebug mode, my devel machine doesn’t even have a GeForce inside),

when in debug mode, the threads are executed sequentially :

  • The application creates 1 thread for the main C/C++ code, and then 1 thread for each member of a group.

  • First thread is entirely ran until it reaches a “__synchtread();”, then the second, etc…

  • Once all the thread have reached that limit, the next chunk of code is again executed sequentially.

So setting breakpoints at the various “__synchthread();”

  • The grid is walked serially.

Thus, the “__syncthreads” are a good place to put your break point, so at least you know during which phase your code breaks (loading into shared or processing the data).

Also, the CUDA code is pasted as-is in the CPU debug code. Thus, unlike real device code, with debug enabled you can call any C function you like from your code.

Thus you could put

if (0 == threadIdx.x) printf("some status info : %d %d", etc....

all over your code and follow it’s execution.

Sorry, it’s not very clear from the context :

“pixels” is a pointer to a 2D device memory block allocated with “cudaMallocPitch” ?

From what I’ve understood in the manual, that’s the only type of memory you can write to.

Texture write (BrookGPU kernels style) aren’t available in CUDA yet.

Also, as long as you cache the data on the shared memory, I don’t think it brings any further acceleration to further use the cached texture look-up units. I would recommend storing the data as similarily cudaMallocPitch’ed memory buffers.

I tried that here, and i was having problems.