threads and blocks and dcts: newbie having problem with simple idct

hi guys,
I’m a total newbie with CUDA, working on my first CUDA program, and I’ve got lots of problems. I’ve been working with the sample applications, and I’ve successfully integrated some of the small samples into my apps for practice, but my first “real” app is being somewhat troublesome:

I’m trying to write a simple iDCT calculator for integration with an open source JPEG library. I’ve got two 8800 GTX cards.
My first stab just decodes line by line: each thread does a single iDCT, and I have 160 iDCTs per line. So (after malloc’ing and such) I do the following:

dim3 grid( 1, 1, 1);
dim3 threads( NumDCTBlocks, 1, 1);
Cuda_iDCT8x8_kernel<<<grid, threads>>>(Srcd,Dstd,dctsizebytes);

where NumDCTBlocks = 160.

in my kernel, I try to calculate the pointer to the dct the current thread will use and calculate a single idct like so:

const int tid = threadIdx.x;
const int bid = blockIdx.x;
const int dim = blockDim.x;


idct(g_idata+(tid*dctsizebytes), (unsigned char *)g_odata+(tid*dctsizebytes));

__syncthreads();

where g_idata is the input buffer for the line, and g_odata is the output buffer for the line.
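
One thing worth double-checking in the call above: C pointer arithmetic scales by the element size, so `g_idata + tid*dctsizebytes` only advances by bytes if `g_idata` is a byte pointer, whereas the output side is explicitly cast to `unsigned char *` first. The post does not show the type of `g_idata`, so the following is only a minimal host-side sketch of the difference, assuming a hypothetical `Coef` element type of `short`; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical coefficient type; the real element type of g_idata is not
// shown in the post, so `short` is an assumption for illustration only.
typedef short Coef;

// Byte offset of the DCT block that thread `tid` should work on.
std::size_t byteOffset(int tid, std::size_t dctsizebytes) {
    assert(tid >= 0);
    return static_cast<std::size_t>(tid) * dctsizebytes;
}

// Advance a typed pointer by a byte count: cast to a byte pointer first,
// add the byte offset, then cast back.
const Coef* inputBlock(const Coef* g_idata, int tid, std::size_t dctsizebytes) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(g_idata);
    return reinterpret_cast<const Coef*>(bytes + byteOffset(tid, dctsizebytes));
}

// Bytes actually skipped by `g_idata + tid*dctsizebytes` when g_idata is a
// Coef*: the offset is counted in elements, so it is sizeof(Coef) times larger.
std::size_t typedPointerByteOffset(int tid, std::size_t dctsizebytes) {
    return byteOffset(tid, dctsizebytes) * sizeof(Coef);
}
```

If the two offsets differ for the real types involved, threads with larger `tid` would read or write past the end of the line buffer, which might be one thing to rule out before looking elsewhere.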

SO:

What happens is that the buffer gets written to, but the next time I try to set up for the next DCT, I get a crash, specifically when I call the Huffman decode for the second MCU block of the next line.

Also, when I compile in debug emulation mode, it hangs after launching 31 threads, at the call to __syncthreads();

note that I’m not going for optimal here yet, just functional…

HELP?

am I doing something dumb here?
(note: I would not be surprised!)
do I need to break it down into blocks?

The deviceQuery sample reports the max threads per block as 512, so shouldn’t I be OK with 160?
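
For what it is worth, 160 threads is within a 512-thread-per-block limit, so the thread count alone should not force a split into blocks. If the work were split anyway, the index arithmetic would look roughly like the host-side sketch below. It is plain C++ that mirrors what a kernel would compute from `blockIdx.x`, `blockDim.x` and `threadIdx.x`; the function names are hypothetical, and any block size other than the 160 in the post is an arbitrary example.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= numItems
// (integer ceiling division).
int blocksNeeded(int numItems, int threadsPerBlock) {
    assert(numItems >= 0 && threadsPerBlock > 0);
    return (numItems + threadsPerBlock - 1) / threadsPerBlock;
}

// Global item index, as a kernel would compute it with
// blockIdx.x * blockDim.x + threadIdx.x.
int globalIndex(int blockIdxX, int blockDimX, int threadIdxX) {
    return blockIdxX * blockDimX + threadIdxX;
}

// True if this thread has a DCT to process; guards the last, partial block.
bool hasWork(int globalIdx, int numItems) {
    return globalIdx < numItems;
}
```

With 160 DCTs and, say, 64 threads per block, this gives 3 blocks, and the last block has 32 threads with no work, which is why the `hasWork`-style bounds check matters in that layout.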

any help you all could give me would make me a happy guy!
thanks !
-thomas