Output correct whilst debugging, but incorrect when not

Hi guys, I have a weird problem. My output array contains incorrect data when I run the program normally, but when I step through it in the debugger it contains the correct data, matching what I get from a CPU program that runs the same algorithm.

The images show what I mean: leftImageD is the output of the GPU kernel, and leftImageDd is what I get when I step through in the debugger or run the separate CPU program.

Also, as you can see, the depth information doesn’t cover the whole image, because I get memory errors when I try to process the whole image. So, as a secondary problem, I think I need some help with performance too!

I’m running Vista and using an 8600M GS with 256 MB of RAM.

The Visual Studio folder is attached (I use 2005).

Many thanks for any insight you guys can offer!

Paul
CUDA_TEST.rar (629 KB)

I’ve managed to get it returning correct values, by removing the shared memory usage and reading straight from global memory.

However, the actual kernel still takes too long to run. Would anyone have any performance tips?

Use shared memory, and don’t forget to call __syncthreads() after the threads write to it and before any thread reads it back (I think your problem could be there).
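To illustrate the __syncthreads() point, here is a minimal sketch of the usual pattern. It is not taken from the attached project: the kernel sum3, the helper sum3_on_gpu, and the BLOCK and RADIUS values are all made up for the example. Each block copies a tile of the input (plus one neighbour on each side) into shared memory, waits at the barrier, and only then reads the tile. Without the barrier, a thread can read a slot that another thread has not written yet. That kind of race often does not show up while single-stepping, which would be consistent with the output looking correct only in the debugger, though I can't confirm that is what happens in your code.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK  128  // threads per block (assumed value for this sketch)
#define RADIUS 1    // neighbours needed on each side of an element

// Hypothetical kernel: out[i] = in[i-1] + in[i] + in[i+1], with zeros outside the array.
// It must be launched with blockDim.x == BLOCK.
__global__ void sum3(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Every thread stages one element; out-of-range slots become zero.
    tile[l] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also load the halo cells on both sides.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[l + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }

    // Barrier: no thread reads the tile until every thread in the block has
    // finished writing. It sits outside the branches so all threads reach it.
    __syncthreads();

    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}

// Hypothetical host helper: copies the data over, runs the kernel, copies back.
// Error checking is left out to keep the sketch short.
std::vector<float> sum3_on_gpu(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n, 0.0f);
    if (n == 0) return h_out;

    size_t bytes = n * sizeof(float);
    float* d_in  = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;  // round up to cover all elements
    sum3<<<blocks, BLOCK>>>(d_in, d_out, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

On the performance side, the benefit of this layout is that each input element is read from global memory about once per block instead of three times. In real code it is also worth checking the value returned by cudaGetLastError() after the launch, since a failed launch otherwise goes unnoticed.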