Compilation optimization

Hey gentle people again,

Currently I have a working project (a video decoder). It’s a C++/CUDA integration project using OpenGL<->CUDA interop. I don’t read or write outside the buffers, and it calls 3 kernels for every frame.
It shows the complete video frame when I run in deviceemu, but when I switch to release settings, it seems to me that cudaThreadSynchronize(); gets ignored when compiling: roughly the top 75% of the frame is correct and the rest gets incorrect values.
After running 30-40 frames I get a kernel launch failure (timeout). I think it queued too many kernels, and some still need to complete while the rest of my program is already writing new information into one of the buffers? After each kernel invocation I added cudaThreadSynchronize(); (in the main CUDA file, the file that calls the kernels…)
This is why I think cudaThreadSynchronize gets compiled out. Could that be the case or not?

Anyway, here is the command line:
"(CUDA_BIN_PATH)\nvcc.exe" -ccbin "(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"(CUDA_INC_PATH)" -I../include/ -I../src/ -I../../../libs/baseDecoder/include/ -o (ConfigurationName)\cuda.obj

Any thoughts?
Thx in advance and best regards,

P.S.:
System setup: Intel Core2 CPU 6300 @ 1.86 GHz
1 GB RAM

No, cudaThreadSynchronize() is not omitted when compiling a Release build. You probably have an error somewhere else.
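One way to narrow down where the error is would be to check the return codes after every launch instead of calling the synchronize function and discarding its result. Below is a minimal sketch, not code from the project: the KernelA signature, the buffer size and the check() helper are all made up for illustration. It uses cudaDeviceSynchronize(), which is the name later toolkits use for cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the decoder kernels (signature is assumed).
__global__ void KernelA(const unsigned char* src, unsigned char* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Hypothetical helper: prints the error with a label, returns true on success.
static bool check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;  // assumed buffer size, in bytes
    unsigned char *src = 0, *dst = 0;
    if (!check(cudaMalloc((void**)&src, n), "cudaMalloc src")) return 1;
    if (!check(cudaMalloc((void**)&dst, n), "cudaMalloc dst")) return 1;
    if (!check(cudaMemset(src, 0, n), "cudaMemset src")) return 1;

    const int threads = 256;
    const int grid = (n + threads - 1) / threads;
    KernelA<<<grid, threads>>>(src, dst, n);

    // Errors detected at launch time (e.g. an invalid configuration).
    bool ok = check(cudaGetLastError(), "KernelA launch");
    // Errors that occur while the kernel runs (e.g. a timeout) show up here.
    ok = check(cudaDeviceSynchronize(), "KernelA execution") && ok;

    cudaFree(src);
    cudaFree(dst);
    return ok ? 0 : 1;
}
```

If the first failing call is reported right after a particular kernel, that kernel (or the buffers passed to it) is the place to look.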

And I should put it around the kernel calls, e.g.:
KernelA<<<grid, threads>>>(m_video, dst);

You really only need to call cudaThreadSynchronize() explicitly in order to time benchmarks.
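The reason timing needs it is that a kernel launch returns to the host before the kernel has finished, so a host timer stopped right after the launch only measures the cost of queuing it. A rough sketch of the difference, assuming a made-up scale() kernel and a hypothetical timeLaunchMs() helper (cudaDeviceSynchronize() is the newer name for cudaThreadSynchronize()):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, only here to give the GPU some work.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: wall-clock milliseconds for one launch of scale().
// With waitForGpu == false the clock stops as soon as the launch is queued.
static double timeLaunchMs(float* data, int n, bool waitForGpu)
{
    const int threads = 256;
    const int grid = (n + threads - 1) / threads;

    auto t0 = std::chrono::steady_clock::now();
    scale<<<grid, threads>>>(data, n, 2.0f);
    if (waitForGpu) cudaDeviceSynchronize();  // block until the kernel is done
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 1 << 22;  // assumed element count
    float* data = 0;
    if (cudaMalloc((void**)&data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(data, 0, n * sizeof(float));

    timeLaunchMs(data, n, true);  // warm-up run, result discarded
    double queued = timeLaunchMs(data, n, false);
    cudaDeviceSynchronize();      // drain the queue before the next measurement
    double finished = timeLaunchMs(data, n, true);

    printf("launch only: %.3f ms, launch + execution: %.3f ms\n", queued, finished);
    cudaFree(data);
    return 0;
}
```

The exact numbers depend on the card and driver; the point is only that the first figure does not include the kernel's execution time.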

When you fill the kernel execution queue, there is an implicit sync. If you copy memory host<->device (non-async), there is an implicit sync. I’m assuming that the OpenGL interop commands either are queued or also imply an implicit sync.
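The implicit sync on a blocking copy can be seen in a small test like the one below. It is only a sketch: the fill() kernel, the expectedValue() helper and the buffer size are invented for the example. The kernel is launched and the result is copied back with plain cudaMemcpy, with no explicit synchronize call in between.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: the value element i should hold after the kernel ran.
__host__ __device__ inline int expectedValue(int i) { return 2 * i; }

// Hypothetical kernel that writes expectedValue(i) into every element.
__global__ void fill(int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = expectedValue(i);
}

int main()
{
    const int n = 1 << 20;  // assumed element count
    int* dev = 0;
    if (cudaMalloc((void**)&dev, n * sizeof(int)) != cudaSuccess) return 1;

    fill<<<(n + 255) / 256, 256>>>(dev, n);

    // No explicit synchronize here: the blocking copy waits until the
    // previously queued kernel has finished before it reads the buffer.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Count elements that do not hold the value the kernel should have written.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != expectedValue(i)) ++mismatches;

    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

This covers only the plain (non-async) copy; it says nothing about how the OpenGL interop calls behave.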