cudaThreadSynchronize() does not make the CPU wait

Environment: Windows 7, VS2008, Qt 4.5, CUDA 2.3


I have a C++ program in which a CUDA kernel is launched inside a for loop (at least, the launch is initiated there). The result of the launch is a rendered image, which should be saved within the loop (in my C++ program) after rendering. So one iteration of the for loop sets a parameter that determines how the image is rendered, renders the image, and saves the resulting image as soon as the kernel has finished the render task. The rendering itself is done by an external program that renders using CUDA.

I read that the CUDA function cudaThreadSynchronize() should help in this case, since it makes the CPU wait until my previously launched CUDA kernel finishes.

So I wrote the following code in my .cpp file/class-method:

void MyWidgetClass::setTf(unsigned int index)
{
  for(unsigned int i = 0; i < myVector_.size(); i++)
  {
	  QGradient gradient = myVector_[i]->getGradient();

	  // Initiates a CUDA kernel launch by emitting a QSignal; the kernel
	  // renders an image which is shown in the viewer_ widget.
	  tfEditor_->setGradient(gradient);

	  // Here my program on the CPU should wait for the external program's
	  // calculation on the GPU in each iteration of the for loop.
	  cudaThreadSynchronize();

	  // Get the rendered image and put it in a QImage.
	  QImage image = viewer_->grabFrameBuffer();
  }
}


Unfortunately, I still get wrong images (ones that were rendered earlier) when I save them with this code. Is there anything special I should consider when using this function, or should it work like that? cudaThreadSynchronize() returns cudaSuccess, but it still does not seem to make the CPU wait.

Could it be a problem that the CUDA kernel is not launched directly within the block, but is only initiated by a QSignal emitted by tfEditor_->setGradient(gradient);? Perhaps cudaThreadSynchronize() does not find a CUDA kernel to wait for and simply proceeds? And are there other ways to work around this?


It sounds like you’re launching the kernel from a separate thread; this will not work. You can only synchronize from the same thread as the one currently holding the CUDA context.
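
As a rough illustration of one possible workaround, here is a minimal sketch in plain standard C++ (no CUDA or Qt involved). The idea is that the thread doing the rendering synchronizes with the GPU itself, and the calling thread waits on a completion handle instead of calling cudaThreadSynchronize(). The names renderFrame and renderAll are hypothetical, the "image" is just an int, and the sleep is only a stand-in for the kernel launch plus synchronization. It also assumes a C++11 compiler, which VS2008 is not, so in the setup from the question the same idea would have to be expressed with Qt's own threading facilities.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for the external renderer. In the real program this
// would run on the thread that holds the CUDA context.
int renderFrame(int parameter) {
    // Real code would launch the kernel here and then call
    // cudaThreadSynchronize() on this same thread. The sleep only
    // simulates that work.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    return parameter * 2; // pretend "image" derived from the parameter
}

// For each parameter: start a render, wait for it to finish, keep the result.
std::vector<int> renderAll(const std::vector<int>& parameters) {
    std::vector<int> images;
    for (int p : parameters) {
        // The render runs on another thread; the future becomes ready only
        // after renderFrame has returned.
        std::future<int> done = std::async(std::launch::async, renderFrame, p);
        images.push_back(done.get()); // blocks until this frame is finished
    }
    return images;
}
```

With this structure, each loop iteration only grabs the frame after the rendering thread has reported that it is done, so a stale image from an earlier iteration should not be picked up.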