cudaErrorLaunchTimeout after cudaMemcpy

I implemented a simple ray tracer. Everything works fine until I increase the data size. The program runs fine through several cudaMalloc calls, cudaMemcpy (host to device), and the kernel launch, but when I call cudaMemcpy (device to host), the whole display driver shuts down and restarts, and I get a cudaErrorLaunchTimeout error.

I check the CUDA API return value at every step (including the kernel launch), so this error should be coming from the cudaMemcpy() call.
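For reference, the call sequence looks roughly like the sketch below (the buffer names, the kernel body, and the data size are simplified placeholders, not my real code); every call up to and including the launch check returns cudaSuccess, and the device-to-host copy is where the error appears.

```
// Rough sketch of the failing sequence; names and sizes are placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s -> %s\n", #call,                      \
                         cudaGetErrorString(err_));                        \
            std::exit(1);                                                  \
        }                                                                  \
    } while (0)

// Stand-in for the actual ray-tracing kernel.
__global__ void renderKernel(float* pixels, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pixels[i] = 0.0f;
}

int main() {
    const int n = 1 << 24;                       // larger sizes trigger the error
    float* h_pixels = new float[n];
    float* d_pixels = nullptr;

    CHECK(cudaMalloc(&d_pixels, n * sizeof(float)));                   // ok
    CHECK(cudaMemcpy(d_pixels, h_pixels, n * sizeof(float),
                     cudaMemcpyHostToDevice));                         // ok
    renderKernel<<<(n + 255) / 256, 256>>>(d_pixels, n);
    CHECK(cudaGetLastError());                                         // ok
    CHECK(cudaMemcpy(h_pixels, d_pixels, n * sizeof(float),
                     cudaMemcpyDeviceToHost));  // fails: cudaErrorLaunchTimeout
    CHECK(cudaFree(d_pixels));
    delete[] h_pixels;
    return 0;
}
```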

If I decrease the data size, this error doesn't happen, but I don't think I'm running out of memory.

Anyone have an idea?

It's almost certainly caused by the time spent in your kernel, unless you have a cudaThreadSynchronize() between the kernel launch and the cudaMemcpy that returns success. Kernel launches are asynchronous, so the device-to-host cudaMemcpy is the first call that actually waits for the kernel to finish, and that's where the watchdog timeout gets reported. (Alternatively, you could put a cudaGetLastError() immediately after the kernel launch and run with the environment variable CUDA_LAUNCH_BLOCKING set to 1.)
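A minimal sketch of that diagnostic might look like the following (the kernel, buffer names, and sizes are placeholders for your actual code); forcing a wait right after the launch makes the timeout show up at the kernel instead of at the later cudaMemcpy.

```
// Sketch of the suggested diagnostic; kernel and sizes are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void renderKernel(float* pixels, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pixels[i] = 0.0f;   // stand-in for the real ray-tracing work
}

int main() {
    const int n = 1 << 24;
    float* d_pixels = nullptr;
    cudaMalloc(&d_pixels, n * sizeof(float));

    renderKernel<<<(n + 255) / 256, 256>>>(d_pixels, n);

    // Launch errors (bad configuration, etc.) are reported immediately:
    cudaError_t launchErr = cudaGetLastError();

    // Execution errors, including the watchdog timeout, are only reported once
    // something waits for the kernel. Waiting here pins the error to the kernel
    // instead of the later device-to-host cudaMemcpy.
    cudaError_t execErr = cudaDeviceSynchronize();  // cudaThreadSynchronize() on
                                                    // older toolkits

    std::printf("launch: %s, execution: %s\n",
                cudaGetErrorString(launchErr), cudaGetErrorString(execErr));
    cudaFree(d_pixels);
    return 0;
}
```

Running the unmodified program with CUDA_LAUNCH_BLOCKING=1 set in the environment has a similar effect without changing any code.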

Thank you! I hadn't realized until now that the kernel is launched asynchronously by the driver…