In my code, I have a sequence similar to this:
// do some stuff, launch kernels, etc.
res = cudaDeviceSynchronize();
// check res
res = cudaGetLastError();
// check res
All calculations are done on the default stream, from a single host thread.
The cudaDeviceSynchronize call returns cudaSuccess, but the subsequent cudaGetLastError call returns an "invalid device function" error (cudaErrorInvalidDeviceFunction).
Should this be possible according to the CUDA API specification?
I mean, the sync call should wait until the device has finished, so no new errors should be raised between those two lines of code (once again, assuming a single-threaded app).
How can this happen?