ULF: how to start debugging an unspecified launch failure (ULF) on a device-to-host data transfer

Apologies for the generality of the question; I’m new to GPU/CUDA programming.

I have an application with two kernels (call them K1 and K2).
I transfer data to the device, run K1 N times, and transfer data out of the device.
All is well.

If I transfer data in, run K1 N times, then run K2 once (and it hardly does anything),
then when I try to transfer data out I get a ULF on the device-to-host memcpy.

My question is: how do I start debugging this?

What conditions cause a data transfer to fail with a ULF?


The first step is to see whether the kernel is fine. Compile with -deviceemu and debug it normally in your favorite IDE (breakpoints, watches, locals, and so on). Once you are 100% sure your kernel is fine, remove -deviceemu to get full hardware acceleration.

You cannot debug the memcpys themselves (they live inside cudart.dll/cuda.dll and the driver, and the source isn’t public). Just call cudaError_t err = cudaThreadSynchronize() after them and check the returned error code (you can do this after your <<< >>> kernel launch too). With luck it gives you an idea of what is happening instead of a generic error. See the CUDA_SAFE_CALL macro for more on how to use it.
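One reason a device-to-host cudaMemcpy can report a ULF is that kernel launches are asynchronous: a fault inside an earlier kernel may only be reported by the next call that synchronizes with the device, and that is often the memcpy. The sketch below checks after every launch, so a failure is attributed to the kernel that caused it. It is only an illustration under some assumptions: the kernel bodies are stand-ins (the real K1 and K2 are not shown in this thread), check and run_pipeline are hypothetical helper names, and cudaDeviceSynchronize() is used as the newer name for cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report which step failed and return false on error.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Stand-in kernels; the poster's real K1 and K2 are not known.
__global__ void k1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // bounds check avoids out-of-range writes
}

__global__ void k2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copy in, run k1 `iterations` times, run k2 once, copy out.
// Returns 0 on success and 1 on the first failing step.
int run_pipeline(float *host, int n, int iterations) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);
    float *dev = NULL;

    if (!check(cudaMalloc((void **)&dev, bytes), "cudaMalloc")) return 1;
    bool ok = check(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice),
                    "memcpy host to device");

    for (int it = 0; ok && it < iterations; ++it) {
        k1<<<blocks, threads>>>(dev, n);
        // cudaGetLastError catches bad launch configurations;
        // the synchronize call catches faults during execution.
        ok = check(cudaGetLastError(), "k1 launch") &&
             check(cudaDeviceSynchronize(), "k1 execution");
    }
    if (ok) {
        k2<<<blocks, threads>>>(dev, n);
        ok = check(cudaGetLastError(), "k2 launch") &&
             check(cudaDeviceSynchronize(), "k2 execution");
    }
    if (ok) {
        ok = check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost),
                   "memcpy device to host");
    }
    cudaFree(dev);
    return ok ? 0 : 1;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 0.0f;
    if (run_pipeline(host, n, 4) != 0) return 1;
    std::printf("host[0] = %f\n", host[0]);
    return 0;
}
```

If the message names "k2 execution" instead of the memcpy, the transfer itself is probably fine and the problem is inside K2, for example an out-of-bounds read or write. The synchronize calls serialize the work, so they are best kept to debug builds.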

I have a suggestion about cudaMalloc/cudaMemcpy/kernel debugging: would it be possible to add a debug layer, like the Direct3D debug mode or glExpert, enabled through a control panel or something similar, so that more descriptive errors are displayed in the IDE debug window via OutputDebugString()?

Thanks for the reply.

My code runs fine with -deviceemu,
that is, without the ULF.

I suppose this still leaves the possibility that K2,
or something else, is not compiled properly for the device.