CUDA debugging issues

I am puzzled by a couple of CUDA issues and wanted to know whether this is just my experience or others have run into similar problems, and whether there are any workarounds. I have noticed that when a CUDA function fails, control returns to the host almost immediately, within tens of microseconds. Is there any way to capture this error, or to find out more about what happened? For instance, I might be copying more elements into an array than I allocated with cudaMalloc, or there might be a resource issue such as trying to use more registers than are physically available. Has anyone else faced these issues?

The second issue is that the GPU's global memory sometimes seems to retain data from a previous computation: if the first run of the code executes all right, and I then run the program again while trying to use more elements than were allocated, control quickly passes back to the CPU, but the results still look correct because they are left over from the previous run. I am wondering whether I am the only one who has seen this. Thanks for the replies.

Use CUT_CHECK_ERROR from cutil or do:

kernel<<<grid,threads>>>();
cudaThreadSynchronize(); // needed because kernel launches are asynchronous
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    // handle the error (cudaGetErrorString(err) converts the error code to a human-readable string)

For performance reasons, you probably only want to check for errors in debug builds, not release builds. CUT_CHECK_ERROR already does this.
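A minimal sketch of such a debug-only check, wrapped in a macro similar in spirit to cutil's CUT_CHECK_ERROR (the macro name and message handling here are my own, not from cutil):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical debug-only check: compiled out when NDEBUG is defined
// (i.e. in release builds), active otherwise.
#ifdef NDEBUG
#define CHECK_KERNEL(msg) ((void)0)
#else
#define CHECK_KERNEL(msg)                                            \
    do {                                                             \
        cudaThreadSynchronize(); /* wait for the async kernel */     \
        cudaError_t e = cudaGetLastError();                          \
        if (e != cudaSuccess) {                                      \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(e)); \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)
#endif

// Usage:
// kernel<<<grid, threads>>>();
// CHECK_KERNEL("kernel launch");
```

The do/while(0) wrapper makes the macro behave like a single statement, so it is safe inside an if/else without braces.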

I am not sure this always works. In the example I gave earlier, I allocated some memory and then accessed far more elements in my kernel than I should have, yet for some reason I still got cudaSuccess. The most reliable strategy I have found so far is this: before copying my results array to GPU memory, I initialize a host array of the same size to 0 and copy the zeros to the GPU. That way, if the computation goes wrong, the array copied back to host memory after the kernel completes will contain zeros rather than stale but plausible-looking values from a previous run.
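The zero-initialization trick described above might look like this (variable names and the element count are my own, assuming a float results array):

```cpp
#include <cstring>
#include <cuda_runtime.h>

const int N = 1024;
float h_results[N];
float *d_results;

cudaMalloc((void**)&d_results, N * sizeof(float));

// Zero the host buffer and copy it over, so the device array starts
// from a known state instead of whatever a previous run left behind.
memset(h_results, 0, sizeof(h_results));
cudaMemcpy(d_results, h_results, sizeof(h_results), cudaMemcpyHostToDevice);

// kernel<<<grid, threads>>>(d_results);

cudaMemcpy(h_results, d_results, sizeof(h_results), cudaMemcpyDeviceToHost);
// Any element still 0 that the kernel should have written indicates the
// computation did not run or complete as expected.
```

Note that cudaMemset(d_results, 0, N * sizeof(float)) achieves the same zeroing directly on the device, without the extra host-to-device copy.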

If you write past the end of an array in memory, it is not always caught as an error right away. Often, CUT_CHECK_ERROR will only report an error during a later kernel call, long after the one that wrote past the end of the allocation.

Errors that cause the kernel launch to fail right away, such as requesting too many registers or too much shared memory, or using an unbound texture, will be caught by CUT_CHECK_ERROR immediately.
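For the delayed out-of-bounds case specifically, the cuda-memcheck tool that ships with the CUDA toolkit can catch the offending access at the point where it happens, reporting the kernel name and address (the binary name below is just an example):

```shell
# Run the application under the memory checker; execution is slower,
# but out-of-bounds global memory accesses are reported as they occur.
cuda-memcheck ./my_app
```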