CUDA KERNEL_LAUNCH_FAILED when I call the same kernel immediately after the previous call

I have the following while loop set up with kernel k:

results1_h - Used to hold results from the first kernel call 
results2_h - Used to hold results from the second kernel call.

while (1) {
    1. Launch k.  Also send a device pointer results_d to the
       kernel to store the results.

    2. Do a MemcpyFromDtoH(results1_d, results1_h)

    3. Parse results2_h.  <- The first time we enter the
                             loop there are no results in 
                             results2_h and we should sail
                             through to the next step.

    4. Call cuCtxSynchronize().

    5. Launch kernel k again. Send the same device pointer 
       results_d to the kernel which we sent in (1)

    6. Do a MemcpyFromDtoH(results1_d, results2_h) <- This 
       is where it returns a LAUNCH_FAILED error.

    7. Parse results2_h.

    8. Call cuCtxSynchronize().
}

At step (6) I get a LAUNCH_FAILED. I’m not sure why I get this error. I called cuCtxSynchronize() at (4), so the GPU is idle and I should be able to launch the kernel again as before, shouldn’t I?

Also note that I created the CUDA context with CU_CTX_SCHED_BLOCKING_SYNC, so the call to cuCtxSynchronize() blocks, and the two kernel launches and their corresponding memcpyDtoH calls happen sequentially.

What am I doing wrong that I’m getting a launch failure there?

I am running a GTX 680 (CC 3.0) with CUDA Toolkit 4.2.

CUDA launches are asynchronous. The MemcpyFromDtoH at step 6 actually waits for the outstanding work to complete before copying the results back. As a result, errors in the kernel launched at step 5 are reported at step 6.
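
One way to pin the failure to a specific launch while debugging is to synchronize right after it and check the result. Below is a minimal host-side sketch, assuming you launch through cuLaunchKernel from the driver API and that the kernel takes the results pointer as its only argument. CU_CHECK and launch_and_check are hypothetical names, not something from your code.

```cuda
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical macro: abort with file/line if a driver API call fails.
#define CU_CHECK(call)                                              \
    do {                                                            \
        CUresult err_ = (call);                                     \
        if (err_ != CUDA_SUCCESS) {                                 \
            fprintf(stderr, "%s:%d: %s failed with CUresult %d\n",  \
                    __FILE__, __LINE__, #call, (int)err_);          \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: launch kernel k and wait for it, so a fault inside
// the kernel is reported here rather than at a later memcpy.
// Assumes k takes a single CUdeviceptr argument.
static void launch_and_check(CUfunction k, CUdeviceptr results_d,
                             unsigned int grid_x, unsigned int block_x)
{
    void *args[] = { &results_d };

    // The launch is only queued here; a success result does not mean
    // the kernel ran correctly.
    CU_CHECK(cuLaunchKernel(k, grid_x, 1, 1, block_x, 1, 1,
                            0 /* shared mem bytes */,
                            NULL /* default stream */, args, NULL));

    // Debug only: blocks until the kernel finishes and returns any
    // execution error (for example CUDA_ERROR_LAUNCH_FAILED).
    CU_CHECK(cuCtxSynchronize());
}
```

With this in place, a bad access in the second launch should show up at the cuCtxSynchronize() inside the helper instead of at the memcpy in step 6. Remove the extra synchronize once the bug is found, since it serializes the host and the GPU.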

Have you tried running your application with cuda-memcheck or cuda-gdb to identify the source of the error in your kernel?

No, I haven’t tried cuda-gdb or cuda-memcheck. I am using the driver API, btw. Will try them next. But do you notice anything wrong with the above code?

I understand that the memcpy reports the error from the previous launch since it is async, but I’m not sure why it errors out in the first place.

For example, if I remove the overlap of the next kernel call with the parsing of the previous call’s results, I never get the error. The code below works perfectly fine.

while (1) {
    1. Launch k. Also send a device pointer results_d to the
        kernel to store the results.

    2. Do a MemcpyFromDtoH(results1_d, results1_h)

    4. Call cuCtxSynchronize().

    7. Parse results1_h.
}

No error here. The moment I try to launch the kernel again before parsing results1_h, I get the error for the second kernel.

Never mind, folks. Figured out the answer. I was dereferencing some junk in the kernel in the subsequent calls. Thanks, guys.

cuda-gdb helped as well.
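
For anyone landing here later: the actual kernel is not shown in this thread, so the following is only a hypothetical illustration of one common form of this class of bug (an access outside the results buffer) and the usual guard against it. The names k_guarded, results, and n are assumptions made up for the sketch.

```cuda
// Hypothetical kernel; k_guarded, results, and n are not from the thread.
// extern "C" keeps the name unmangled so cuModuleGetFunction can find it.
extern "C" __global__ void k_guarded(int *results, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the buffer must not touch it; without this
    // check an out-of-range write can surface as a launch failure.
    if (idx >= n)
        return;

    results[idx] = (int)idx;  // placeholder for the real computation
}
```

A guard like this only covers out-of-range indices. A stale or uninitialized pointer passed from the host still has to be caught with cuda-memcheck or cuda-gdb.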