Asynchronous execution of kernels

I am trying to launch a kernel on the GPU and still be able to issue cudaMemcpy commands while it is running. I have tried several methods, including creating new threads (which CUDA chokes on), setting up streams, and using the method from the AsyncAPI demo. In each case the memcpy and event calls can be asynchronous, but whenever a kernel is launched, execution of the main program stops until the kernel finishes.

Here is a basic example of code I’ve run (note that GPUFlag is a pointer to a GPU memory location):

cudaStream_t loop, setflag;
int flag = 1, counter = 0;
cudaEvent_t stop;

cudaStreamCreate(&loop);
cudaStreamCreate(&setflag);
CUDA_SAFE_CALL(cudaEventCreate(&stop));
CUDA_SAFE_CALL(cudaThreadSynchronize());

// Launch the kernel in its own stream, then record an event behind it.
runLoopGPU<<<BLOCK_COUNT, THREAD_COUNT, 0, loop>>>();
cudaEventRecord(stop, loop);

// Set the flag from a second stream while the kernel is (supposedly) running.
CUDA_SAFE_CALL(cudaMemcpyAsync(GPUFlag, &flag, sizeof(int), cudaMemcpyHostToDevice, setflag));

printf("Waiting\n");
fflush(stdout);

// Busy-wait on the host until the kernel's event completes.
while (cudaEventQuery(stop) == cudaErrorNotReady)
    counter++;

printf("Job's done with a count of %d\n", counter);
exit(0);

All runLoopGPU does is sit in a while loop checking GPUFlag for a value other than 0 (and yes, it is initialized to 0 before any of this runs). The program blocks until the watchdog timer kills the kernel; only then does ‘Waiting’ appear on the screen, after which it sits in the polling loop forever.
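Stripped down, the kernel amounts to a spin loop roughly along these lines (simplified; the flag lives in a __device__ variable that GPUFlag points at, since the launch takes no arguments):

__device__ int d_flag = 0;   // GPUFlag points at this

__global__ void runLoopGPU(void)
{
    while (d_flag == 0)      // spin until the host sets the flag
        ;
}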

Is there some way to accomplish what I want to do?

Well, does it really keep checking, or does it read the value of GPUFlag once and then sit in an endless loop, e.g. because you forgot to use “volatile” or because of a compiler bug? Hard to know for sure without the actual kernel code and the generated GPU asm.
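To illustrate what I mean: without volatile, the compiler is free to load the flag into a register once and then spin on that stale copy forever. Reading through a volatile pointer forces a fresh load from memory on every iteration, roughly:

__global__ void runLoopGPU(void)
{
    // volatile read: reload GPUFlag's memory location every iteration
    // instead of spinning on a value cached in a register
    volatile int *flag = &d_flag;
    while (*flag == 0)
        ;
}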

There is no question that you are misusing the CUDA API here, though, so even if you get it to work you should expect a lot of problems.
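For host-to-kernel signalling, the pattern I would try instead is mapped, page-locked (“zero-copy”) host memory: the host writes the flag directly, and the kernel polls it through a volatile device pointer, so there is no cudaMemcpyAsync racing against a spinning kernel at all. A rough sketch, assuming your device supports host-mapped memory (the names here are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void waitForFlag(volatile int *flag)
{
    while (*flag == 0)   // volatile: reload from memory each iteration
        ;
}

int main(void)
{
    int *h_flag = NULL, *d_flag = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // must come before any other CUDA work
    cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);
    *h_flag = 0;

    waitForFlag<<<1, 1>>>(d_flag);          // the launch itself returns immediately
    printf("Kernel launched, host is still running\n");

    *h_flag = 1;                            // a plain host write releases the kernel

    cudaThreadSynchronize();                // blocks until the kernel exits
    printf("Job's done\n");

    cudaFreeHost(h_flag);
    return 0;
}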