I am trying to start a kernel on the GPU and still be able to issue cudaMemcpy calls while it is running. I have tried several approaches, including spawning new host threads (which CUDA chokes on), setting up streams, and following the method in the AsyncAPI SDK sample. In each case the memcpy and event calls can be asynchronous, but as soon as I launch a kernel, execution of the main program stops until the kernel finishes.
Here is a basic example of code I’ve run (note that GPUFlag is a pointer to a GPU memory location):
cudaStream_t loop, setflag;
int flag = 1, counter = 0;
cudaEvent_t stop;

cudaStreamCreate(&loop);     // stream for the long-running kernel
cudaStreamCreate(&setflag);  // stream for uploading the flag
CUDA_SAFE_CALL(cudaEventCreate(&stop));
CUDA_SAFE_CALL(cudaThreadSynchronize());

// Launch the polling kernel, then record an event behind it in the same stream.
runLoopGPU<<<BLOCK_COUNT, THREAD_COUNT, 0, loop>>>();
cudaEventRecord(stop, loop);

// Upload the flag in the other stream while the kernel (hopefully) runs.
CUDA_SAFE_CALL(cudaMemcpyAsync(GPUFlag, &flag, sizeof(int), cudaMemcpyHostToDevice, setflag));

printf("Waiting\n");
fflush(stdout);

// Busy-wait on the host until the kernel's event completes.
while (cudaEventQuery(stop) == cudaErrorNotReady)
    counter++;

printf("Job's done with a count of %d\n", counter);
exit(0);
All runLoopGPU does is sit in a while loop checking GPUFlag for a value other than 0 (and yes, it is initialized to 0 before any of this runs). The program sits in that loop until the watchdog timer kills the kernel; it then prints 'Waiting' to the screen and spins in the host loop forever.
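For reference, runLoopGPU boils down to something like this (a simplified sketch, not the exact code; d_flag is a stand-in name for the device variable that GPUFlag points to in my real program):

__device__ int d_flag = 0;  // GPUFlag holds this variable's device address

__global__ void runLoopGPU()
{
    // Spin until the host's cudaMemcpyAsync overwrites the flag
    // with a nonzero value.
    while (d_flag == 0)
        ;
}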
Is there some way to accomplish what I want to do?