I’m trying to run my CPU and GPU in parallel. I have a compute-intensive GPU kernel, and I want the CPU to do other computations while it waits for the GPU results. My code has several CPU threads: one launches the GPU kernel, and the rest run their own compute-intensive tasks on the CPU. I call cudaSetDeviceFlags() and expected it to make the CPU thread that launches the kernel yield while the kernel is running, so that the CPU core is available to the other threads. In practice, no matter which flag I pass to cudaSetDeviceFlags(), the thread does not yield. Any ideas?
I’m attaching part of my code and a screenshot from Nsight that shows the CPU thread running and not yielding.
// I tried cudaDeviceScheduleSpin, cudaDeviceScheduleYield, cudaDeviceScheduleBlockingSync and cudaDeviceScheduleAuto. All give the same results.
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
cudaStatus = cudaDeviceSynchronize();
bp_2xrts_kernel<<<blocks, threads>>>(outBuffer, inBuffer);
cudaStatus = cudaDeviceSynchronize();
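The snippet above does not check the return value of cudaSetDeviceFlags(), so a rejected flag would go unnoticed. Below is a minimal sketch of how that could be checked, assuming the flag is set before any other runtime call in the process. The names busy_kernel and run_with_blocking_wait are hypothetical stand-ins (the real bp_2xrts_kernel is not shown here). The sketch also waits on an event created with cudaEventBlockingSync, which is a different technique from the device-wide flag. It is only meant as a diagnostic, not a confirmed fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the long-running kernel in the post.
__global__ void busy_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        for (int k = 0; k < 10000; ++k) {
            v = v * 0.999f + 0.5f;
        }
        out[i] = v;
    }
}

// Hypothetical helper: returns the first CUDA error seen, or cudaSuccess.
cudaError_t run_with_blocking_wait(int n)
{
    float *in = nullptr;
    float *out = nullptr;
    cudaEvent_t done = nullptr;
    unsigned int flags = 0;
    size_t bytes = (size_t)n * sizeof(float);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Assumption: this is the first runtime call in the process.
    cudaError_t st = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // Read the flags back to see which scheduling mode is in effect.
    if (st == cudaSuccess) st = cudaGetDeviceFlags(&flags);
    if (st == cudaSuccess) {
        bool blocking = (flags & cudaDeviceScheduleMask) == cudaDeviceScheduleBlockingSync;
        printf("blocking sync requested and reported: %s\n", blocking ? "yes" : "no");
    }

    if (st == cudaSuccess) st = cudaMalloc(&in, bytes);
    if (st == cudaSuccess) st = cudaMalloc(&out, bytes);
    if (st == cudaSuccess) st = cudaMemset(in, 0, bytes);

    // An event created with cudaEventBlockingSync asks the host thread
    // to block in cudaEventSynchronize() rather than spin.
    if (st == cudaSuccess) st = cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    if (st == cudaSuccess) {
        busy_kernel<<<blocks, threads>>>(out, in, n);
        st = cudaGetLastError();  // catches launch-configuration errors
    }
    if (st == cudaSuccess) st = cudaEventRecord(done, 0);
    if (st == cudaSuccess) st = cudaEventSynchronize(done);

    if (st != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(st));
    }

    // Release whatever was successfully created.
    if (done) cudaEventDestroy(done);
    cudaFree(out);
    cudaFree(in);
    return st;
}
```

Calling run_with_blocking_wait(1 << 20) from the launching thread and watching that thread in Nsight should show whether it sleeps inside cudaEventSynchronize(). If cudaSetDeviceFlags() returns an error here, that would point to the flag never having taken effect in the original code.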