Streams and CPU


I’m trying to work on the CPU with the data generated by the kernel: after copying the data from the device to the host, I process it on the CPU and then relaunch the kernel to generate more data.

This process runs in a loop.

This is the algorithm, but it doesn’t seem to work:

while (true)

  1. cudaMallocs() // asking for memory

  2. cudaDeviceSynchronize() // making sure everything is synchronized

  3. cudaEventRecord(start) // for timing purposes

  4. call_async_Kernel(b, t, 0, 0)

  5. cudaEventRecord(stop)

  6. cudaMemcpyAsync(cudaMemcpyDeviceToHost)

  7. while (cudaEventQuery(stop) == cudaErrorNotReady)
       if (data_Available) // process the copied data on the CPU while waiting

  8. cudaFree()
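For reference, the steps above might look like this in actual code (a minimal sketch; the buffer names, the size variable, and the kernel signature are hypothetical):

```cuda
// Hypothetical sketch of the loop described above
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

while (true) {
    int *d_data, *h_data;
    cudaMalloc(&d_data, size);        // 1. device memory
    cudaMallocHost(&h_data, size);    //    pinned host memory
    cudaDeviceSynchronize();          // 2. make sure previous work is done
    cudaEventRecord(start);           // 3. timing
    MyKernel<<<b, t>>>(d_data);       // 4. asynchronous kernel launch
    cudaEventRecord(stop);            // 5.
    cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost); // 6.
    while (cudaEventQuery(stop) == cudaErrorNotReady) {            // 7. poll
        /* work on already-available data on the CPU */
    }
    cudaFree(d_data);                 // 8.
    cudaFreeHost(h_data);
}
```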

Maybe this is not the proper way to do it, any ideas?

Thanks in advance!

You may be interested in using CUDA streams:

Create a stream, push all your work asynchronously into this stream, and, right after the cudaMemcpyAsync that brings the data back to the host, call cudaStreamSynchronize:

/* Create a stream: equivalent to a work queue */
cudaStream_t stream;
cudaStreamCreate(&stream);

/* Perform memory allocation outside of the loop if possible, because these operations are expensive in terms of time */

cudaMallocHost(&hostPtr, size);   // ask for pinned host memory, for faster memcopies
cudaMalloc(&inputDevPtr, size);   // device memory (cudaMalloc; there is no cudaMallocDevice)
cudaMalloc(&outputDevPtr, size);

while (true) {
    cudaMemcpyAsync(inputDevPtr, hostPtr, size, cudaMemcpyHostToDevice, stream);
    MyKernel<<<64, 64, 0, stream>>>(b, t, 0, 0);
    cudaMemcpyAsync(hostPtr, outputDevPtr, size, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // block until the results are back on the host
    /* ...work on hostPtr on the CPU, then loop again... */
}



This way, you will be able to synchronize operations properly, even if you decide to run the same kind of loop in another CPU thread using another CUDA stream.
But if you want to perform the (n+1)th GPU computation in parallel with the nth CPU computation, you will need more than one stream, and preferably more than one CPU thread.
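For example, a two-stream double-buffering pattern lets the copy and kernel for iteration n+1 overlap with CPU work on iteration n's results (a sketch; the buffer names, size, and kernel signature are hypothetical):

```cuda
// Hypothetical double-buffering sketch with two streams
cudaStream_t streams[2];
float *h_buf[2], *d_buf[2];
for (int i = 0; i < 2; ++i) {
    cudaStreamCreate(&streams[i]);
    cudaMallocHost(&h_buf[i], size);   // pinned memory: required for truly async copies
    cudaMalloc(&d_buf[i], size);
}

for (int n = 0; ; ++n) {
    int cur = n % 2;
    // launch iteration n's GPU work on its own stream
    MyKernel<<<64, 64, 0, streams[cur]>>>(d_buf[cur]);
    cudaMemcpyAsync(h_buf[cur], d_buf[cur], size,
                    cudaMemcpyDeviceToHost, streams[cur]);
    if (n > 0) {
        int prev = (n - 1) % 2;
        cudaStreamSynchronize(streams[prev]);  // wait only for iteration n-1
        /* process h_buf[prev] on the CPU while streams[cur] keeps running */
    }
}
```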