Concurrent CPU and GPU execution on TESLA

Good day,

Noob here with a simple question: can we call a device kernel and carry on with CPU computation without waiting for the GPU to return? Something like the right part of the attached image.

Is this possible on TESLA? FERMI?

Any simple examples?

Thank you,

Fadhel

This is actually the default behavior. Kernel launches are asynchronous, i.e. they return before their work is done. You can simply carry on with normal CPU computations afterwards and they will overlap with the GPU computation. Synchronization happens only either explicitly, with functions like cudaDeviceSynchronize(), or implicitly, through blocking functions like cudaMemcpy() to/from the host. (Chapter 3.2.5 of the CUDA C Programming Guide has further details and also discusses things like concurrent kernels and asynchronous memcpys.)
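For illustration, here is a minimal sketch (not from the original post) showing the pattern: the kernel launch returns immediately, CPU work runs while the GPU is busy, and the device-to-host cudaMemcpy() implicitly waits for the kernel to finish. The kernel and variable names are just placeholders for this example.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial example kernel: each thread doubles one element.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // The kernel launch returns immediately; the GPU works in the background.
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);

    // This CPU work runs concurrently with the kernel.
    double cpuSum = 0.0;
    for (int i = 0; i < n; ++i)
        cpuSum += h_data[i];
    printf("CPU sum computed while the GPU is busy: %f\n", cpuSum);

    // The blocking copy back to the host waits for the kernel to finish,
    // so no explicit cudaDeviceSynchronize() is needed here.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("First element after the kernel: %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

The same pattern works on Tesla- and Fermi-generation hardware, since asynchronous kernel launches have been part of the CUDA runtime from the start.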

Thank you!