Is it possible to run a CUDA kernel from several CPU threads, and how does it work?

Hi all.
I’m processing several requests in parallel on my PC.
The processing is implemented as a CUDA kernel.
Is it possible to launch a CUDA kernel from several CPU threads? How does it work, and what about performance?
Thanks in advance.

Yes, it’s possible. The CUDA OpenMP sample code gives one example:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cudaopenmp

That sample happens to launch kernels on separate GPUs, but it’s not difficult to modify the code to run on a single GPU. (You may also want to look at the CUDA concurrent kernels sample code, which is not multi-threaded, but demonstrates running multiple concurrent kernels on the same GPU.)
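As a rough illustration of the single-GPU variant described above (not the sample code itself), here is a minimal sketch in which several CPU threads each launch the same kernel on device 0. The kernel `processRequest`, the helpers `worker` and `runFromThreads`, and the choice of `std::thread` instead of OpenMP are all assumptions made for this example; giving each thread its own stream is one common way to let the launches overlap, but whether they actually run concurrently depends on the GPU and on how many resources each kernel uses.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-request kernel: doubles each element.
__global__ void processRequest(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Work done by one CPU thread: own stream, own buffers, one kernel launch.
void worker(int id, int n, int *ok)
{
    *ok = 0;
    cudaSetDevice(0);  // the current device is per-thread state; all threads share GPU 0

    std::vector<float> host(n, static_cast<float>(id + 1));
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return;
    }

    // Pageable host memory is used here for brevity; pinned memory
    // (cudaMallocHost) would be needed for the copies to be truly asynchronous.
    cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
    processRequest<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Wait only for this thread's stream, not for the whole device.
    cudaError_t err = cudaStreamSynchronize(stream);

    float expected = 2.0f * static_cast<float>(id + 1);
    if (err == cudaSuccess && host[0] == expected && host[n - 1] == expected)
        *ok = 1;

    cudaFree(dev);
    cudaStreamDestroy(stream);
}

// Launches the kernel from numThreads CPU threads; returns true if all succeeded.
bool runFromThreads(int numThreads, int n)
{
    std::vector<int> ok(numThreads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t, n, &ok[t]);
    for (auto &th : threads)
        th.join();

    for (int t = 0; t < numThreads; ++t)
        if (!ok[t]) return false;
    return true;
}

int main()
{
    bool ok = runFromThreads(4, 1 << 16);
    printf("all threads succeeded: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

This should build with something like `nvcc -std=c++11 example.cu -o example` (some Linux toolchains may also need `-Xcompiler -pthread`). The threads share one device context, so kernels launched from different threads can only overlap if they are issued into different streams, which is why each worker creates its own.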

The simple Multi-GPU sample may also be of interest:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-multi-gpu

Thanks, txbob.
I’d like to know how the performance compares in two cases:
the duration of n sequential executions of a kernel vs. the duration of executing the kernel from n CPU threads.
I would appreciate an answer on this, thanks.
PS: Also the duration of executing the kernel in n CUDA streams.