Heterogeneous programming

Is it possible to do heterogeneous programming with CUDA and NVIDIA GPUs?

That is, I want to make a call to a kernel function and, while I am waiting for the results, I want the CPU to do something useful. For example:

call_Kernel(a, b); // GPU computes this.
do_something();    // CPU computes this.

While the GPU is executing call_Kernel, I want the CPU to execute do_something. Is it possible to do this in CUDA? Is there something special I need to do to make sure the CPU starts the next statement in the program (after the kernel call) without waiting for the results from the kernel call?

Thanks in advance.

You’ve basically answered the question yourself:

call_Kernel(a,b );  // GPU computes this.

do_something();  // CPU computes this.


Once a kernel is launched, the CPU continues executing asynchronously with respect to the kernel; the kernel is not guaranteed to have completed until a global synchronization is performed, or until you synchronize on the stream associated with the kernel launch. This is covered in the CUDA 2.0 Programming Guide.
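As a sketch of the pattern from the question (call_Kernel and do_something are the placeholder names from the original post; the kernel body here is purely hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for call_Kernel from the question.
__global__ void call_Kernel(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = a[i] * 2.0f;
}

// Stand-in for the CPU-side work from the question.
void do_something()
{
    printf("CPU is free to do useful work while the kernel runs\n");
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    // The launch returns control to the CPU immediately;
    // the GPU executes the kernel in the background.
    call_Kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);

    do_something(); // overlaps with the kernel execution

    // Block the CPU until the kernel has finished before reading results
    // (cudaThreadSynchronize in the CUDA 2.0 era; cudaDeviceSynchronize today).
    cudaThreadSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

Note that a blocking call such as cudaMemcpy of the results back to the host also implicitly waits for the kernel to finish.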


Thanks a lot!

I have a couple of other questions. When I run deviceQuery on my card it says…

" Concurrent copy and execution: No"

What does this mean? Does it mean prefetching might not give me better performance? Will the CPU be able to execute asynchronously with the GPU on my card? (Please find the output of the entire deviceQuery below.)

Also, I have two cards installed in another machine. How can I choose which card to run my program on?

Thanks a lot!


Device 0: “Tesla C870”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No


“Concurrent copy and execution” refers to overlapping a copy between GPU and system RAM with GPU kernel execution.

Some GPUs support performing a cudaMemcpy concurrently with kernel execution; the streams concept takes advantage of this. So, while kernel execution is happening for kernelA, a GPU memcpy could be happening in another stream, and so on. Since your card reports “No”, it cannot overlap copies with execution, so stream-based prefetching won’t help there. The CPU can still execute asynchronously with the GPU regardless; that capability does not depend on this flag.
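A minimal sketch of that overlap, assuming a device that reports “Concurrent copy and execution: Yes” (kernelA and the buffer names are hypothetical; note that asynchronous copies also require page-locked host memory, allocated here with cudaMallocHost):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; the work it does is not important for the overlap.
__global__ void kernelA(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_buf, *d_buf0, *d_buf1;
    cudaMallocHost(&h_buf, n * sizeof(float)); // pinned host memory, required for async copies
    cudaMalloc(&d_buf0, n * sizeof(float));
    cudaMalloc(&d_buf1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // On a device with concurrent copy and execution, the copy in stream s1
    // can be in flight while the kernel in stream s0 is still running.
    kernelA<<<n / 256, 256, 0, s0>>>(d_buf0, n);
    cudaMemcpyAsync(d_buf1, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, s1);

    cudaThreadSynchronize(); // wait for both streams to drain

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_buf);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    return 0;
}
```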

Use cudaSetDevice(). See the programming guide and reference manual for more information.
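For example, a minimal sketch that enumerates the installed devices and then selects one (choosing device 1 here is just an illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
    }

    // Select the device before any allocations or kernel launches;
    // all subsequent CUDA calls on this host thread target that device.
    cudaSetDevice(1);
    return 0;
}
```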