Heterogeneous programming

bugBot · November 20, 2008, 1:00am

Is it possible to do heterogeneous programming with CUDA and NVIDIA GPUs?

That is, I want to call a kernel function and, while I am waiting for the results, have the CPU do something useful. For example:

call_Kernel(a,b ); // GPU computes this.
do_something(); // CPU computes this.

While the GPU is executing call_Kernel, I want the CPU to execute do_something. Is it possible to do this in CUDA? Is there something special I need to do to make sure the CPU starts the next statement in the program (after the kernel call) without waiting for the results from the kernel call?

Thanks in advance.

pstach · November 20, 2008, 2:16am

You’ve basically answered the question yourself:

call_Kernel(a,b );  // GPU computes this.
do_something();  // CPU computes this.
cudaThreadSynchronize();

Once a kernel is launched, the CPU continues executing asynchronously with respect to the kernel. The kernel is not guaranteed to have finished until you do a global synchronize or a synchronize on the stream associated with that kernel launch. This is covered in the CUDA 2.0 Programming Guide.

-Patrick
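
A minimal sketch of the pattern described above, assuming a CUDA-capable device and a toolkit that provides cudaDeviceSynchronize() (the later replacement for cudaThreadSynchronize(), which was deprecated in CUDA 4.0). The names scale_kernel, cpu_side_work, and overlap_cpu_and_gpu are hypothetical and exist only for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of a device array.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical CPU-side work that runs while the kernel is in flight.
static long long cpu_side_work(int n)
{
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}

// Returns 0 on success, 1 on a CUDA error or an unexpected result.
int overlap_cpu_and_gpu(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d = NULL;
    if (h == NULL) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return 1; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // The launch returns control to the CPU without waiting for the GPU.
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n);

    // This CPU work can overlap with the kernel execution.
    long long sum = cpu_side_work(n);

    // Block until the kernel has finished; this also reports its errors.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess) && h[0] == 2.0f && h[n - 1] == 2.0f
             && sum == (long long)n * (n - 1) / 2;
    printf("CPU sum = %lld, h[0] = %.1f\n", sum, h[0]);

    cudaFree(d);
    free(h);
    return ok ? 0 : 1;
}
```

Calling overlap_cpu_and_gpu() from main() and compiling with nvcc should be enough to try it. The results of the kernel are only read back after the synchronize, since nothing before that point guarantees the GPU is done.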

bugBot · November 24, 2008, 3:20am

Thanks a lot!

I have a couple of other questions. When I run deviceQuery on my card it says…

"Concurrent copy and execution: No"

What does this mean? Does it mean prefetching might not give me better performance? Will the CPU be able to execute asynchronously with the GPU on my card? (The full deviceQuery output is below.)

Also, I have 2 cards installed in another machine. How can I choose which card to run my program on?

Thanks a lot!
Sundaresan

output:

Device 0: “Tesla C870”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No

Test PASSED

Sarnath · November 24, 2008, 6:36am

“Concurrent copy and execution” refers to overlapping copies between GPU memory and system RAM with GPU kernel execution.

Some GPUs can perform memory copies (the asynchronous variant, cudaMemcpyAsync, on page-locked host memory) concurrently with kernel execution. Streams take advantage of this: while kernelA is executing in one stream, the memory copy for another kernel can be happening in a different stream, and so on.
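
As a rough illustration of the streams idea, the sketch below splits a buffer into chunks and issues copy-in, kernel, and copy-out for each chunk in its own stream. It assumes a toolkit recent enough to offer cudaDeviceGetAttribute() with cudaDevAttrAsyncEngineCount. The names add_one, streamed_pipeline, and CHECK are hypothetical. On a card that reports no concurrent copy and execution, the code should still produce correct results; the copies and kernels simply would not overlap. Note that this capability concerns copy/kernel overlap only; kernel launches are asynchronous with respect to the CPU either way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Leave the enclosing function on the first CUDA error. Sketch only:
// resources allocated before the failure are not released on this path.
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical kernel: adds 1 to every element of a chunk.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns 0 on success, 1 on a CUDA error or an unexpected result.
int streamed_pipeline(void)
{
    const int kChunks = 4;
    const int kChunkLen = 1 << 18;
    const size_t chunk_bytes = kChunkLen * sizeof(float);

    // Ask the current device how many asynchronous copy engines it has.
    int dev = 0, engines = 0;
    CHECK(cudaGetDevice(&dev));
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, dev));
    printf("Async copy engines: %d (0 means copies cannot overlap kernels)\n",
           engines);

    // Asynchronous copies need page-locked (pinned) host memory.
    float *h = NULL, *d = NULL;
    CHECK(cudaMallocHost((void **)&h, kChunks * chunk_bytes));
    CHECK(cudaMalloc((void **)&d, kChunks * chunk_bytes));
    for (int i = 0; i < kChunks * kChunkLen; ++i) h[i] = 0.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamCreate(&streams[c]));

    // Work within one stream runs in order; different streams may overlap.
    for (int c = 0; c < kChunks; ++c) {
        float *hc = h + c * kChunkLen;
        float *dc = d + c * kChunkLen;
        CHECK(cudaMemcpyAsync(dc, hc, chunk_bytes,
                              cudaMemcpyHostToDevice, streams[c]));
        add_one<<<(kChunkLen + 255) / 256, 256, 0, streams[c]>>>(dc, kChunkLen);
        CHECK(cudaMemcpyAsync(hc, dc, chunk_bytes,
                              cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for every stream to drain

    int ok = 1;
    for (int i = 0; i < kChunks * kChunkLen; ++i)
        if (h[i] != 1.0f) { ok = 0; break; }

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? 0 : 1;
}
```

Whether the chunks actually overlap depends on the hardware and driver; a profiler timeline is the usual way to confirm it.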

MisterAnderson42 · November 24, 2008, 1:51pm

To choose which card your program runs on, call cudaSetDevice(). See the programming guide and reference manual for more information.
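
A small sketch of device selection, assuming at least one CUDA device is installed. The helper name select_device is hypothetical; the runtime calls it uses (cudaGetDeviceCount, cudaGetDeviceProperties, cudaSetDevice, cudaGetDevice) are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: lists the installed devices, then makes `wanted`
// the current device for the calling host thread.
// Returns the selected device index, or -1 on failure.
int select_device(int wanted)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) return -1;
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    if (wanted < 0 || wanted >= count) return -1;
    if (cudaSetDevice(wanted) != cudaSuccess) return -1;

    // Confirm which device subsequent allocations and launches will use.
    int current = -1;
    if (cudaGetDevice(&current) != cudaSuccess) return -1;
    return current;
}
```

Calling select_device(1) before any allocation or kernel launch would direct later work to the second card, if one is present.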
