While the GPU is executing call_Kernel, I want the CPU to execute do_something. Is it possible to do this in CUDA? Is there anything special I need to do to make sure the CPU starts executing the next statement in the program (after the kernel call) without waiting for the results of the kernel?
Once a kernel is launched, the CPU continues executing asynchronously with respect to the kernel; the kernel is not guaranteed to have finished until you perform a global synchronization or synchronize on the stream associated with the kernel launch. This is covered in the CUDA 2.0 Programming Guide.
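A minimal sketch of this behavior, using the kernel and host-function names from the question (the grid/block sizes and the buffer are illustrative placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void call_Kernel(float *data)
{
    // ... device work ...
}

void do_something(void)
{
    // ... CPU-side work ...
}

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));

    // The launch returns control to the host immediately;
    // no special flag is needed.
    call_Kernel<<<16, 256>>>(d_data);

    // This runs on the CPU while the GPU may still be
    // executing the kernel.
    do_something();

    // Block the host until the kernel has finished.
    // (The CUDA 2.x runtime calls this cudaThreadSynchronize();
    // newer toolkits use cudaDeviceSynchronize().)
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Note that a cudaMemcpy after the launch would also implicitly wait for the kernel to finish before copying results back.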
I have a couple of other questions. When I run deviceQuery on my card it says…
" Concurrent copy and execution: No"
What does this mean? Does it mean prefetching might not give me better performance? Will the CPU still be able to execute asynchronously with the GPU on my card? (Please find the output of the entire deviceQuery below.)
Also, I have 2 cards installed in another machine. How can I choose which card to run my program on?
Thanks a lot!
Device 0: “Tesla C870”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No
“Concurrent copy and execution” refers to performing a GPU from/to system RAM copy at the same time as GPU kernel execution.
Some GPUs support running a “cudaMemcpy” concurrently with GPU kernel execution; the “Streams” concept takes advantage of this. So, while kernelA is executing on the GPU, a memory copy for another stream could be in flight, and so on. Since your card reports “No”, copies and kernel execution are serialized on it, so this kind of overlap will not help. CPU/GPU asynchrony (your first question) is unaffected: the CPU can still run ahead of the kernel.
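For reference, the streams pattern looks like the sketch below. It only actually overlaps on devices reporting “Concurrent copy and execution: Yes”; the kernel name, sizes, and buffers are illustrative placeholders.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    float *h_buf, *d_a, *d_b;

    // Asynchronous copies require page-locked (pinned) host memory.
    cudaMallocHost((void **)&h_buf, N * sizeof(float));
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // On capable hardware, the kernel in stream s0 can execute
    // while the copy in stream s1 is in flight.
    kernelA<<<N / 256, 256, 0, s0>>>(d_a, N);
    cudaMemcpyAsync(d_b, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, s1);

    cudaThreadSynchronize();  // cudaDeviceSynchronize() on newer toolkits

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

For your two-card question: cudaGetDeviceCount() reports how many CUDA devices are installed, and cudaSetDevice(n), called before any other CUDA work in a host thread, selects which card that thread uses.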