I am new to CUDA and new to this forum, so forgive me if this question has been asked many times. I need to perform some operations on an array whose size is larger than 512 (e.g. 5120). My code does something like:
Step 1: Copy array[5120] to the device
Step 2: Launch 5120 kernel threads to perform the operations.
Step 3: Copy the result array[5120] back to the host.
It is pretty generic. However, my problem is that step 3 must be executed only after every single thread has completed. How can I ensure that all threads have completed? I know I have to divide those threads into blocks (in this case 10 blocks of 512 threads), but can I synchronize the blocks?
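Roughly, the device code I have in mind is a sketch like this (the kernel name myOp and the squaring operation are just placeholders for my real operation):

__global__ void myOp(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n)                                       // guard against any extra threads
        data[i] = data[i] * data[i];                 // placeholder for the real operation
}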
Synchronizing the blocks while the kernel is executing is not possible, though some people have tried to do it. But if you just want to copy the result of your kernel back, this is very simple: a call to cudaThreadSynchronize() on the host side will block until all thread blocks (the whole kernel) have finished. Then you can copy back your result.
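A minimal host-side sketch of that pattern, assuming a kernel myOp like the one sketched in the question (the array size and launch configuration are taken from your example; on newer toolkits cudaThreadSynchronize() has been superseded by the equivalent cudaDeviceSynchronize()):

#include <cuda_runtime.h>

#define N 5120

int main(void)
{
    float h_data[N];                      // host array (assume it has been filled in)
    float *d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));

    // Step 1: copy the array to the device
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: launch 10 blocks of 512 threads = 5120 threads
    myOp<<<10, 512>>>(d_data, N);

    // Block the host until every thread block of the kernel has finished
    cudaThreadSynchronize();              // cudaDeviceSynchronize() on newer toolkits

    // Step 3: copy the result back to the host
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}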
Note: using more than 10 blocks could increase your performance; you should have at least 2 blocks per multiprocessor.
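For example, with smaller blocks (256 threads per block is just an illustrative choice) the grid size can be computed from the array length:

int threadsPerBlock = 256;                                  // smaller blocks -> more blocks per multiprocessor
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // = 20 for N = 5120
myOp<<<blocks, threadsPerBlock>>>(d_data, N);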
This is also what happens if you launch the kernel and then immediately queue up a cudaMemcpy() back to the host: the copy will not execute before the kernel is done.
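In other words, something along these lines (same placeholder kernel as above) is already safe, because a blocking cudaMemcpy() in the default stream is ordered after the kernel launch and does not return until the kernel has finished:

myOp<<<10, 512>>>(d_data, N);                                            // asynchronous launch
cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);   // implicitly waits for the kernel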