I am new to CUDA and new to this forum. Forgive me if this question has been asked many times. I need to perform some operations on an array with more than 512 elements (e.g. 5120). My code does something like this:
Step 1: Copy array to the device
Step 2: Launch 5120 kernel threads to perform the operations.
Step 3: Copy the result array back to the host.
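To make the three steps concrete, here is a minimal sketch of what I mean. The kernel name `doubleElements`, the "double each element" operation, and the constants are placeholders I made up for illustration, and error checking is left out to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 5120                 // array size from the example above
#define THREADS_PER_BLOCK 512  // assumed maximum threads per block

// Hypothetical operation: each thread doubles one array element.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    static float host[N];
    for (int i = 0; i < N; ++i)
        host[i] = (float)i;

    float *dev = NULL;
    cudaMalloc((void **)&dev, N * sizeof(float));

    // Step 1: copy the array to the device.
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: launch 5120 threads as 10 blocks of 512 threads each.
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    doubleElements<<<blocks, THREADS_PER_BLOCK>>>(dev, N);

    // Step 3: copy the result array back to the host.
    cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    printf("host[1] = %f\n", host[1]);
    return 0;
}
```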
It is pretty generic. However, my problem is that step 3 must only execute after every single thread has completed. How can I ensure that all threads have completed? I know I can divide the threads into blocks (in this case 10 blocks of 512 threads each), but can I synchronize across blocks?
Thanks in advance.