Synchronizing Blocks

As far as I have understood, __syncthreads() synchronizes all the warps in a block ONLY.

My GPU has this configuration:
(2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024

So, if I launch more than 1024 threads in total, the blocks will not be in sync with each other, right?

And if the total number of threads I launch exceeds 4096 (2 multiprocessors × 2048 threads, the most my GPU can keep resident at once), how does the GPU handle that?

Last and most important: how do I overcome this block synchronization issue so that I can launch, for example, 1,000,000 threads and have them all completely in sync?

It is not possible to keep threads running on different multiprocessors in sync at the instruction level.

Look into dynamic parallelism. It allows one kernel to launch child kernels of arbitrary size, and when a child kernel is done, you know that all of its threads have completed. This is a convenient way to achieve barrier synchronization (the barrier is the end of the kernel, and it covers the entire launch grid). CUDA 5.0 or later and Compute Capability 3.5 or later hardware are required.
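If your card doesn't support dynamic parallelism, the same "end of the kernel is the barrier" idea also works with ordinary host-side launches. Kernels issued to the same stream run in order, so everything the first kernel wrote is visible to the second. Below is a minimal sketch of that variant (host launches, not dynamic parallelism). The names `phase_one`, `phase_two` and `run_two_phase` are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: every thread writes its own slot.
__global__ void phase_one(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = i;
}

// Phase 2: every thread reads a slot written by a *different* thread,
// possibly in a different block. Only safe once phase 1 is finished everywhere.
__global__ void phase_two(const int *a, int *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[(i + 1) % n];
}

// Hypothetical helper: runs both phases and checks the result on the host.
bool run_two_phase(int n) {
    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return false; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Same (default) stream: phase_two does not start until phase_one has
    // completed for the whole grid. The kernel boundary is the barrier.
    phase_one<<<blocks, threads>>>(a, n);
    phase_two<<<blocks, threads>>>(a, b, n);

    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != (i + 1) % n) return false;
    return true;
}
```

Calling `run_two_phase(1000000)` uses about a million threads per phase. They are never all running at once, but every one of them has passed the barrier before phase 2 begins.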

The new CUDA 9 feature “Cooperative Groups” adds a variety of synchronization primitives (“synchronize at any scale”), including synchronization across the whole grid and even across multiple GPUs. I think grid-wide synchronization requires Pascal (compute capability 6.0) or newer hardware. Also, don’t expect to keep threads synchronized at the instruction level. You can only achieve synchronization at specific points in your code (wherever you place those synchronization primitives).
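For what it's worth, here is a rough sketch of a grid-wide barrier with Cooperative Groups. It rests on some assumptions. The kernel has to be launched with `cudaLaunchCooperativeKernel`. The grid must be small enough that all blocks are resident at the same time, so the threads loop over the data instead of using one thread per element. Older toolkits may also need `-rdc=true` when compiling. The names `two_phase_coop` and `run_grid_sync_demo` are invented for this example.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

__global__ void two_phase_coop(int *a, int *b, int n) {
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;
    int first = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: grid-stride loop, each thread handles several elements.
    for (int i = first; i < n; i += stride) a[i] = i;

    grid.sync();  // barrier across every block of this launch

    // Phase 2: reads values that other blocks wrote in phase 1.
    for (int i = first; i < n; i += stride) b[i] = a[(i + 1) % n];
}

// Returns 1 if the result checks out, 0 on a mismatch or runtime error,
// and -1 if the device (or driver) does not support cooperative launches.
int run_grid_sync_demo(int n) {
    int dev = 0, supported = 0, numSms = 0, blocksPerSm = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) return -1;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);

    // Size the grid so that all blocks can be co-resident.
    int threads = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, two_phase_coop,
                                                  threads, 0);
    int blocks = blocksPerSm * numSms;
    if (blocks <= 0) return -1;

    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return 0;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return 0; }

    void *args[] = { &a, &b, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)two_phase_coop,
                                                  dim3(blocks), dim3(threads),
                                                  args, 0, 0);
    std::vector<int> host(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), b, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < n; ++i)
        if (host[i] != (i + 1) % n) return 0;
    return 1;
}
```

Note the trade-off: the barrier only works because the grid is capped at what the hardware can hold at once, so a million elements are processed by a few thousand threads, not by a million threads.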

You have already found that your GPU cannot run more than 4096 threads simultaneously. So the way CUDA executes 1M threads is the following: it starts some 4K (or fewer) of them, and as blocks finish and resources are freed, it starts new blocks until all 1M threads have been executed. As you can see, it’s hard to synchronize them all, since new threads cannot start at all until some previous ones have finished.
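To put a rough number on that, here is a small host-side sketch that estimates how many "rounds" of resident threads a launch needs. It is an idealization: it assumes every multiprocessor is filled to its maximum thread count, and in reality blocks are replaced one by one as they finish, not in lockstep rounds. The function names `waves_needed` and `waves_on_device` are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Idealized count of rounds: total threads divided by the number of threads
// the GPU can keep resident at once, rounded up.
long long waves_needed(long long total_threads, int sm_count,
                       int max_threads_per_sm) {
    long long resident = (long long)sm_count * max_threads_per_sm;
    return (total_threads + resident - 1) / resident;
}

// Same estimate using the limits reported by device 0.
// Returns -1 if the device properties cannot be queried.
long long waves_on_device(long long total_threads) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return waves_needed(total_threads, prop.multiProcessorCount,
                        prop.maxThreadsPerMultiProcessor);
}
```

With the numbers from the question (2 multiprocessors, 2048 threads each) and one million threads, `waves_needed(1000000, 2, 2048)` gives 245, so the last threads cannot even start until many earlier ones have exited.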

Overall, have you read any CUDA book or the plain CUDA manual? GPUs differ from CPUs in some fundamental ways, and without learning them the hard way you will never understand how a GPU works.

It’s easiest if you simply start with this basic principle of CUDA programming:

Each thread block executes independently of any other thread block.

Then design your code accordingly.
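As one possible illustration of designing around that principle, here is a sketch of a sum where each block synchronizes only internally with `__syncthreads()` and then contributes its result through a single `atomicAdd`, so no block ever waits on another. It assumes a block size of exactly 256 (a power of two). The names `block_sum` and `sum_of_ones` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed power of two, must match the launch

__global__ void block_sum(const int *in, int *total, int n) {
    __shared__ int partial[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // all threads of THIS block have stored their value

    // Tree reduction in shared memory, block-local barriers only.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomic per block; the order in which blocks arrive does not matter.
    if (tid == 0) atomicAdd(total, partial[0]);
}

// Sums n ones on the GPU; returns the total, or -1 on a CUDA error.
int sum_of_ones(int n) {
    std::vector<int> ones(n, 1);
    int *in = nullptr, *total = nullptr, result = -1;
    if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) { cudaFree(in); return -1; }

    cudaMemcpy(in, ones.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(total, 0, sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    block_sum<<<blocks, BLOCK_SIZE>>>(in, total, n);

    if (cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost)
            != cudaSuccess)
        result = -1;
    cudaFree(in);
    cudaFree(total);
    return result;
}
```

The result is the same whichever order the hardware happens to run the blocks in, which is the property to aim for.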