Synchronizing Blocks

mashnoon.islam · January 9, 2018, 2:58pm

As far as I have understood, __syncthreads() synchronizes all the warps in a block ONLY.

My GPU has this configuration:
(2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024

So, if I launch more than 1024 threads in total, the blocks on my GPU will not be in sync, right?

And if the total number of threads I launch exceeds 4096, which is the number of threads my GPU can physically hold at once, how does the GPU handle that?

Last and most important question: how do I overcome this block synchronization issue so that I can launch, for example, 10,00,000 threads and have them all completely in sync?

cbuchner1 · January 9, 2018, 3:40pm

It is not possible to keep threads running on different multiprocessors in sync at the instruction level.

Look into dynamic parallelism. It allows one kernel to launch child kernels of arbitrary size, and when a child kernel is done, you know that all threads of that child kernel have completed. This is a convenient way to achieve a barrier synchronization (the barrier is the end of the kernel, and it affects the entire launch grid). CUDA 5.0 or later and Compute Capability 3.5 hardware or later are required.
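The "end of a kernel is a barrier for the whole grid" idea can also be tried without dynamic parallelism. The sketch below is not the dynamic-parallelism version described above; it swaps in two ordinary host-side launches on the default stream, which run in order, so the second kernel only starts after every block of the first has finished. The kernel names `fillPattern` and `addNeighbor` and the helper `runTwoPhase` are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: each thread writes one element.
__global__ void fillPattern(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = i % 7;
}

// Phase 2: each thread reads a neighbor that may have been written
// by a different block in phase 1.
__global__ void addNeighbor(const int *a, int *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i] + a[(i + 1) % n];
}

// Hypothetical helper: returns the number of wrong elements,
// or -1 if a CUDA call failed.
int runTwoPhase(int n) {
    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return -1; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Both launches go to the default stream, so they execute in order.
    // The boundary between them acts as the grid-wide barrier.
    fillPattern<<<blocks, threads>>>(a, n);
    addNeighbor<<<blocks, threads>>>(a, b, n);

    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != i % 7 + ((i + 1) % n) % 7) ++wrong;
    return wrong;
}
```

Calling `runTwoPhase(1000000)` from `main` should report zero wrong elements even though far more threads are launched than can be resident at once.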

The new CUDA 9 feature “Cooperative Groups” has a variety of new synchronization features (“synchronize at any scale”), even across multiple GPUs. I think this new feature requires Pascal (Compute 6.0 cards) or newer to operate. Also, don’t expect to keep threads synchronized at the instruction level. You can only achieve synchronicity at specific points in your code (wherever you place those synchronization primitives).
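For reference, a minimal sketch of a grid-wide barrier with Cooperative Groups might look like the following. It assumes the device reports `cudaDevAttrCooperativeLaunch`, and it sizes the grid so that all blocks can be resident at the same time, which a cooperative launch requires. Older toolkits may also need `-rdc=true` and a suitable `-arch` flag when compiling. The names `twoPhaseCoop` and `runGridSync` are hypothetical.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

__global__ void twoPhaseCoop(int *a, int *b, int n) {
    cg::grid_group grid = cg::this_grid();
    const unsigned long long count = n;

    // Phase 1: grid-stride loop, because the grid is smaller than n.
    for (unsigned long long i = grid.thread_rank(); i < count; i += grid.size())
        a[i] = (int)(i % 7);

    grid.sync();  // every thread of every block waits here

    // Phase 2: safe to read elements written by other blocks.
    for (unsigned long long i = grid.thread_rank(); i < count; i += grid.size())
        b[i] = a[i] + a[(i + 1) % count];
}

// Hypothetical helper: returns the number of wrong elements,
// -1 if cooperative launch is not supported, -2 on other CUDA errors.
int runGridSync(int n) {
    const int dev = 0, threads = 256;
    int coop = 0, sms = 0, perSm = 0;
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) return -1;

    // Launch only as many blocks as can be resident simultaneously.
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSm, twoPhaseCoop, threads, 0);
    const int blocks = sms * perSm;
    if (blocks < 1) return -2;

    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return -2;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return -2; }

    void *args[] = {&a, &b, &n};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)twoPhaseCoop, dim3(blocks), dim3(threads), args, 0, 0);

    std::vector<int> h(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(h.data(), b, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return -2;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != i % 7 + ((i + 1) % n) % 7) ++wrong;
    return wrong;
}
```

Note that the grid here is deliberately small and each thread loops over many elements. A grid of a million threads could not all be resident at once, so it could not use `grid.sync()`.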

BulatZiganshin · January 10, 2018, 12:59am

You have already found that your GPU cannot run more than 4096 threads simultaneously. So the way CUDA executes 10M threads is the following: it launches some 4K (or fewer) of them, and as blocks finish and resources are freed it launches new ones, until all 10M threads have been executed. As you can see, it is hard to synchronize them all, since new threads cannot be started at all until some previous ones have finished.

Overall, have you read any CUDA book or the plain CUDA manual? GPUs have some fundamental differences from CPUs, and without learning them the hard way you will never get how a GPU works.
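A small sketch of that behavior, assuming device 0: it computes how many threads can be resident at once (multiprocessor count times the per-multiprocessor limit, which should come out as 2 x 2048 = 4096 on the GPU from the question) and then launches far more threads than that, counting how many actually ran. The helper names `residentThreadLimit` and `launchAndCount` are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Each launched thread with a valid index bumps a global counter once.
__global__ void countThreads(unsigned long long *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(counter, 1ULL);
}

// Hypothetical helper: upper bound on simultaneously resident threads,
// or -1 if the device properties cannot be read.
long long residentThreadLimit() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}

// Hypothetical helper: launches at least n threads in blocks of 256 and
// returns how many of them incremented the counter, or -1 on a CUDA error.
long long launchAndCount(int n) {
    unsigned long long *counter = nullptr;
    if (cudaMalloc(&counter, sizeof(unsigned long long)) != cudaSuccess) return -1;
    cudaMemset(counter, 0, sizeof(unsigned long long));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The hardware scheduler feeds blocks to the multiprocessors as
    // earlier blocks retire; the program does not manage this itself.
    countThreads<<<blocks, threads>>>(counter, n);

    unsigned long long host = 0;
    cudaError_t err = cudaMemcpy(&host, counter, sizeof(host),
                                 cudaMemcpyDeviceToHost);
    cudaFree(counter);
    return err == cudaSuccess ? (long long)host : -1;
}
```

`launchAndCount(10000000)` should return 10000000 even when `residentThreadLimit()` is only a few thousand, which is the point: every thread runs, but not all at the same time.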

njuffa · January 10, 2018, 3:45am

It’s easiest if you simply start with this basic principle of CUDA programming:

Each thread block executes independently of any other thread block.

Then design your code accordingly.
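As one possible illustration of designing around that principle, here is a sketch of a sum reduction in which each block only synchronizes internally with `__syncthreads()` and then contributes its result with a single `atomicAdd`. No block ever waits for another block. The names `blockSum`, `sumOnDevice`, and `kThreads` are hypothetical, and the kernel assumes it is launched with exactly `kThreads` threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // block size; must be a power of two here

__global__ void blockSum(const int *in, int n, unsigned long long *total) {
    __shared__ unsigned long long partial[kThreads];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;  // assumes non-negative input
    __syncthreads();  // synchronizes this block only

    // Tree reduction in shared memory, entirely inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block; the order in which blocks arrive does not matter.
    if (threadIdx.x == 0) atomicAdd(total, partial[0]);
}

// Hypothetical helper: returns the sum of `values`, or -1 on a CUDA error.
long long sumOnDevice(const std::vector<int> &values) {
    const int n = (int)values.size();
    int *in = nullptr;
    unsigned long long *total = nullptr;
    if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(unsigned long long)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    cudaMemcpy(in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(total, 0, sizeof(unsigned long long));

    const int blocks = (n + kThreads - 1) / kThreads;
    blockSum<<<blocks, kThreads>>>(in, n, total);

    unsigned long long host = 0;
    cudaError_t err = cudaMemcpy(&host, total, sizeof(host),
                                 cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(total);
    return err == cudaSuccess ? (long long)host : -1;
}
```

The result is the same whichever order the blocks happen to run in, so the code works unchanged on a GPU with 2 multiprocessors or 80.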
