question about __syncthreads();

mc_line · March 16, 2008, 7:03pm

Hi

__syncthreads();

synchronizes threads inside a block.

How can I synchronize threads from all blocks?

Thanks

seibert · March 16, 2008, 7:10pm

CUDA does not support block-level synchronization (that is, synchronization between blocks). What problem are you trying to solve that requires block synchronization?

mc_line · March 16, 2008, 7:32pm

ouch, that’s a problem.

I have a loop over a huge array that I am moving to CUDA. It requires many threads and many blocks to complete.

the classic

for(…)
array[i] = …;

I get an error if I use a huge number of threads and only one block, so I have to split the work across several blocks.

I need to get the result of this computation and synchronize so that every array[i] has been written, and then execute another

for(…)
array[k] = …;

using the previous results.

I wanted to avoid calling the device kernel twice in two different functions because it takes time, and the purpose of this school project is performance.

seibert · March 16, 2008, 9:10pm

The rationale against block synchronization is this: You can (and are strongly encouraged to) request more blocks for your kernel than there are multiprocessors on your card. In the CUDA model, multiple blocks can run on one multiprocessor if register and shared memory usage permit. However, a block can never be “swapped out” of a multiprocessor to allow another block to run. Once a block starts, it runs to completion. As blocks finish, the blocks that haven’t started yet are loaded onto the multiprocessor to run.

A block-level synchronization construct could cause deadlocks for any kernel where the number of blocks is greater than the number of multiprocessors, which describes most kernels people write. All the running blocks would be stuck waiting for blocks that hadn’t even started yet, and can’t be loaded since all the multiprocessor resources are occupied. (Additionally, the hardware is simpler if you don’t allow multiprocessors to be able to synchronize with each other.)
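To make the deadlock condition concrete, here is a minimal sketch that asks the runtime how many blocks of a kernel can be resident on the device at once. The kernel `fill`, the helper `max_resident_blocks`, and the use of device 0 are illustrative assumptions, not something from this thread, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` comes from CUDA releases newer than the one discussed here.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy query has something to inspect.
__global__ void fill(float *array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array[i] = 1.0f;
}

// Returns how many blocks of `fill` can be resident on device 0 at the same
// time for the given block size, or -1 if a runtime call fails.
int max_resident_blocks(int threads_per_block)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, fill, threads_per_block, 0) != cudaSuccess)
        return -1;

    // Resident capacity = blocks per multiprocessor * number of multiprocessors.
    return blocks_per_sm * prop.multiProcessorCount;
}
```

If a launch requests more blocks than `max_resident_blocks(256)` reports, some blocks cannot start until others finish, which is exactly the case where a barrier across all blocks would hang.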

I’d benchmark the multiple kernel call approach and see how fast it really is. You might be surprised. Keep in mind that there is some driver overhead the first time you call a particular kernel in your program, so when doing timing studies, you should “warm up” your kernels by calling them once before your timer starts.
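As a rough sketch of the multiple kernel call approach: two launches issued back to back on the same (default) stream run in order, so the boundary between them behaves like a barrier across all blocks. The kernel names `first_pass` and `second_pass`, the helper `run_two_passes`, and the arithmetic inside them are made up for illustration; the original loops in this thread are not shown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// First pass: every thread writes one element of the array.
__global__ void first_pass(float *array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array[i] = 2.0f * i;
}

// Second pass: reads an element that was usually written by a different block.
__global__ void second_pass(const float *array, float *out, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n)
        out[k] = array[k] + array[n - 1 - k];
}

// Runs both passes and returns true if every out[k] equals 2 * (n - 1).
bool run_two_passes(int n)
{
    float *array = nullptr, *out = nullptr;
    if (cudaMalloc(&array, n * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(array);
        return false;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    first_pass<<<blocks, threads>>>(array, n);
    // No explicit barrier here: second_pass does not start until every
    // block of first_pass has finished, because both use the same stream.
    second_pass<<<blocks, threads>>>(array, out, n);

    // cudaMemcpy waits for the preceding kernels before copying.
    std::vector<float> host(n);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(array);
    cudaFree(out);
    if (err != cudaSuccess)
        return false;

    for (int k = 0; k < n; ++k)
        if (host[k] != 2.0f * (n - 1))
            return false;
    return true;
}
```

Calling `run_two_passes(100000)` from host code uses several hundred blocks, so the second pass depends on values produced by blocks other than its own.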

mc_line · March 17, 2008, 12:06am

Oh I understand now.

Excellent explanation!

Thanks a lot !

Hella_Yu · March 17, 2008, 3:16am

Interesting point about warming up the kernel.
Could you give more details on that? I assumed each kernel call is independent, whether it is the same kernel or a different one, so how does a “warm up” affect performance in practice?

Sarnath · March 17, 2008, 5:03am

Hi Seibert,

That’s an excellent explanation! It never occurred to me all these days. Thank you.

MisterAnderson42 · March 17, 2008, 12:37pm

The first call you make needs to initialize the driver, allocate some memory for CUDA on the GPU, initialize the context, copy your kernel code over to the GPU, and probably a few other things I’m missing. As such, benchmarks should always be performed like this:

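    kernel<<<grid, threads>>>();     // warm-up call, not timed
    cudaThreadSynchronize();         // wait for the warm-up to finish

    start = timer;
    for (int i = 0; i < 1000; i++)
        kernel<<<grid, threads>>>();
    cudaThreadSynchronize();         // wait for all 1000 launches to finish
    end = timer;

    print((end - start) / 1000);     // average time per launch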

… If your kernel takes a long time to execute, you can adjust the 1000 repeats down.
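For a version that compiles, here is one possible sketch of the same warm-up-then-average pattern using CUDA events as the timer. The placeholder `kernel` body, the helper `average_kernel_ms`, and the block size of 256 are assumptions for illustration; it also uses `cudaDeviceSynchronize`, the name later CUDA releases give to `cudaThreadSynchronize`.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the kernel you want to time.
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;
}

// Returns the average time per launch in milliseconds, or -1.0f on error.
float average_kernel_ms(int n, int repeats)
{
    float *data = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(data, 0, n * sizeof(float));

    int threads = 256;
    int grid = (n + threads - 1) / threads;

    // Warm-up launch, not timed.
    kernel<<<grid, threads>>>(data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < repeats; i++)
        kernel<<<grid, threads>>>(data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float total_ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);

    return (err == cudaSuccess) ? total_ms / repeats : -1.0f;
}
```

For example, `average_kernel_ms(1 << 20, 1000)` mirrors the 1000 repeats above; events measure time on the GPU itself, so the result excludes the one-time setup cost absorbed by the warm-up launch.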

jordyvaneijk · March 17, 2008, 1:04pm

This is only good for timing purposes, I hope? Because you don’t want to run your kernel 1000 times or even 10 times… I just want to run it once and have it be fast…

MisterAnderson42 · March 17, 2008, 3:06pm

Well, in my app I run kernels tens of millions of times, so a few thousand is a very short benchmark :) The point is to get a decent average of your performance by running many tests, otherwise you can’t believe your results.

If you really only want one kernel run per application run, it will never be fast because of the driver initialization overhead, not to mention the time spent loading the application off the disk. I’ve never timed it myself, but I recall others on the forums mentioning initialization times of about half a second. That is an eternity compared to a kernel that executes in less than 1 ms.
