question about __syncthreads();



It synchronizes threads inside a block.

How can I synchronize threads across all blocks?


CUDA does not support block level synchronization. What problem are you trying to solve that requires block synchronization?

Ouch, that’s a problem.

I have a loop over a huge array that I am porting to CUDA. It requires many threads and many blocks to complete.

the classic

array[i] = …;

I get an error if I use a huge number of threads and only one block, so I have to divide the work across several blocks.

I need to get the result of this computation and synchronize, so that every array[i] can be read, before executing another

array[k] = …;

using the previous results.

I wanted to avoid calling the device kernel twice in two different functions because it takes time, and the purpose of this school project is performance.

The rationale against block synchronization is this: You can (and are strongly encouraged to) request more blocks for your kernel than there are multiprocessors on your card. In the CUDA model, multiple blocks can run on one multiprocessor if registers and shared memory usage permits. However, a block can never be “swapped out” of a multiprocessor to allow another block to run. Once a block starts, it runs to completion. As blocks finish, the blocks that haven’t started yet are loaded onto the multiprocessor to run.

A block-level synchronization construct could cause deadlocks for any kernel where the number of blocks is greater than the number of multiprocessors, which describes most kernels people write. All the running blocks would be stuck waiting for blocks that hadn’t even started yet, and can’t be loaded since all the multiprocessor resources are occupied. (Additionally, the hardware is simpler if you don’t allow multiprocessors to be able to synchronize with each other.)
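To make the deadlock concrete, here is a hedged sketch of a naive grid-wide barrier built from an atomic counter (the kernel and its phases are hypothetical; this is exactly the construct that cannot work, so do not use it):

```cuda
// NAIVE grid-wide barrier sketch -- illustrates the deadlock, do not use.
__device__ unsigned int arrived = 0;

__global__ void two_phase(float *array, int n, unsigned int numBlocks)
{
    // ... phase 1: each block does its share of the work ...

    __syncthreads();                 // every thread in this block is done
    if (threadIdx.x == 0)
        atomicAdd(&arrived, 1u);     // this block signals its arrival

    // Spin until every block has arrived.  If more blocks were launched
    // than can be resident at once, the resident blocks spin here forever,
    // waiting for blocks the hardware can never schedule: deadlock.
    while (atomicAdd(&arrived, 0u) < numBlocks)
        ;

    // ... phase 2: never reached when the grid exceeds resident capacity ...
}
```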

I’d benchmark the multiple kernel call approach and see how fast it really is. You might be surprised. Keep in mind that there is some driver overhead the first time you call a particular kernel in your program, so when doing timing studies, you should “warm up” your kernels by calling them once before your timer starts.
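To make the multiple-kernel-call approach concrete, here is a sketch (the kernel names, the computations, and the launch parameters are all placeholders). Two launches issued to the same stream execute in order, so the end of the first kernel acts as the grid-wide synchronization point:

```cuda
// Sketch: splitting two dependent passes over `array` into two kernels.
__global__ void phase1(float *array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array[i] = i * 2.0f;          // placeholder: array[i] = ...;
}

__global__ void phase2(const float *array, float *out, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n)
        out[k] = array[n - 1 - k];    // placeholder: may read ANY array[i]
}

// Host side: back-to-back launches.  phase2 does not begin until every
// block of phase1 has finished, so no explicit synchronization is needed:
//   phase1<<<blocks, threads>>>(d_array, n);
//   phase2<<<blocks, threads>>>(d_array, d_out, n);
```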

Oh I understand now.

Excellent explanation!

Thanks a lot !

Interesting point about warming up the kernel.
Could you give more details on that? I assumed each kernel call was independent, whether it is the same kernel or a different one. How does a ‘warm up’ affect performance in practice?

Hi Seibert,

That’s an excellent explanation! It never occurred to me in all this time. Thank you.

The first call you make needs to initialize the driver, allocate some memory for CUDA on the GPU, initialize the context, copy your kernel code over to the GPU, and probably a few other things I’m missing. As such, benchmarks should always be performed like this:

kernel<<<grid, threads>>>();      // warm-up call, not timed

cudaThreadSynchronize();          // launches are asynchronous: wait before starting the timer

start = timer

for (int i = 0; i < 1000; i++)

    kernel<<<grid, threads>>>();

cudaThreadSynchronize();          // wait for the last launch to actually finish

end = timer

print (end - start)/1000

… If your kernel takes a long time to execute, you can adjust the 1000 repeats down.
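As an aside, the same measurement can be done with CUDA events, which timestamp on the GPU itself (this is a sketch; `kernel`, `grid`, and `threads` are placeholders from the example above):

```cuda
// Timing a kernel with CUDA events instead of a host timer.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

kernel<<<grid, threads>>>();          // warm-up launch, not timed

cudaEventRecord(start, 0);
for (int i = 0; i < 1000; i++)
    kernel<<<grid, threads>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);           // wait until all launches have finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %f ms\n", ms / 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```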

This is only for timing purposes, I hope? You don’t want to run your kernel 1000 times, or even 10 times… I just want to run it once and have it be fast…

Well, in my app I run kernels tens of millions of times, so a few thousand is a very short benchmark :) The point is to get a decent average of your performance by running many tests; otherwise you can’t trust your results.

If you really only want one kernel run per application run, it will never be fast because of the driver initialization overhead, not to mention the time spent loading the application off the disk. I’ve never timed it myself, but I recall others on the forums mentioning initialization times being ~1/2 a second. That is an eternity compared to a kernel that executes in less than 1ms.
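If you want to see that startup cost yourself, a minimal sketch follows (assuming the runtime API; calling cudaFree(0) is a common idiom to force context creation without doing any real work):

```cuda
// Sketch: measuring one-time driver/context initialization overhead.
#include <cstdio>
#include <chrono>

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                               // triggers driver/context init
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("context initialization: %.3f s\n", secs);
    return 0;
}
```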