How can I be certain my kernel runs with 32 threads in one block and thus in perfect synchrony (i.e. via __syncthreads())?

I am running a kernel that would hypothetically look like this:

__global__ void complexKernel(unsigned int arraySize, float* c, float* c_1, float* c_2) {

    int i = threadIdx.x;
    for (int sample = 0; sample < 1024; sample++) {

        //push back
        c_2[i] = c_1[i];
        c_1[i] = c[i];
        
        __syncthreads();

        //increment
        c[i] = c[i] + 1;

        __syncthreads();

        //smooth
        if (i > 0 && i < arraySize - 1) {
            c[i] = (c[i] + c[i] + c[i + 1] + c[i - 1]) * 0.25;
        }
        __syncthreads();
    }
}

Because I need synchronization at multiple steps of each loop (this is just an example of the principle), I understand I must keep my kernel to 32 threads and one block.

Is it sufficient to run this just as:

kernel<<<1, 32, 0, stream>>>(parameters...)

And this will guarantee my __syncthreads(); are all successful and I will only be in one block where the perfect synchronization is possible?

Or is there anything else I must do? Thanks.

__syncthreads() will synchronize all threads within a thread block, regardless of the number of threads. It does not need to be 32 threads.

kernel<<<1, 32, 0, stream>>> will use only 1 thread block of size 32, yes.


There is no absolute guarantee that the threads won't diverge at any point in time, but in practice the control flow stays together, as long as you do not actually diverge it with control structures (if, for, ...).

Assume, for correctness, that divergences between threads of a warp can occur at any time. Assume, for performance, that divergences only happen at logical points (e.g. an if), and perhaps even that the assembler itself reconverges the threads.

One simple experiment I just did with a kernel<<<1, lessThan32, 0, stream>>> command was as follows:

__global__ void complexKernel(unsigned int arraySize, unsigned int numThreads, float* c, float* c_1, float* c_2) {
    int i = threadIdx.x;
    for (int sample = 0; sample < 1024; sample++) {

        //push back
        c_2[i] = c_1[i];
        c_1[i] = c[i];
        
        //__syncthreads();

        if (i == 2) {
            for (int j = 0; j < 100; j++) {
                float wasteTime = pow(52524, 5224);
            }
        }

        //increment (only if synchronized)
        if (i == 0) {
            c[i] = c[i] + 1;
        }
        else if (i > 0 && c_1[i] == c_1[i - 1]) {
            c[i] = c[i] + 1;
        }
        //__syncthreads();

        //smooth
        if (i > 0 && i < arraySize - 1) {
            c[i] = (c[i] + c[i] + c[i + 1] + c[i - 1]) * 0.25;
        }
        //__syncthreads();
    }
}

With numThreads 5 and arraySize 5, this just takes three arrays, pushes back the samples, increments per loop (so ideally every element of c should be 1024 by the end), then averages with neighbors.

If the threads desynchronize, I would expect the output not to be 1024 at every index (I believe). I am both smoothing and only incrementing if the sync holds at all times, so either effect should make a desync evident.

While the wasteTime loop does take a lot of time and slows the overall process down, the threads still don't desynchronize in this case. The result is the same whether the __syncthreads() calls are there or not: I get 1024 on every element of c[] by the end.

What is the reason for this? Why is no __syncthreads() needed? Is it predictable that these should not desynchronize? I would have thought I would have needed __syncthreads in this case and yet do not.

Even if your program is not correct (not enough __syncthreads), there is no guarantee that the output will be wrong. It is your responsibility to create a correct program. Use the available tools, e.g. cuda-memcheck.

__syncthreads() makes sure that memory writes are visible to the other threads. It also reconverges diverged threads; for this kernel, doing both warp-wide with __syncwarp() would be enough.

For memory to be visible there is more to it than everything running in the correct order. Between blocks, caches have to be updated. And within a block, the compiler has to know the synchronization points (by you manually specifying them); otherwise it may cache old values in registers instead of writing them out or rereading them from memory.

If it is about correctness, it is not enough to test and see whether the result is correct; you have to follow the rules to be sure.
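
For illustration, here is a minimal sketch of the original loop body using __syncwarp() instead of __syncthreads(), assuming a single block of exactly 32 threads (one full warp); the kernel name is invented, and the smoothing step is split into a read and a write around a barrier so the in-place update does not race with the neighbours' reads.

// Sketch only: launch as complexKernelWarp<<<1, 32, 0, stream>>>(arraySize, c, c_1, c_2)
__global__ void complexKernelWarp(unsigned int arraySize, float* c, float* c_1, float* c_2) {
    int i = threadIdx.x;
    for (int sample = 0; sample < 1024; sample++) {
        // push back
        c_2[i] = c_1[i];
        c_1[i] = c[i];
        __syncwarp();   // warp-wide barrier; also orders memory among the warp's threads

        // increment
        c[i] = c[i] + 1.0f;
        __syncwarp();

        // smooth: read the old neighbourhood first ...
        float smoothed = c[i];
        if (i > 0 && i < arraySize - 1) {
            smoothed = (2.0f * c[i] + c[i + 1] + c[i - 1]) * 0.25f;
        }
        __syncwarp();
        // ... then write, so no thread overwrites a value a neighbour still needs
        c[i] = smoothed;
        __syncwarp();
    }
}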

BTW, for wasteTime to really waste time, you should neither compute a value that can be calculated at compile time, nor a value which is the same for each loop iteration, nor a value which is never used later on. All three are reasons for the compiler to remove the calculations.
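
A hedged sketch of a delay loop that avoids all three problems: the value depends on the thread index and the loop counter (so it is not a compile-time constant and differs per iteration), and it is written to global memory so it is "used later on". The debugOut parameter is hypothetical, added only for this illustration.

__device__ void wasteTime(int i, float* debugOut) {
    float acc = 0.0f;
    for (int j = 1; j <= 100; j++) {
        acc += powf((float)(i + j), 1.001f);   // varies per thread and per iteration
    }
    debugOut[i] = acc;                          // observable side effect, cannot be removed
}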

A fundamental principle in CUDA is that the CUDA programming model provides no guarantees as to the order of thread execution. CUDA-savvy people commonly recognize that certain specific implementations of the CUDA programming model may have "lockstep" execution characteristics, which are not inherent to or required by the CUDA programming model.

An additional fundamental principle in CUDA is that if you write code that depends on thread ordering for correctness, and you do not explicitly provide for such ordering in your code, your code is by definition broken, regardless of the results it produces.

Judging code correctness solely from results correctness is hazardous.

Correct code must produce correct results.
Correct results do not necessarily imply correct code.

CUDA implementations (i.e. GPUs) have a concept of a warp. The launch you have shown will often result in lockstep execution, even in the presence of "minor" conditional disruption of "warp convergence".

Yes, for purists, the warp is now exposed in the programming model. That does not abrogate any statements I have made. Warp-synchronous expectation is expressly discouraged.

Oops, yes you are correct. I am new. I wanted to use __syncwarp() here not __syncthreads(). I need synchrony for my designs to work, so I am planning to just stick to <32 threads and one block each time and use __syncwarp() copiously throughout. Sounds like that should be reasonably adequate for my needs. Thanks.

I mentioned elsewhere that that is a bad idea for performance.

I am curious what you think would be the solution to my problem then.

I am attempting to render real-time audio (audio synthesis) in terms of finite difference modeling (e.g. the wave equation), where the solution at each time step is highly parallelized.

However, there are multiple sequential steps that must take place at each audio sample, and each step must be done before the next.

At 512-1024 samples per audio buffer and a sampling rate of 44100 Hz, we have at most about 23 ms per buffer to process (1024 / 44100 ≈ 23 ms), and really only a tiny fraction of that, since this is the budget for ALL audio processes to finish.

I tried enqueuing loads of sequential kernels, but the latency is brutal even with empty kernels. If you run a loop of 1024 (1024 samples, one buffer) with four kernel steps enqueued per sample, this costs at least 10-15 ms even with empty kernels, on both CUDA and OpenCL.

So if I want synchrony at several steps along each sample processing, and also need synchrony from one sample to the next, and I am trying to process 512-1024 samples in one kernel, what other option do I have?

As I see it the only reasonable option is to limit it to 32 threads on one block and use __syncwarp().

According to the comment there:

You can (and should) launch larger thread blocks, maybe 128 or 256 threads. You can still use __syncwarp() to synchronize groups of 32 threads (thread id 0ā€¦31 is one warp, 32ā€¦63 is the second warp, etc). Block sizes smaller than 64 limit the occupancy since there is a limit on the number of blocks per SM.
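
For concreteness, the quoted scheme might look roughly like this (a sketch with invented names, assuming a 256-thread block split into 8 warps, each owning a contiguous 32-element tile and synchronizing only within itself):

// Launch e.g. perWarpKernel<<<1, 256>>>(d_data) with d_data holding 256 floats.
__global__ void perWarpKernel(float* data) {
    int lane = threadIdx.x % 32;       // position inside the warp (0..31)
    int warp = threadIdx.x / 32;       // which warp of the block (0..7)
    float* tile = data + warp * 32;    // this warp's tile

    tile[lane] += 1.0f;                // step 1 on the warp's own tile
    __syncwarp();                      // barrier among these 32 threads only
    tile[lane] *= 0.5f;                // step 2 sees step 1 within the warp
}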

However, this approach is useless for me, as I must attain synchrony across all threads or the simulation will fail. Having desynchronized parts of a wave equation, for example, means you don't have a functioning wave equation.

At least use several warps with one block, if useful. Synchronization should be fast enough.

There is also grid-wide synchronization (= between blocks) available from within kernels (a sketch follows after these suggestions).

Alternatively, calculate the border regions between blocks anew in each kernel (overlapping them), so you do not need synchronization.

Use ring buffers instead of moving elements in each iteration.

Use shared memory.

Perhaps your task is not well suited for parallel processing or parallel processing on GPU.
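
To make the grid-wide synchronization and ring-buffer suggestions concrete, here is a hedged sketch using cooperative groups; the kernel, parameters, and stencil are invented for the example. It requires a device with cooperative-launch support (compute capability 6.0+, check cudaDevAttrCooperativeLaunch), a launch via cudaLaunchCooperativeKernel, and, depending on the toolkit version, possibly compiling with -rdc=true.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch only: the initial state is expected in 'c' as passed in (it is read
// as c_1 in the first step). Instead of copying elements every step, the
// roles of the three buffers are rotated ring-buffer style.
__global__ void timeStepKernel(float* c, float* c_1, float* c_2, int n, int steps) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int s = 0; s < steps; s++) {
        // rotate buffer roles: the oldest buffer becomes the new "current"
        float* oldest = c_2;
        c_2 = c_1;  c_1 = c;  c = oldest;

        if (i > 0 && i < n - 1) {
            // placeholder stencil update reading the previous step's buffer
            c[i] = (2.0f * c_1[i] + c_1[i - 1] + c_1[i + 1]) * 0.25f;
        } else if (i < n) {
            c[i] = c_1[i];   // keep the edges consistent
        }
        grid.sync();  // every block finishes step s before any block reads
                      // its neighbours' step-s results in step s+1
    }
    // which physical array holds the newest values depends on steps % 3
}

// Host-side launch (sketch):
//   void* args[] = { &d_c, &d_c1, &d_c2, &n, &steps };
//   cudaLaunchCooperativeKernel((void*)timeStepKernel, dim3(blocks),
//                               dim3(threadsPerBlock), args, 0, stream);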


Are you suggesting I would then use __syncthreads(), since once I extend beyond one warp that is my only option to keep sync through each step of the solution?

The programming for the kernel is not much different either way. I just have to add a small for loop which may or may not be needed: if I am limited to 32 threads and need to calculate, for example, 256 nodes of data, I would iterate through a few nodes per thread (see the sketch below). With 256 nodes and 256 threads that is obviously unnecessary, but that is the only change.

Once I get the kernel running I will test it both ways then and see.
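
For what it's worth, that small loop might look like this (a sketch with invented names, assuming 32 threads covering 256 nodes):

// Each of the 32 threads handles nodes i, i+32, i+64, ..., so 256 nodes are
// covered; with 256 threads the loop body simply runs once per thread.
__global__ void nodesPerThread(float* nodes, int numNodes) {
    for (int n = threadIdx.x; n < numNodes; n += blockDim.x) {
        nodes[n] += 1.0f;   // placeholder per-node work
    }
    __syncwarp();           // with a single warp; use __syncthreads() for larger blocks
}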

ā€œPerhaps your task is not well suited for parallel processing or parallel processing on GPU.ā€

😂 Perhaps, but it is even less well suited for a CPU.

On a modern Linux system, launching an empty kernel and waiting for its completion should take a few microseconds, not milliseconds.

I was not referring to one, but thousands.

I tried enqueuing a loop of 1024 with 4 kernels per iteration (i.e. 4096 serial kernels). That takes ~10-12 ms on OpenCL and ~12-15 ms on CUDA in my tests.

That was not practical. The only choice I see is to limit it to 32 threads and run one kernel that handles the whole job, with sync operations (__syncwarp() in CUDA or barrier(CLK_LOCAL_MEM_FENCE) in OpenCL) as needed inside it. I will have to test and see how that performs.

It seems like you have a process that

(1) offers limited parallelism
(2) is latency sensitive

which frankly does not strike me as a great match for GPUs, which desire massive parallelism and are architecturally optimized for high throughput. What up-front analysis suggested that GPUs would be a good target platform?

The launch overhead could perhaps be reduced somewhat with CUDA graphs.
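
A hedged sketch of that idea: capture the per-buffer sequence of small kernels into a CUDA graph once, then replay it with a single launch per audio buffer, which removes most of the CPU-side enqueue cost (the GPU-side scheduling of the serialized kernels still takes time). The emptyStep kernel and runBufferWithGraph function are stand-ins invented for the example.

__global__ void emptyStep() {}   // placeholder for the real per-sample kernels

void runBufferWithGraph(cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // capture the 1024 x 4 launches into a graph
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int sample = 0; sample < 1024; sample++) {
        for (int step = 0; step < 4; step++) {
            emptyStep<<<1, 32, 0, stream>>>();
        }
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // one launch replays all 4096 captured kernel nodes
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}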

How much can you do in parallel?

Perhaps block-wide and grid-wide sync could be used.

Otherwise, using overlapping blocks and calculating the overlapping regions redundantly on each side could make sense.

E.g. if you have 1024 iterations, you could let one kernel launch do 8 iterations. That lowers the kernel launches to 128.

Then instead of distributing the work onto blocks as

0..31 ---- 32..63 ---- 64..95 ---- 96..127

you do

0..38 ---- 25..70 ---- 57..102 ---- 89..127

(numbers are just examples for the principle)

Each block can now work independently for the small cost of some doubled calculations.

Instead of launching a new kernel after 8 iterations, you can do a grid-wide synchronization.
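
A hedged sketch of the overlap idea, assuming a simple 3-point smoothing like the one earlier in the thread (names and sizes are invented): each block owns TILE cells and additionally loads HALO cells on each side, enough to run ITERS iterations without any inter-block synchronization; one halo cell per side goes stale per iteration, so HALO == ITERS suffices, and the redundantly computed halo results are simply discarded.

// Launch with blockDim.x == TILE + 2*HALO and gridDim.x == ceil(n / TILE).
#define TILE  32
#define ITERS 8
#define HALO  ITERS

__global__ void overlappedSmooth(const float* in, float* out, int n) {
    __shared__ float buf[TILE + 2 * HALO];
    int local = threadIdx.x;                       // 0 .. TILE + 2*HALO - 1
    int gidx  = blockIdx.x * TILE - HALO + local;  // global index incl. halo

    // load tile plus halo, clamped at the array edges
    buf[local] = in[min(max(gidx, 0), n - 1)];
    __syncthreads();

    for (int it = 0; it < ITERS; it++) {
        float v = buf[local];
        // same weighted smoothing as earlier in the thread, skipping the
        // true array edges and the outermost shared-memory cells
        if (gidx > 0 && gidx < n - 1 &&
            local > 0 && local < TILE + 2 * HALO - 1) {
            v = (2.0f * buf[local] + buf[local - 1] + buf[local + 1]) * 0.25f;
        }
        __syncthreads();   // everyone has read the old values
        buf[local] = v;
        __syncthreads();   // everyone sees the new values
    }

    // write back only the interior cells; the recomputed halo is discarded
    if (local >= HALO && local < HALO + TILE && gidx < n) {
        out[gidx] = buf[local];
    }
}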