Global Barrier synchronization

Hi, I have a kernel like this:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs)
{
    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;

    // Load the local data
    T2 data = matrix_data[i];
    // Some form of synchronization goes here
    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    matrix_data[put_no_full] = data;
}

As you can see, I want to read all the data from global memory into local memory, then synchronize all threads in the entire grid, and then write other data into those same memory locations.

How can I do this? Please suggest.

You can do a grid-wide sync using cooperative groups if you are on a Pascal or newer GPU. The methodology is covered in the programming guide, as well as in various forum posts, for example this one.
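
A minimal sketch of that pattern (the kernel and variable names here are illustrative, not your code; grid sync may also require compiling with relocatable device code, -rdc=true):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void phased_kernel(float *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;

    // Phase 1: read. Note that no thread may return before grid.sync();
    // out-of-range threads just skip the work instead of exiting early.
    float v = (i < n) ? data[i] : 0.0f;

    grid.sync();   // every thread in every block must reach this point

    // Phase 2: write to locations that were read by other threads.
    if (i < n)
        data[n - 1 - i] = v;
}

// Host side: the kernel must be launched cooperatively, and the grid must not
// contain more blocks than can be resident on the device at the same time.
void launch(float *d_data, int n)
{
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;   // must fit on the device
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)phased_kernel, dim3(numBlocks),
                                dim3(blockSize), args, 0, 0);
}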

I did a grid sync via:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs)
{
    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;


    auto grid = cooperative_groups::this_grid();

    // Load the local data
    T2 data = matrix_data[i];
    __threadfence_system();
    grid.sync();

    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    matrix_data[put_no_full] = data;
}

and launched the kernel using

    cudaLaunchCooperativeKernel((void *)(TRANSPOSE::__chunk_transpose__<T2>), __grid_fourier_space__, __block_fourier_space__, kernel_args, 0, 0);

Still not working.

As I understand it, you need a grid-wide sync because you read from and then write to the same memory locations.

Can you try, just for testing, to use a different memory area for the input and output matrix (out-of-place)? With this method you can determine whether the issue is the sync or the index calculations.

BTW, for later: you are using lots of divisions and modulo operations. To improve performance, you could consider making the dimensions Nx, Ny, Nz and procs constants (e.g. as non-type template parameters).
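
A rough sketch of what that could look like (illustrative, not your actual code; the in-place write at the end still has the ordering problem discussed in this thread):

// Dimensions passed as non-type template parameters, so every divisor below
// is a compile-time constant that the compiler can strength-reduce.
template <typename T2, int Nx, int Ny, int Nz, int procs>
__global__ void chunk_transpose_fixed(T2 *matrix_data)
{
    long long i = (long long)blockDim.x * blockIdx.x + threadIdx.x;

    if (i >= (long long)(Nx / procs) * Ny * (Nz / 2 + 1))
        return;

    T2 data = matrix_data[i];

    constexpr int d_tmp_z = Ny / procs;
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    long long put_no_slab = ((long long)odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs))
                          + ((long long)put_odd_even * (Nz / 2 + 1))
                          + ((long long)Ny_no * (Nz / 2 + 1) * Ny / procs);

    matrix_data[put_no_slab + Nx_no] = data;   // still needs the grid-wide ordering fix
}

// One instantiation per concrete problem size, e.g. (values illustrative):
// chunk_transpose_fixed<float2, 512, 512, 512, 4><<<grid, block>>>(d_data);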

Yes, I have tried the out-of-place operation also. The kernel works fine for it, but it gives wrong output for the in-place version shown above.

Does the whole grid fit onto your GPU concurrently? I.e. blocks per SM * number of SMs <= number of blocks?

Yes, it fits

To test whether the sync worked, you could increment a global atomic variable once per warp or block after reading/before the sync, and read it after the sync/before writing. It should equal the total number of warps or blocks at that point.

I didn't get it. Can you explain via an example?

Something like this: you count in a global variable finishedReading (which you initialize to 0 before the kernel call) how many blocks have already read their data, and after the synchronization you test whether all of them had finished at that point.
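
A minimal sketch of that check (the counter name finishedReading is illustrative and must be zeroed on the host before the launch; the kernel body here is just a shell):

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void check_grid_sync(int *finishedReading)
{
    cg::grid_group grid = cg::this_grid();

    // ... read phase of the real kernel would go here ...

    if (threadIdx.x == 0)
        atomicAdd(finishedReading, 1);   // one increment per block after reading

    grid.sync();

    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("blocks finished reading before sync: %d (expected %d)\n",
               *finishedReading, gridDim.x);

    // ... write phase of the real kernel would go here ...
}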

I tried this:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs, int *at_array)
{
    auto grid = cooperative_groups::this_grid();

    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;

    // Load the local data
    T2 data = matrix_data[i];

    if(threadIdx.x == 0)
    {
        atomicAdd(&(at_array[0]),1);
    }
    __threadfence_system();
    grid.sync();

    int finishedSoFar = *at_array;

    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    __syncthreads();
        
    if(i == 0)
    {
        printf("total value of grid == %d",finishedSoFar);
    }

    matrix_data[put_no_full] = data;
}

But it is not printing anything at all.

My collective launch is like this

    void *kernel_args[] = {&__buffer__<T2>, &__Nx__, &__Ny__, &__Nz__, &__procs__, &atomic_add_temp_array};
    cudaLaunchCooperativeKernel((void *)(TRANSPOSE::__chunk_transpose__<T2>), __grid_fourier_space__, __block_fourier_space__, kernel_args);
    TRANSITIONS::__Device_synchronize__();

There could be an error or exception.

Have you checked the return value of cudaLaunchCooperativeKernel?
And afterwards cudaDeviceSynchronize() and cudaGetLastError()?

The only other reason would be that your kernel returns early at the if (i >= ...) check, even for i == 0.

Do you have a CUDA debugger available that breaks on errors?
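
For example, a minimal check around your launch (using the names from your code) could look like this:

cudaError_t err = cudaLaunchCooperativeKernel(
    (void *)(TRANSPOSE::__chunk_transpose__<T2>),
    __grid_fourier_space__, __block_fourier_space__, kernel_args, 0, 0);
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();   // wait for the kernel and catch execution errors
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));

err = cudaGetLastError();        // inspect/clear any remaining sticky error
if (err != cudaSuccess)
    printf("last error: %s\n", cudaGetErrorString(err));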

Yes, I'm getting the error "cuda error = too many blocks in cooperative launch".

Which GPU / how many SMs are you using?
How many of your blocks can be resident on an SM at the same time?
What is the value of your __grid_fourier_space__?
How much smaller must your grid size be to be accepted?

The programming guide shows how to programmatically choose the number of blocks to fit.
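
A sketch of that calculation (assuming the kernel and the block size of 256 from this thread):

int device = 0, numSMs = 0, blocksPerSM = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int blockSize = 256;   // __block_fourier_space__ in this thread
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksPerSM, TRANSPOSE::__chunk_transpose__<T2>, blockSize, 0 /* dynamic smem */);

int maxCoopBlocks = blocksPerSM * numSMs;   // largest grid a cooperative launch accepts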

I'm using an A100 GPU (40 GB).
grid_fourier_space = (512 * 257) + 1
block_fourier_space = 256

The A100 has 108 SMs and can keep between 1 and 32 blocks per SM (depending on their resource requirements). With 131585 blocks you are way beyond that maximum.

So not all blocks will be loaded onto the GPU at the same time, but that is necessary for grid-wide syncing with cudaLaunchCooperativeKernel. So either you reduce the number of blocks (and e.g. introduce for loops within your kernel; that is only difficult if you need lots of shared memory per block, otherwise it is the nearly trivial solution), or you change your algorithm so that you do not have to sync the whole grid.

In your specific case you have the difficulty that you would need separate for loops before and after the sync. That means your local data would not just be a single T2, but a lot of them. You want to keep local data within registers for best performance. Registers cannot be dynamically indexed, so you could unroll the for loops (#pragma unroll) and make the number of iterations constant to make the indices static.
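
A rough sketch of that structure (ITEMS_PER_THREAD is an illustrative compile-time constant, and the destination index is left as a placeholder for your original put_no_full arithmetic):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

template <typename T2, int ITEMS_PER_THREAD>
__global__ void chunk_transpose_looped(T2 *matrix_data, long long total)
{
    cg::grid_group grid = cg::this_grid();
    long long stride = (long long)gridDim.x * blockDim.x;
    long long base = (long long)blockIdx.x * blockDim.x + threadIdx.x;

    T2 local[ITEMS_PER_THREAD];   // per-thread values kept in registers across the sync

    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {   // read phase
        long long i = base + (long long)k * stride;
        if (i < total)
            local[k] = matrix_data[i];
    }

    grid.sync();   // no early return anywhere above, so every thread reaches this

    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {   // write phase
        long long i = base + (long long)k * stride;
        if (i < total) {
            long long dst = i;   // placeholder: replace with the original put_no_full arithmetic
            matrix_data[dst] = local[k];
        }
    }
}

// Launched cooperatively with at most (blocks per SM * number of SMs) blocks,
// where blocks * blockDim.x * ITEMS_PER_THREAD >= total.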

Do you really need a transpose function for a huge matrix?
The best practice is to store a flag indicating whether the matrix has to be interpreted as transposed, and to access it one way or the other in any future operation.

That should be >= (in my earlier question about whether the grid fits). Sorry for any misunderstanding.

No, sadly it does not fit.

Isn't there any way of doing it without looping or anything like that? Like, I want to do something like cudaDeviceSynchronize() but within the kernel.

You are using the same memory for reading the matrix from and writing the transposed matrix to.
If you keep the algorithm, you need to read the full matrix before starting to write.
You have to store the intermediate results somewhere.
So all blocks have to be running.
The maximum number of blocks running is limited.
Only if you stay below the maximum number of blocks can you do a grid-wide sync.

Normal kernel calls would run some blocks sequentially. That is not possible when syncing; otherwise the syncing would not achieve what you want. You cannot pause some blocks and resume them later.

Sometimes it is acceptable to split into two kernels. But that is not possible here, as you want to keep (local) data.

So you have the following options (choose either):

  • do not use transpose, but use a flag signifying transposed state for the next operations on the matrix
  • change the algorithm to one which does not need syncing, e.g. one working with swapping elements along the diagonal (see the sketch after this list)
  • cut your matrix into smaller independent (pairs of) rectangles (similar to the swapping algorithm)
  • reduce the number of blocks; use for loops and more local (or shared memory) data
  • use two memory arrays for input and output (out-of-place instead of in-place computation)
  • at the point of sync, store all local data and shared memory to global memory; leave the kernel; synchronize on the host side; then launch a new kernel which loads the intermediate data from global memory
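
For the diagonal-swap idea mentioned above, a generic sketch for a plain square N x N matrix (your chunked slab layout would need the element pairing adapted accordingly):

// Each off-diagonal pair (row, col)/(col, row) is swapped by exactly one thread,
// so no grid-wide synchronization is needed.
__global__ void transpose_in_place_square(float *A, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N && col > row) {
        float tmp = A[row * N + col];
        A[row * N + col] = A[col * N + row];
        A[col * N + row] = tmp;
    }
}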

It is not a shortcoming or missing feature of CUDA. It is logically not possible to sync grid-wide with more blocks than can run simultaneously.