Global Barrier synchronization

Hi, I have a kernel like this:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs)
{
    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;

    // Load the local data
    T2 data = matrix_data[i];
    // Some form of synchronization goes here
    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    matrix_data[put_no_full] = data;
}

As you can see, I want to read all the data from global memory into local memory, then synchronize all threads in the entire grid, and then write other data into those same memory locations.

How can I do this? Please suggest.

You can do a grid-wide sync using cooperative groups if you are on a Pascal or newer GPU. The methodology is covered in the programming guide, as well as in various forum posts, for example this one.
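
A minimal sketch of that pattern (the kernel and variable names here are illustrative, not your code; grid sync may also require compiling with relocatable device code, -rdc=true):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void phased_kernel(float *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;

    // Phase 1: read. Note that no thread may return before grid.sync();
    // out-of-range threads just skip the work instead of exiting early.
    float v = (i < n) ? data[i] : 0.0f;

    grid.sync();   // every thread in every block must reach this point

    // Phase 2: write to locations that were read by other threads.
    if (i < n)
        data[n - 1 - i] = v;
}

// Host side: the kernel must be launched cooperatively, and the grid must not
// contain more blocks than can be resident on the device at the same time.
void launch(float *d_data, int n)
{
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;   // must fit on the device
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)phased_kernel, dim3(numBlocks),
                                dim3(blockSize), args, 0, 0);
}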

I did a grid sync via:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs)
{
    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;


    auto grid = cooperative_groups::this_grid();

    // Load the local data
    T2 data = matrix_data[i];
    __threadfence_system();
    grid.sync();

    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    matrix_data[put_no_full] = data;
}

and launched the kernel using

    cudaLaunchCooperativeKernel((void *)(TRANSPOSE::__chunk_transpose__<T2>), __grid_fourier_space__, __block_fourier_space__, kernel_args, 0, 0);

Still not working.

As I understand it, you need a grid-wide sync because you read from and then write to the same memory locations.

Can you try, just for testing, to use a different memory area for the input and output matrix (out-of-place)? With this method you can determine whether the issue is the sync or the index calculations.

BTW, for later: you are using lots of divisions and modulo operations. To improve performance, you could consider making the dimensions Nx, Ny, Nz and procs constants (e.g. as non-type template parameters).
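
A rough sketch of what that could look like (illustrative, not your actual code; the in-place write at the end still has the ordering problem discussed in this thread):

// Dimensions passed as non-type template parameters, so every divisor below
// is a compile-time constant that the compiler can strength-reduce.
template <typename T2, int Nx, int Ny, int Nz, int procs>
__global__ void chunk_transpose_fixed(T2 *matrix_data)
{
    long long i = (long long)blockDim.x * blockIdx.x + threadIdx.x;

    if (i >= (long long)(Nx / procs) * Ny * (Nz / 2 + 1))
        return;

    T2 data = matrix_data[i];

    constexpr int d_tmp_z = Ny / procs;
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    long long put_no_slab = ((long long)odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs))
                          + ((long long)put_odd_even * (Nz / 2 + 1))
                          + ((long long)Ny_no * (Nz / 2 + 1) * Ny / procs);

    matrix_data[put_no_slab + Nx_no] = data;   // still needs the grid-wide ordering fix
}

// One instantiation per concrete problem size, e.g. (values illustrative):
// chunk_transpose_fixed<float2, 512, 512, 512, 4><<<grid, block>>>(d_data);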

Yes, I have tried the out-of-place operation also. The kernel works fine for it, but it gives wrong output for the in-place version shown above.

Does the whole grid fit onto your GPU concurrently? I.e. blocks per SM * number of SMs <= number of blocks?

Yes, it fits

To test whether the sync worked, you could increment a global atomic variable once per warp or block after reading/before the sync, and read it after the sync/before writing. It should equal the total number of warps or blocks at that point.

I didn't get it. Can you explain via an example?

Something like this: you count in a global variable finishedReading (which you initialize to 0 before the kernel call) how many blocks have already read their data, and after the synchronization you test whether all of them had finished at that point.
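
A minimal sketch of that check (the counter name finishedReading is illustrative and must be zeroed on the host before the launch; the kernel body here is just a shell):

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void check_grid_sync(int *finishedReading)
{
    cg::grid_group grid = cg::this_grid();

    // ... read phase of the real kernel would go here ...

    if (threadIdx.x == 0)
        atomicAdd(finishedReading, 1);   // one increment per block after reading

    grid.sync();

    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("blocks finished reading before sync: %d (expected %d)\n",
               *finishedReading, gridDim.x);

    // ... write phase of the real kernel would go here ...
}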

I tried this:

template <typename T2>
__global__ void TRANSPOSE::__chunk_transpose__(T2 *matrix_data, int64 Nx, int64 Ny, int64 Nz, int procs, int *at_array)
{
    auto grid = cooperative_groups::this_grid();

    int64 i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i >= ((Nx / procs) * (Ny) * (Nz / 2 + 1)))
        return;

    // Load the local data
    T2 data = matrix_data[i];

    if(threadIdx.x == 0)
    {
        atomicAdd(&(at_array[0]),1);
    }
    __threadfence_system();
    grid.sync();

    int finishedSoFar = *at_array;

    int d_tmp_z = (Ny / procs);
    int Nx_no = i % (Nz / 2 + 1);
    int Ny_no = (i / (Nz / 2 + 1)) % (Nx / procs);
    int Nz_no = i / ((Nz / 2 + 1) * (Nx / procs));
    int odd_even = Nz_no / d_tmp_z;
    int put_odd_even = Nz_no % d_tmp_z;
    int64 put_no_slab = (odd_even * (Nx / procs) * (Nz / 2 + 1) * (Ny / procs)) + (put_odd_even * (Nz / 2 + 1)) + (Ny_no * (Nz / 2 + 1) * (Ny) / procs);
    int64 put_no_full = put_no_slab + Nx_no;

    __syncthreads();
        
    if(i == 0)
    {
        printf("total value of grid == %d",finishedSoFar);
    }

    matrix_data[put_no_full] = data;
}

But it is not printing anything at all.

My collective launch is like this

    void *kernel_args[] = {&__buffer__<T2>, &__Nx__, &__Ny__, &__Nz__, &__procs__, &atomic_add_temp_array};
    cudaLaunchCooperativeKernel((void *)(TRANSPOSE::__chunk_transpose__<T2>), __grid_fourier_space__, __block_fourier_space__, kernel_args);
    TRANSITIONS::__Device_synchronize__();

There could be an error or exception.

Have you checked the return value of cudaLaunchCooperativeKernel?
And afterwards cudaDeviceSynchronize() and cudaGetLastError()?

The only other reason would be that your kernel returns early at the if (i >= ...) check, even for i == 0.

Do you have a CUDA debugger available that breaks on errors?
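
For example, a minimal check around your launch (using the names from your code) could look like this:

cudaError_t err = cudaLaunchCooperativeKernel(
    (void *)(TRANSPOSE::__chunk_transpose__<T2>),
    __grid_fourier_space__, __block_fourier_space__, kernel_args, 0, 0);
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();   // wait for the kernel and catch execution errors
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));

err = cudaGetLastError();        // inspect/clear any remaining sticky error
if (err != cudaSuccess)
    printf("last error: %s\n", cudaGetErrorString(err));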

Yes, I'm getting the error "cuda error = too many blocks in cooperative launch".

Which GPU / how many SMs are you using?
How many of your blocks can be resident on an SM at the same time?
What is the value of your __grid_fourier_space__?
How much smaller must your grid size be to be accepted?

The programming guide shows how to programmatically choose the number of blocks to fit.
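
A sketch of that calculation (assuming the kernel and the block size of 256 from this thread):

int device = 0, numSMs = 0, blocksPerSM = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int blockSize = 256;   // __block_fourier_space__ in this thread
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksPerSM, TRANSPOSE::__chunk_transpose__<T2>, blockSize, 0 /* dynamic smem */);

int maxCoopBlocks = blocksPerSM * numSMs;   // largest grid a cooperative launch accepts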

I'm using an A100 GPU (40 GB).
grid_fourier_space = (512 * 257) + 1
block_fourier_space = 256

The A100 has 108 SMs and can keep between 1 and 32 blocks per SM (depending on their resource requirements). With 131585 blocks you are way beyond that maximum.

So not all blocks will be loaded onto the GPU at the same time, but that is necessary for grid-wide syncing with cudaLaunchCooperativeKernel. So either you reduce the number of blocks (and e.g. introduce for loops within your kernel; that is only difficult if you need lots of shared memory per block, otherwise it is the nearly trivial solution), or you change your algorithm so that you do not have to sync the whole grid.

In your specific case you have the difficulty that you would need separate for loops before and after the sync. That means your local data would not just be a single T2, but a lot of them. You want to keep local data within registers for best performance. Registers cannot be dynamically indexed, so you could unroll the for loops (#pragma unroll) and make the number of iterations constant to make the indices static.
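
A rough sketch of that structure (ITEMS_PER_THREAD is an illustrative compile-time constant, and the destination index is left as a placeholder for your original put_no_full arithmetic):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

template <typename T2, int ITEMS_PER_THREAD>
__global__ void chunk_transpose_looped(T2 *matrix_data, long long total)
{
    cg::grid_group grid = cg::this_grid();
    long long stride = (long long)gridDim.x * blockDim.x;
    long long base = (long long)blockIdx.x * blockDim.x + threadIdx.x;

    T2 local[ITEMS_PER_THREAD];   // per-thread values kept in registers across the sync

    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {   // read phase
        long long i = base + (long long)k * stride;
        if (i < total)
            local[k] = matrix_data[i];
    }

    grid.sync();   // no early return anywhere above, so every thread reaches this

    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {   // write phase
        long long i = base + (long long)k * stride;
        if (i < total) {
            long long dst = i;   // placeholder: replace with the original put_no_full arithmetic
            matrix_data[dst] = local[k];
        }
    }
}

// Launched cooperatively with at most (blocks per SM * number of SMs) blocks,
// where blocks * blockDim.x * ITEMS_PER_THREAD >= total.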

Do you really need a transpose function for a huge matrix?
The best practice is to store a flag indicating whether the matrix has to be interpreted as transposed, and to access it one way or the other in any future operation.

That should be >= (in my earlier question about whether the grid fits). Sorry for any misunderstanding.

No, sadly it does not fit.

Isn't there any way of doing it without looping or anything like that? Like, I want to do something like cudaDeviceSynchronize() but within the kernel.

You are using the same memory for reading the matrix from and writing the transposed matrix to.
If you keep the algorithm, you need to read the full matrix before starting to write.
You have to store the intermediate results somewhere.
So all blocks have to be running.
The maximum number of blocks running is limited.
Only if you stay below the maximum number of blocks can you do a grid-wide sync.

Normal kernel calls would run some blocks sequentially. That is not possible when syncing; otherwise the syncing would not achieve what you want. You cannot pause some blocks and resume them later.

Sometimes it is acceptable to split into two kernels. But that is not possible here, as you want to keep (local) data.

So you have the following options (choose either):

  • do not use transpose, but use a flag signifying transposed state for the next operations on the matrix
  • change the algorithm to one which does not need syncing, e.g. one working with swapping elements along the diagonal (see the sketch after this list)
  • cut your matrix into smaller independent (pairs of) rectangles (similar to the swapping algorithm)
  • reduce the number of blocks; use for loops and more local (or shared memory) data
  • use two memory arrays for input and output (out-of-place instead of in-place computation)
  • at the point of sync, store all local data and shared memory to global memory; leave the kernel; synchronize on the host side; then launch a new kernel which loads the intermediate data from global memory
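
For the diagonal-swap idea mentioned above, a generic sketch for a plain square N x N matrix (your chunked slab layout would need the element pairing adapted accordingly):

// Each off-diagonal pair (row, col)/(col, row) is swapped by exactly one thread,
// so no grid-wide synchronization is needed.
__global__ void transpose_in_place_square(float *A, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N && col > row) {
        float tmp = A[row * N + col];
        A[row * N + col] = A[col * N + row];
        A[col * N + row] = tmp;
    }
}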

It is not a shortcoming or missing feature of CUDA. It is logically not possible to sync grid-wide with more blocks than can run simultaneously.