Issue with cooperative_groups::memcpy_async

In a program I need to copy a char buffer of N elements from 4-byte-aligned shared memory to 4-byte-aligned global memory. For an efficient copy, as many 4-byte copies (ints) as possible are performed; the remainder is copied byte by byte.

I want to rewrite this code with memcpy_async for readability. memcpy_async allows specifying a compile-time pointer alignment, so ideally the 4-byte alignment should be specified. However, with an alignment x specified, only (N / x) * x bytes seem to be copied.
Is this intended? Am I using it wrong? The programming guide is not very clear to me on this topic (C.6.2.1, Programming Guide :: CUDA Toolkit Documentation).
For example, usage 2 mentions a copy of N elements, and then uses N as a template parameter for aligned_size_t.

The following code shows the problem: 5 chars should be copied from smem to gmem, but the kernel that uses aligned_size_t<4> (kernel1) copies only 4 chars.

#include <iostream>
#include <algorithm>
#include <cassert>

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

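// kernel0: cg::memcpy_async with a plain byte count, no alignment hint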
__global__ void kernel0(char* a, const char* b, int N){
    __shared__ char smem[1024];

    auto g = cg::this_thread_block();

    for(int i = g.thread_rank(); i < N; i += g.size()) {
        smem[i] = b[i];
    }

    g.sync();

    cg::memcpy_async(
        g,
        a,
        smem,
        N
    );

    cg::wait(g);
}

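// kernel1: cg::memcpy_async with cuda::aligned_size_t<4>, the case in question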
__global__ void kernel1(char* a, const char* b, int N){
    __shared__ char smem[1024];

    auto g = cg::this_thread_block();

    for(int i = g.thread_rank(); i < N; i += g.size()) {
        smem[i] = b[i];
    }

    g.sync();

    cg::memcpy_async(
        g,
        a,
        cuda::aligned_size_t<4>(N),
        smem,
        cuda::aligned_size_t<4>(N)
    );

    cg::wait(g);
}

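// kernel2: cuda::memcpy_async with an explicit block-wide barrier and cuda::aligned_size_t<4>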
__global__ void kernel2(char* a, const char* b, int N){
    __shared__ char smem[1024];
    __shared__ cuda::barrier<cuda::thread_scope_block> barrier;

    auto g = cg::this_thread_block();

    if(g.thread_rank() == 0) {
        init(&barrier, g.size());
    }

    for(int i = g.thread_rank(); i < N; i += g.size()) {
        smem[i] = b[i];
    }

    g.sync();

    cuda::memcpy_async(
        g,
        a, 
        smem, 
        cuda::aligned_size_t<4>(N),
        barrier
    );

    barrier.arrive_and_wait();
}



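// kernel3: manual copy, full 4-byte ints first, then the remaining bytes one by one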
__global__ void kernel3(char* a, const char* b, int N){
    __shared__ char smem[1024];

    auto g = cg::this_thread_block();

    for(int i = g.thread_rank(); i < N; i += g.size()) {
        smem[i] = b[i];
    }

    g.sync();

    const int fullInts = N / sizeof(int);

    for(int i = g.thread_rank(); i < fullInts; i += g.size()) {
        ((int*)a)[i] = ((int*)smem)[i];
    }

    for(int i = g.thread_rank(); i < N - fullInts * sizeof(int); i += g.size()) {
        a[fullInts * sizeof(int) + i] 
            = smem[fullInts * sizeof(int) + i];
    }  
}

int main(){
    const int N = 5;

    char* input; cudaMallocHost(&input, 1024);
    char* output0; cudaMallocHost(&output0, 1024);
    char* output1; cudaMallocHost(&output1, 1024);
    char* output2; cudaMallocHost(&output2, 1024);
    char* output3; cudaMallocHost(&output3, 1024);

    std::fill(input, input + 1024, 'A');
    std::fill(output0, output0 + 1024, 'B');
    std::fill(output1, output1 + 1024, 'B');
    std::fill(output2, output2 + 1024, 'B');
    std::fill(output3, output3 + 1024, 'B');

    kernel0<<<1, 256>>>(output0, input, N);
    kernel1<<<1, 256>>>(output1, input, N);
    kernel2<<<1, 256>>>(output2, input, N);
    kernel3<<<1, 256>>>(output3, input, N);
    auto status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);

    std::cerr << "output0: \n";
    for(int i = 0; i < N; i++){
        std::cerr << output0[i];
    }
    std::cerr << "\n";

    std::cerr << "output1: \n";
    for(int i = 0; i < N; i++){
        std::cerr << output1[i];
    }
    std::cerr << "\n";

    std::cerr << "output2: \n";
    for(int i = 0; i < N; i++){
        std::cerr << output2[i];
    }
    std::cerr << "\n";

    std::cerr << "output3: \n";
    for(int i = 0; i < N; i++){
        std::cerr << output3[i];
    }
    std::cerr << "\n";
}

Output on my machine with sm_70:

output0: 
AAAAA
output1: 
AAAAB
output2: 
AAAAA
output3: 
AAAAA

My suggestion would be to file a bug. It’s not obvious to me that the kernel1 usage should have any limitations compared to the kernel0 usage, for the example you have given.

If you file a bug, I would expect it either to surface a technical issue or to expose a need to improve the docs.

I could mention a few other things; you probably know them already:

  1. the underlying hardware-accelerated transfer only applies to 4-byte quantities.
  2. the underlying hardware-accelerated transfer requires cc 8.0; it is not available on cc 7.0.
  3. the underlying hardware-accelerated transfer only applies to the global->smem path.

The net of that is that, for the case you have given, I think there should be no difference in using the kernel0 approach as a WAR (workaround).

Yes, I am familiar with the limitations on hardware acceleration.

I have filed a bug. It turns out kernel1 contains undefined behaviour. According to the libcudacxx documentation (cuda::aligned_size_t - libcu++), the constructor argument of cuda::aligned_size_t does not only specify the number of bytes to copy; it also has to be a multiple of the specified alignment parameter. This is not the case when copying 5 bytes with an alignment of 4.
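
For reference, here is a minimal sketch of how kernel1 could be adjusted to respect that contract (kernel1_fixed and roundedN are names I made up for illustration, not part of the bug report): use aligned_size_t<4> only when the byte count is a multiple of 4, and otherwise either round the copy size up to the next multiple of 4 (acceptable in this example because both buffers are 1024-byte allocations) or fall back to a plain byte count as in kernel0.

__global__ void kernel1_fixed(char* a, const char* b, int N){
    __shared__ char smem[1024];

    auto g = cg::this_thread_block();

    for(int i = g.thread_rank(); i < N; i += g.size()) {
        smem[i] = b[i];
    }

    g.sync();

    if(N % 4 == 0){
        // N is a multiple of the stated alignment, so the aligned_size_t contract holds
        cg::memcpy_async(g, a, cuda::aligned_size_t<4>(N), smem, cuda::aligned_size_t<4>(N));
    } else {
        // round up to the next multiple of 4; this copies up to 3 padding bytes
        // and is only valid because both buffers are 1024 bytes large in this example
        const int roundedN = ((N + 3) / 4) * 4;
        cg::memcpy_async(g, a, cuda::aligned_size_t<4>(roundedN), smem, cuda::aligned_size_t<4>(roundedN));
    }

    cg::wait(g);
}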

The documentation in the programming guide will be improved in the future.