Nvshmem_g bandwidth test result differs by data type

I am running NVSHMEM 3.3.9's perftest shmem_g_bw in an intra-node scenario with all default parameters. The maximum bandwidth is about 50 GB/s; however, if I add -d double, it reaches about 90 GB/s. Why?

$ srun --partition=debug --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --mpi=pmix --environment=cscs-nv-hpc-bench -A a-g200 bash -c 'NVSHMEM_BOOTSTRAP_PMI=PMIX /nvshmem_src/build/perftest/device/pt-to-pt/shmem_g_bw -e 32M'
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 1 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1
#shmem_g_bw
size (Bytes) scope BW (GB/sec) CTAs thread/CTA
4 None 0.001476 32 256
8 None 0.003243 32 256
16 None 0.006849 32 256
32 None 0.013532 32 256
64 None 0.027322 32 256
128 None 0.054348 32 256
256 None 0.109589 32 256
512 None 0.216802 32 256
1024 None 0.435374 32 256
2048 None 0.881543 32 256
4096 None 1.780250 32 256
8192 None 3.473541 32 256
16384 None 6.580977 32 256
32768 None 12.784020 32 256
65536 None 20.500500 32 256
131072 None 28.804501 32 256
262144 None 36.555107 32 256
524288 None 42.270382 32 256
1048576 None 46.459663 32 256
2097152 None 48.534397 32 256
4194304 None 49.744579 32 256
8388608 None 50.327141 32 256
16777216 None 50.745079 32 256
33554432 None 50.870136 32 256
[clariden][yijunma@clariden-ln001 build]$ srun --partition=debug --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --mpi=pmix --environment=cscs-nv-hpc-bench -A a-g200 bash -c 'NVSHMEM_BOOTSTRAP_PMI=PMIX /nvshmem_src/build/perftest/device/pt-to-pt/shmem_g_bw -d double -e 32M'
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: double, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 1 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1
Runtime options after parsing command line arguments
min_size: 4, max_size: 33554432, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: double, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, tpb_sweep: 0, cta_sweep: 0, size_per_thread: 4096, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1
#shmem_g_bw
size (Bytes) scope BW (GB/sec) CTAs thread/CTA
4 None 0.002013 32 256
8 None 0.003342 32 256
16 None 0.006435 32 256
32 None 0.014225 32 256
64 None 0.028409 32 256
128 None 0.056577 32 256
256 None 0.113960 32 256
512 None 0.224090 32 256
1024 None 0.451977 32 256
2048 None 0.883978 32 256
4096 None 1.777778 32 256
8192 None 3.482993 32 256
16384 None 7.042641 32 256
32768 None 12.816021 32 256
65536 None 23.167421 32 256
131072 None 37.509155 32 256
262144 None 53.160290 32 256
524288 None 67.451622 32 256
1048576 None 80.294044 32 256
2097152 None 86.710770 32 256
4194304 None 90.682159 32 256
8388608 None 92.981941 32 256
16777216 None 93.923073 32 256
33554432 None 94.402519 32 256

Thank you for your question. First, your perf results are formatted as a single line, which makes them hard to read. Could you re-format them as a block?

Regarding the g_bw performance, a few things could be going on.

Calls to nvshmem_g result in memory transactions going out over NVLink (you mentioned this is intra-node, and I can see from the logs that this is a GH200 system). Suppose each of your default-sized (4-byte int) accesses touches a non-contiguous address: each 4-byte load then goes out as an individual memory transaction. Doubling the data type size means the same number of transactions moves twice as much data. The real limiting factor in that case is the memory transaction rate that the NVLink/memory system can sustain, not raw bytes per second.

If the threads in a warp happen to access contiguous blocks of memory, the memory system can combine their loads into coalesced transactions, which end up being far more efficient.
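To illustrate the difference (this is a generic sketch, not code from the perftest):

```cuda
// Sketch: the per-thread index choice decides whether loads coalesce.
__global__ void access_patterns(const float *src, float *dst, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: adjacent threads in a warp touch adjacent 4-byte words,
    // so the hardware can merge them into a few wide transactions.
    dst[tid] = src[tid];

    // Strided (stride > 1): adjacent threads touch addresses `stride`
    // elements apart; each load may become its own transaction.
    dst[tid] = src[tid * stride];
}
```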

As a follow-up question: is there a reason you are using g for bandwidth-bound scenarios rather than getmem_[thread|block|warp]?
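For reference, a bulk get looks something like this (a hedged sketch; the buffer names and chunking scheme are illustrative, not from the perftest):

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Pull one contiguous chunk per block with a single bulk get, instead of
// one nvshmem_<type>_g call per element.
__global__ void bulk_get(double *data_d, size_t len, int peer) {
    // Split the symmetric buffer into one contiguous chunk per block.
    size_t chunk = len / gridDim.x;
    size_t off = (size_t)blockIdx.x * chunk;

    // All threads of the block cooperate on one large transfer, which
    // NVSHMEM can issue as wide, contiguous reads over NVLink.
    nvshmemx_getmem_block(data_d + off, data_d + off,
                          chunk * sizeof(double), peer);
}
```

With a bulk transfer like this, the achieved bandwidth should no longer depend on the element type.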

Thank you for your replies. I have fixed the missing newlines manually.

I am just running the nvshmem_g bandwidth test from the NVSHMEM open source package https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_cuda12-all , to learn about NVSHMEM’s performance characteristics. It seemed odd to me that the performance varies with datatype.

And yes, thank you for your explanations: this perftest has every thread issue an individual read request, with no software-level coalescing.

template <typename T>
__global__ void bw(T *data_d, volatile unsigned int *counter_d, int len, int pe, int iter,
                   int stride) {
    int u, i, j, peer, tid, slice;
    unsigned int counter;
    int threads = gridDim.x * blockDim.x;
    tid = blockIdx.x * blockDim.x + threadIdx.x;

    peer = !pe;
    slice = UNROLL * threads * stride;

    // When stride > 1, each iteration requests less than len elements.
    // We increase the number of iterations to make up for that.
    for (i = 0; i < iter * stride; i++) {
        for (j = 0; j < len - slice; j += slice) {
            for (u = 0; u < UNROLL; ++u) {
                int idx = j + u * threads + tid * stride;
                *(data_d + idx) = call_nvshmem_g<T>(data_d + idx, peer);
            }
        __syncthreads(); /* Required for performance over PCIe. PCIe has a P2P
                            mailbox protocol with a 64KB window for device BAR
                            addresses; not synchronizing across threads leads to
                            jumping in and out of the 64K window. */
        }

        for (u = 0; u < UNROLL; ++u) {
            int idx = j + u * threads + tid * stride;
            if (idx < len) *(data_d + idx) = call_nvshmem_g<T>(data_d + idx, peer);
        }

        ...
}